Compiler Architecture

Overview

Compilers perform translation. Every non-trivial translation requires analysis and synthesis

analsyn.png

Both analysis and synthesis are made up of internal phases.

Compiler Components

These are the main functional components of a production compiler that looks to generate assembly or machine language (if you are just targeting a high-level language like C, or a virtual machine, you might not have so many phases):

compilerphases.png

You might also identify an error recovery subsystem and a symbol table manager functional components, too.

Lexical Analysis (Scanning)

The scanner converts the source program's stream of characters into a stream of tokens, removing whitespace, removing comments and expanding macros along the way.

An example in C

#define ZERO 0
    unsigned  gcd(   unsigned   int  // Euclid's algorithm
      x,unsigned   y) {   while ( /* hello */  x>   ZERO
   ){unsigned temp=x;x=y   %x;y  = temp ;}return y ;}

gets tokenized into

unsigned ID(gcd) ( unsigned int ID(x) , unsigned y ) { while ( x > INTLIT(0) ) { unsigned ID(temp) = ID(x) ; ID(x) = ID(y) % ID(x) ; ID(y) = ID(temp) ; } return ID(y) ; }

Scanners are concerned with issues such as

Errors that might occur during scanning, called lexical errors include

Syntax Analysis (Parsing)

The parser turns the token sequence into an abstract syntax tree. For the example above, we get this tree:

gcdast1.png

The tree can also be stored as a string

(fundecl unsigned gcd
  (params (param unsigned x) (param unsigned y))
  (block
    (while
      (> x 0)
      (block (vardecl unsigned temp y) (= x (% y x)) (= y temp)))
    (return y)))

Or, each node in the tree is stored as an object with named fields, many of whose values are themselves nodes in the tree. Note that at this stage in the compilation, the tree is definitely just a tree. There are no cycles.

gcdast2.png

When constructing a parser, one needs to be concerned with grammar complexity (such as whether the grammar is LL or LR), and whether there are any disambiguation rules that might have to be hacked in. Some languages actually require a bit of semantic analysis to parse.

Exercise: Show that the expression (x)-y in C can have two different syntactic interpretations. Hint: Your answer will probably contain the words "subtraction", "typedef", "cast", and "negation".

Errors that might occur during parsing, called syntax errors include things like following, in C

Semantic Analysis

During semantic analysis we have to check legality rules and while doing so, we tie up the pieces of the syntax tree (by resolving identifier references, inserting cast operations for implicit coercions, etc.) to form a semantic graph.

Continuing the example above:

gcdsemgraph.png

Obviously, the set of legality rules is different for each language. Examples of legality rules you might see in a Java-like language include:

Exercise: Give examples of each of the above.

Errors occurring during this phase are called static semantic errors.

Exercise: The language Pascal has an unusual syntax when it comes to expressions: it gives the "and" operator (which requires boolean operands) higher precedence than relational operators! Show how this implies that the expression x-4<=5 and 2<y is a static semantic error.

Intermediate Code Generation

The intermediate code generator produces a flow graph made up of tuples grouped into basic blocks. For the example above, we'd see:

gcdflowgraph.png

You can read more about intermediate representations elsewhere.

Machine Independent Code Improvement

Code improvement that is done on the semantic graph or on the intermediate code is called machine independent code optimization. In practice there are zillions of known optimzations (er, improvements), but none really apply to our running example.

Code Generation

Code generation produces the actual target code, or something close. In traditional compilers targeting a RISC machine, the first code generation phase produces assembly language that assumes an infinite set of virtual registers. For CISC machines this probably won't be the case, but you never know. Here is code for our running example generated by gcc 3.4.2 for the x86:

	.file	"gcd.c"
	.text
.globl _gcd
	.def	_gcd;	.scl	2;	.type	32;	.endef
_gcd:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$4, %esp
L2:
	cmpl	$0, 8(%ebp)
	je	L3
	movl	8(%ebp), %eax
	movl	%eax, -4(%ebp)
	movl	12(%ebp), %eax
	movl	$0, %edx
	divl	8(%ebp)
	movl	%edx, 8(%ebp)
	movl	-4(%ebp), %eax
	movl	%eax, 12(%ebp)
	jmp	L2
L3:
	movl	12(%ebp), %eax
	leave
	ret

Machine Dependent Code Improvement

Usually the final phase in compilation is cleaning up and improving the target code. For the example above, gcc 3.4.2 came up with this:

	.file	"gcd.c"
	.text
	.p2align 4,,15
.globl _gcd
	.def	_gcd;	.scl	2;	.type	32;	.endef
_gcd:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ebx
	subl	$4, %esp
	movl	8(%ebp), %edx
	movl	12(%ebp), %ecx
	jmp	L9
	.p2align 4,,7
L11:
	movl	%edx, %ebx
	movl	%ecx, %eax
	xorl	%edx, %edx
	divl	%ebx
	movl	%ebx, %ecx
L9:
	testl	%edx, %edx
	jne	L11
	popl	%edx
	movl	%ecx, %eax
	popl	%ebx
	popl	%ebp
	ret