This page describes a compiler for the language Carlos. The idea is to show how a medium sized program is designed and implemented. Please download the source code. The compiler has a number of bugs that can be fixed for extra credit.
Before working with the compiler you should read and study the formal definition for the language Carlos.
Make sure your development box includes Maven, JavaCC, and JUnit. If you are using Eclipse, get the maven and javacc plugins (JUnit is already included with Eclipse).
If you are not using Eclipse, then, after downloading the source, run
mvn package
in the project root directory. You can then run a mini-Carlos IDE with:
java -cp target/carlos-1.0.jar edu.lmu.cs.xlg.carlos.Viewer
If you using Eclipse, you should be able to check out the project directly, since I have included a .project file which identifies the project as a Maven file. You might have to tweak some of your own preferences and settings, but this is to be expected since Eclipse is so customizable.
If you were writing this compiler from scratch, you would first come up with a list of things you need. This project requires:
We've set up this project the "right way" — the way that most Java projects in the world are arranged these days. The basic idea is that /src contains everything checked into version control, and /src/main contains everything bundled in the released product.

Here is a brief description of each of the directories in the project:
Start by browsing the pom file
Some code will be useful to all parts of the compiler. For example, we'd like to do some global error counting and logging, and have a consistent way to do internationalization. I have a little logger that can do these things:
Professional software is internationalized, so you'll want resource files. Here's the default resource file (with U.S. English):
There is one for Spanish:
For extra credit, I'll gladly accept versions of this file in other languages.
Carlos abstract syntax trees are made up of such things as programs, blocks, variables, types, functions, expressions, statements, and so on. The class hierarchy I came up with for Carlos is:
| Class | Attributes |
|---|---|
| Entity | — |
| Block | statements |
| Program | — |
| Declarable | name |
| Variable | typename, initializer |
| Type | — |
| ArrayType | baseTypeName |
| StructType | fields |
| Function | returnTypeName, parameters, body |
| Statement | — |
| Declaration | declarable |
| IncrementStatement | op, target |
| CallStatement | functionName, args |
| AssignmentStatement | left, right |
| BreakStatement | — |
| ReturnStatement | returnExpression |
| IfStatement | cases |
| WhileStatement | condition, body |
| ClassicForStatement | typeName, index, initial, final, each, body |
| Expression | — |
| Literal | lexeme |
| IntegerLiteral | value |
| RealLiteral | value |
| StringLiteral | vaue |
| CharLiteral | value |
| BooleanLiteral | value |
| NullLiteral | — |
| VariableExpression | — |
| SimpleVariableReference | name |
| CallExpression | functionName, args |
| SubscriptedVariable | array, index |
| DottedVariable | struct, fieldname |
| ArrayAggregate | typename, args |
| StructAggregate | typename, args |
| EmptyArray | typename, bounds |
| PrefixExpression | op, operand |
| PostfixExpression | op, operand |
| InfixExpression | op, left, right |
| StructField | name, typename |
| Case | condition, body |
We have a class for each entity. There is a field for
each syntactic attribute and the constructor fills in those fields.
The classes are in src/main/java/edu/lmu/cs/xlg/carlos/entities.
The src/main/java
directory will be the base directory for Java compilations, so the
package name for the entity classes will be
edu.lmu.cs.xlg.carlos.semantics.
Here is an example
package edu.lmu.cs.xlg.carlos.semantics;
/**
* A statement of the form "while (e) b".
*/
public class WhileStatement extends Statement {
Expression condition;
Block body;
/**
* Creates a while statement object.
*/
public WhileStatement(Expression condition, Block body) {
this.condition = condition;
this.body = body;
}
}
The attributes are package-private, not private, because all of the dozens of classes for entities all essentially exist for one purpose, which is to describe a Carlos program. Writing getters isn't really a pain, but these classes make a nice cohesive family that package-private access is justified.
The Carlos grammar is not LL(k) so in order to use JavaCC we had to tweak it a little.
The pattern used for parsing functions in the compiler is the one in which all parsing functions return an entity. For example:
Program parseProgram():
{List<Statement> statements = new ArrayList<Statement>(); Statement s;}
{
( s = parseStmt() {statements.add(s);} )+
<EOF>
{return new Program(statements);}
}
There are a couple places where you'll want to change the grammar: since increments and calls can be both expressions and statements, you'll want to split up parsing functions; for example, parseCallExpression and parseCallStatement.
My completed JJ file for Carlos is:
Semantic analysis turns the abstract syntax tree into a semantic graph
by resolving names (such as type names, variable names and function names)
into the actual objects they denote. A nice way to do this is to write
analyze() methods for each entity class. These methods
"fill in semantic fields" and do static semantic checking. See,
for example
It's easier to do semantic analysis if you have a good symbol table class. Here's mine:
Semantic analysis is a LOT of work. We spend several weeks on it in class.
Remember that unit testing involves making a non-interactive tester that can run on its own and test for hours. It should never rely on humans to "look at the output to see if each test is okay." Instead write code that processes zillions of inputs and does a lot of assertions. All modern Java systems use the JUnit framework for testing.
This tester works by picking up all the files named *.carlos
in the current directory and running them through
the front end. Filenames beginning with "synerror" are expected to
have syntax errors. Filenames beginning with "semerror" are expected
to have static semantic errors. All other files are expected to pass
semantic analysis with no errors. The tester counts the number of tests
and the number of failures and gives you a report.
Note that Maven by default runs tests.
The Carlos programs in the test suite are in src/test/resources. Remember the tester is really only a shell; the tests themselves are only thorough if you write hundreds of little Carlos apps.
During development, you will find yourself needing to "see" the syntactic and semantic structures output by the compiler. I wrote a very simple, but fairly nice, GUI tool for this:
This is in no way a substitute for the unit tests; it's just a neat learning tool.
I'm using an intermediate representation called Squid for this project. A standalone translator called CarlosToSquidTranslator produces a squid object from a semantic graph. Note that none of the code in the Carlos packages knows anything about Squid, or any other intermediate representation. The compiler is architected so that the translators are written completely separate from the Carlos classes; the translators simply read and traverse semantic graphs. Thus, many different intermediate code generators can be written without the need to ever mess with the front-end code at all.
The code generator translates intermediate code into target code; I wrote one to translate Squid objects into NASM files. The NASM code naturally has a lot of calls to supporting functions in a standard library, written in C.
Optimizations are done in several places throughout the compiler. But adding a few here and there are best done after everything else is done and working.
There are two classes in the compiler with main methods:
You can run the compiler from the command line by saying:
java -cp carlos-1.0.jar edu.lmu.cs.xlg.carlos.Compiler
though it would be easier perhaps to wrap this call in a shell script, something called carlos.bat on Windows, and just plain carlos everywhere else. For bash, your script would look like this:
java -cp carlos-1.0.jar edu.lmu.cs.xlg.carlos.Compiler $@
and for Windows
java -cp carlos-1.0.jar edu.lmu.cs.xlg.carlos.Compiler %1 %2
The Carlos compiler accepts several command line arguments:
carlos (display help)
carlos -syn myprogram (checks syntax only)
carlos -sem myprogram (checks semantics only)
carlos -asm myprogram (produces asm file only)
carlos -obj myprogram (produces object code only)
carlos -exe myprogram (produces executable)
To get the compiler to use Spanish, you can invoke (for example)
java -Duser.language=es -cp carlos-1.0.jar edu.lmu.cs.xlg.carlos.Compiler -asm myprogram
I'll leave to to you to figure out how to get your script to deal with that.
Here is a program with semantic errors (in the file bad.carlos):
int x; int x; x = y + z;
If we ask the compiler to check syntax only
which shows there are no syntax errors, but:
This one is okay (the file is hello.carlos):
printf("Hello");
The compiler produces, in hello.asm:
global _main
extern _Carlos__printf
section .text
_main:
push dword $s0
call _Carlos__printf
add esp, 4
ret
section .rodata
dd 5
$s0: dd 72, 101, 108, 108, 111
Look in the class edu.lmu.cs.xlg.carlos.Compiler — you'll see methods you can call to compile a Carlos program. If you would like to compile Carlos programs as part of a different application, just include carlos.jar and you're ready.