A Compiler for Carlos

Overview

This page describes a compiler for the language Carlos. The idea is to show how a medium sized program is designed and implemented. Please download the source code. The compiler has a number of bugs that can be fixed for extra credit.

Getting Started

Before working with the compiler you should read and study the formal definition for the language Carlos.

Make sure your development box includes Maven, JavaCC, and JUnit. If you are using Eclipse, get the maven and javacc plugins (JUnit is already included with Eclipse).

If you are not using Eclipse, then, after downloading the source, run

mvn package

in the project root directory. You can then run a mini-Carlos IDE with:

java -cp target/carlos-1.0.jar edu.lmu.cs.xlg.carlos.Viewer

If you using Eclipse, you should be able to check out the project directly, since I have included a .project file which identifies the project as a Maven file. You might have to tweak some of your own preferences and settings, but this is to be expected since Eclipse is so customizable.

Design

If you were writing this compiler from scratch, you would first come up with a list of things you need. This project requires:

Architectural Overview

Project setup

We've set up this project the "right way" — the way that most Java projects in the world are arranged these days. The basic idea is that /src contains everything checked into version control, and /src/main contains everything bundled in the released product.

carlosdirectories.png

Here is a brief description of each of the directories in the project:

/ (project root, not the file system root)
Contains only the Maven project file. If you are using Eclipse, you will see the files .project and .classpath there too. If you are using CVS, then you should put a .cvsignore in here so that the target directory doesn't get version-controlled.
src
Ancestor directory for all source code that makes up the project.
src/main
Ancestor directory for all source code that will be compiled into the distrubted product.
src/main/java
Ancestor directory for all main java code.
src/main/java/edu/lmu/cs/xlg/carlos
Sources for the code specific to the Carlos langauge and compiler go in this directory tree. The carlos directory itself contains applications such as the command-line based compiler and a graphical code viewer.
src/main/java/edu/lmu/cs/xlg/carlos/entities
Classes for all of the syntactically and semantically significant Carlos objects, such as statements, expressions, blocks, types, functions, ... These entities understand their own semantics (meaning they can take care of their own semantic analysis), but they know absolutely nothing regarding what they will be translated into. Translation is fully non-intrusive.
src/main/java/edu/lmu/cs/xlg/squid
Specification of a quadruple-based intermediate language, with some tools for pretty-printing and optimizing.
src/main/java/edu/lmu/cs/xlg/nasm
Specification of the IA-32 architecture and the NASM assembly language, used by code generators.
src/main/java/edu/lmu/cs/xlg/translators
A package for translations between languages. Each of the language packages above (carlos, squid, nasm) are completely unaware of the existence of each other. The only connection between the different "languages" exist within translator objects. In my example compiler, I have two translators, one taking a Carlos semantic graph into a squid module, and one for mapping a squid module into a NASM program. Of course, one could write millions more.
src/main/java/edu/lmu/cs/xlg/util
Classes used by all of the other packages, such as a smart logger and an id generator.
src/main/javacc
All javacc scripts go here. We only have one, Carlos.jj
src/main/resources
All resource files that need to be present at runtime. This includes properties files (for internationalization).
src/test
Ancestor directory for unit tests. Only things under src/main will get compiled and/or placed in the distributed product. Anything under src/test will not.
src/test/java
Ancestor directory for unit tests written in Java.
src/test/resources
Non-Java files that have to be around when tests run. This includes the dozens of Carlos programs used in testing.
target
Ancestor directory that holds all built artifacts. The target directory is never checked into version control. Doing a "clean build" begins with completely deleting this directory and all descendants.
target/generated-sources
Holds all the Java files generated from running javacc on the Carlos.jj file.
target/classes
Holds all the compiled class files.

The POM

Start by browsing the pom file

Utility classes and resources

Some code will be useful to all parts of the compiler. For example, we'd like to do some global error counting and logging, and have a consistent way to do internationalization. I have a little logger that can do these things:

Professional software is internationalized, so you'll want resource files. Here's the default resource file (with U.S. English):

There is one for Spanish:

For extra credit, I'll gladly accept versions of this file in other languages.

Syntax Tree Entities

Carlos abstract syntax trees are made up of such things as programs, blocks, variables, types, functions, expressions, statements, and so on. The class hierarchy I came up with for Carlos is:

ClassAttributes
Entity
Blockstatements
Program
Declarablename
Variabletypename, initializer
Type
ArrayTypebaseTypeName
StructTypefields
FunctionreturnTypeName, parameters, body
Statement
Declarationdeclarable
IncrementStatementop, target
CallStatementfunctionName, args
AssignmentStatementleft, right
BreakStatement
ReturnStatementreturnExpression
IfStatementcases
WhileStatementcondition, body
ClassicForStatementtypeName, index, initial, final, each, body
Expression
Literallexeme
IntegerLiteralvalue
RealLiteralvalue
StringLiteralvaue
CharLiteralvalue
BooleanLiteralvalue
NullLiteral
VariableExpression
SimpleVariableReferencename
CallExpressionfunctionName, args
SubscriptedVariablearray, index
DottedVariablestruct, fieldname
ArrayAggregatetypename, args
StructAggregatetypename, args
EmptyArraytypename, bounds
PrefixExpressionop, operand
PostfixExpressionop, operand
InfixExpressionop, left, right
StructFieldname, typename
Casecondition, body

The "Syntax Portion" of the Entity Classes

We have a class for each entity. There is a field for each syntactic attribute and the constructor fills in those fields. The classes are in src/main/java/edu/lmu/cs/xlg/carlos/entities. The src/main/java directory will be the base directory for Java compilations, so the package name for the entity classes will be edu.lmu.cs.xlg.carlos.semantics.

Here is an example

package edu.lmu.cs.xlg.carlos.semantics;

/**
 * A statement of the form "while (e) b".
 */
public class WhileStatement extends Statement {

    Expression condition;
    Block body;

    /**
     * Creates a while statement object.
     */
    public WhileStatement(Expression condition, Block body) {
        this.condition = condition;
        this.body = body;
    }
}

The attributes are package-private, not private, because all of the dozens of classes for entities all essentially exist for one purpose, which is to describe a Carlos program. Writing getters isn't really a pain, but these classes make a nice cohesive family that package-private access is justified.

The Abstract Syntax Tree Generator

The Carlos grammar is not LL(k) so in order to use JavaCC we had to tweak it a little.

The pattern used for parsing functions in the compiler is the one in which all parsing functions return an entity. For example:

Program parseProgram():
  {List<Statement> statements = new ArrayList<Statement>(); Statement s;}
{
  ( s = parseStmt() {statements.add(s);} )+
  <EOF>
  {return new Program(statements);}
}

There are a couple places where you'll want to change the grammar: since increments and calls can be both expressions and statements, you'll want to split up parsing functions; for example, parseCallExpression and parseCallStatement.

My completed JJ file for Carlos is:

Semantic Analysis

Semantic analysis turns the abstract syntax tree into a semantic graph by resolving names (such as type names, variable names and function names) into the actual objects they denote. A nice way to do this is to write analyze() methods for each entity class. These methods "fill in semantic fields" and do static semantic checking. See, for example

It's easier to do semantic analysis if you have a good symbol table class. Here's mine:

Semantic analysis is a LOT of work. We spend several weeks on it in class.

Front End Testing

Remember that unit testing involves making a non-interactive tester that can run on its own and test for hours. It should never rely on humans to "look at the output to see if each test is okay." Instead write code that processes zillions of inputs and does a lot of assertions. All modern Java systems use the JUnit framework for testing.

This tester works by picking up all the files named *.carlos in the current directory and running them through the front end. Filenames beginning with "synerror" are expected to have syntax errors. Filenames beginning with "semerror" are expected to have static semantic errors. All other files are expected to pass semantic analysis with no errors. The tester counts the number of tests and the number of failures and gives you a report.

Note that Maven by default runs tests.

The Carlos programs in the test suite are in src/test/resources. Remember the tester is really only a shell; the tests themselves are only thorough if you write hundreds of little Carlos apps.

Tools

During development, you will find yourself needing to "see" the syntactic and semantic structures output by the compiler. I wrote a very simple, but fairly nice, GUI tool for this:

This is in no way a substitute for the unit tests; it's just a neat learning tool.

Intermediate Code Generation

I'm using an intermediate representation called Squid for this project. A standalone translator called CarlosToSquidTranslator produces a squid object from a semantic graph. Note that none of the code in the Carlos packages knows anything about Squid, or any other intermediate representation. The compiler is architected so that the translators are written completely separate from the Carlos classes; the translators simply read and traverse semantic graphs. Thus, many different intermediate code generators can be written without the need to ever mess with the front-end code at all.

Code Generator and Run Time Library

The code generator translates intermediate code into target code; I wrote one to translate Squid objects into NASM files. The NASM code naturally has a lot of calls to supporting functions in a standard library, written in C.

Optimizations

Optimizations are done in several places throughout the compiler. But adding a few here and there are best done after everything else is done and working.

Running The Compiler

There are two classes in the compiler with main methods:

Running from the command line.

You can run the compiler from the command line by saying:

java -cp carlos-1.0.jar edu.lmu.cs.xlg.carlos.Compiler

though it would be easier perhaps to wrap this call in a shell script, something called carlos.bat on Windows, and just plain carlos everywhere else. For bash, your script would look like this:

java -cp carlos-1.0.jar edu.lmu.cs.xlg.carlos.Compiler $@

and for Windows

java -cp carlos-1.0.jar edu.lmu.cs.xlg.carlos.Compiler %1 %2

The Carlos compiler accepts several command line arguments:

    carlos                (display help)
    carlos -syn myprogram (checks syntax only)
    carlos -sem myprogram (checks semantics only)
    carlos -asm myprogram (produces asm file only)
    carlos -obj myprogram (produces object code only)
    carlos -exe myprogram (produces executable)

To get the compiler to use Spanish, you can invoke (for example)

java -Duser.language=es -cp carlos-1.0.jar edu.lmu.cs.xlg.carlos.Compiler -asm myprogram

I'll leave to to you to figure out how to get your script to deal with that.

Here is a program with semantic errors (in the file bad.carlos):

int x;
int x;
x = y + z;

If we ask the compiler to check syntax only

$ carlos -syn bad Checking syntax

which shows there are no syntax errors, but:

$ carlos -sem bad Checking syntax Checking semantics Redeclaration of x Variable y not found Variable z not found Arithmetic expression expected for "+" Arithmetic expression expected for "+"

This one is okay (the file is hello.carlos):

printf("Hello");
$ carlos -asm hello Checking syntax Checking semantics Generating assembly language

The compiler produces, in hello.asm:

        global  _main
        extern  _Carlos__printf

        section .text
_main:
        push    dword $s0
        call    _Carlos__printf
        add     esp, 4
        ret

        section .rodata
        dd      5
$s0:    dd      72, 101, 108, 108, 111

Running from another Java program

Look in the class edu.lmu.cs.xlg.carlos.Compiler — you'll see methods you can call to compile a Carlos program. If you would like to compile Carlos programs as part of a different application, just include carlos.jar and you're ready.