LMU | CMSI 488/588
COMPILER CONSTRUCTION
Practice Questions
  1. One of the freshmen tried, and failed, to write a grammar for the language L = {w ∈ {a,b}* | w has exactly twice as many a's as b's} and came up with:
        S → aab | aba | baa | aaSb | abSa | baSa | aSab | aSba | bSaa | SS
    
    1. Prove that aaabbbbaaaaa is not in the language generated by this grammar.
    2. Give a correct context-free grammar for L (and don't forget, like the freshman did, that the empty string belongs, too).
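
When experimenting with candidate grammars for L, a direct membership check on the defining property can be handy. The following is a minimal sketch; the helper name `in_L` is made up for illustration and is not part of the course materials. It tests only the definition of L and says nothing about what any particular grammar generates.

```python
def in_L(w: str) -> bool:
    """Return True if w is over {a, b} and has exactly twice as many a's as b's."""
    if set(w) - {"a", "b"}:
        return False  # contains a symbol outside the alphabet
    return w.count("a") == 2 * w.count("b")


if __name__ == "__main__":
    for candidate in ["", "aab", "ab", "aaabbbbaaaaa"]:
        print(repr(candidate), in_L(candidate))
```

Strings derived from a proposed grammar can be run through such a check as a quick sanity test; it cannot replace a proof.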
  2. We've seen the EBNF form A ^ B which denotes A | ABA | ABABA | .... Such a form makes it convenient to write rules involving separators, such as
        IDLIST → ID ^ ","
    

    This form can also be used to model a construct representing "one or more" A's, rather than using AA* or A*A. Show how to do this.
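
To make the meaning of A ^ B concrete, here is a small sketch that lists the first few strings the form denotes. The function name `caret_expansions` is hypothetical, and A and B are treated as plain strings rather than grammar symbols.

```python
def caret_expansions(a: str, b: str, count: int) -> list[str]:
    """List the first `count` strings denoted by A ^ B: A, ABA, ABABA, ..."""
    return [a + (b + a) * k for k in range(count)]


if __name__ == "__main__":
    # The IDLIST example: ID ^ ","
    print(caret_expansions("ID", ",", 3))  # ['ID', 'ID,ID', 'ID,ID,ID']
```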

  3. Consider an extension of Iki that features if and while statements, variable declarations, and a large set of arithmetic, relational and logical operators. Variables in this language may only have type integer, but expressions in if and while statements are to be Boolean-valued. Give a syntax for this language that enforces typing. (In other words, type errors are to be syntactic, rather than static semantic, errors.)
  4. The Ada language has slightly unusual precedence rules; unary negation is placed at the level of the additive operators, a couple of levels lower than other unary operators. The grammar fragment is:
        EXP  → EXP1 ('and' EXP1)*  |  EXP1 ('or' EXP1)*
        EXP1 → EXP2 (RELOP EXP2)?
        EXP2 → '-'? EXP3 (ADDOP EXP3)*
        EXP3 → EXP4 (MULOP EXP4)*
        EXP4 → EXP5 ('**'  EXP5)?  |  'not' EXP5  |  'abs' EXP5
    
    1. Show the abstract syntax tree for the expression -8 * 5
    2. Suppose the designers of Ada decided to remove the optional unary negation operator from EXP2 and instead add a clause to the EXP4 rule that said
          '-' EXP5
      
      Draw the abstract syntax tree for -8 * 5 under this supposition.
    3. Do you think one reason that the Ada designers crafted their unusual syntax had anything to do with the fact that Ada comments start with "--" and go to the end of the line? In other words, would the rule EXP4 → '-' EXP5 possibly confuse programmers attempting to write double negatives?
  5. Rewrite the syntactic clauses EXP, EXP1, EXP2, EXP3 and so on through EXP10 of Hana in the classic LR grammar style.
  6. Suppose a new computer called the X1234 has just come out and it doesn't have an Ada compiler. But you want to make a resident Ada compiler on that machine. Fortunately you have a resident Ada compiler that runs on a MIPS machine. Describe exactly how you can construct the desired resident Ada compiler for the X1234 using the one for the MIPS.
  7. C does not allow structures (i.e., non-atomic objects) to be tested for equality. Ada does. Maybe the designers of C wanted to keep things simple. How exactly would equality operations for structures complicate a C compiler or the runtime system?
  8. If possible, write a program in Modula 3 that makes a variable point to itself. That is, for some designator X, make it so that X^ = X. If this is not possible, state why it is not possible.
  9. If possible, write a program in Ada that makes a variable point to itself. That is, for some designator X, make it so that X.all = X. If this is not possible, state why it is not possible.
  10. If possible, show how to make an ML variable x of type x such that x.x = x, or state why this is impossible.
  11. If possible, show how to make a Hana variable x of type x such that x.x == x, or state why this is impossible.
  12. In C++ you can say (x += 7) *= z but you can't say this in C. Explain the reason why, using precise, technical terminology. See if this same phenomenon holds for conditional expressions, too. What other languages behave like C++ in this respect?
  13. Identify the following errors as syntactic, static semantic, or dynamic semantic (runtime). If no language is mentioned for a particular case, it probably does not matter; assume either C or Ada and state your assumption.
    1. Redeclaration of an identifier.
    2. Unbalanced parentheses.
    3. Applying an operator to an element of the wrong type.
    4. Array index out of bounds (in C, in Ada, ...).
    5. Division by zero.
    6. Semicolon after a block in C.
    7. Wrong number of arguments supplied to a call.
    8. Assignment of a variable of type T to a variable whose type is a subtype of T, where the value of the first variable is outside the range of the second, in Ada.
    9. An unwanted infinite loop.
    10. Dereference of a null pointer.
    11. Application of the "." to an identifier which is not a field of the record.
    12. Use of an uninitialized variable.
    13. for x (a) {printf("*"); x = x++;} in Hana.
  14. Draw the abstract syntax tree for the following C fragment:
        for (int i = x-3; q<=4&m.z[r |- 4]&2-8*r>- 5/~x;) {
            while (a) {
                y;
                2,y;
            }
        }
    
  15. Write an assembly language program that displays a multiplication table of size 12 × 12.
  16. Consider the continue statement of C.

    1. What kind of static semantic checks are required for this statement?
    2. Give an example piece of C code that has a continue statement in it, and show the intermediate and target code for it.
  17. Under what circumstances can you safely replace the x86 code fragment
          je    L6
          jmp   L4
    L6:
    

    with the single instruction jne L4?

  18. Show that the addressing modes immediate, absolute memory, and register indirect can be simulated by register and register-offset alone.
  19. Give grammars for the languages
    1. {w in {a,b,c}* | w has at most one occurrence of any symbol}
    2. {a^m b^n c^(m+n) | m,n >= 1 }
    3. Palindromes over {a, b}
    4. {a^m b^n | m >= n }
    5. Strings of parentheses, brackets and braces, all properly balanced and nested.
    6. Semicolon terminated statements
    7. Comma separated expressions
    8. Strings over {a, b, c, d, e} containing at most one occurrence of any symbol.
  20. Give grammars for the languages
    1. {a^n b^n c^n | n >= 0 }
    2. {a^i b^j c^k | i = j or j = k }
    3. {ww | w in {a,b}* }
  21. Give two reasons why compilers are often written in the language they compile.
  22. One way to implement a runtime system for a language with exceptions is to place two return addresses in an activation record. Sketch a small Ada or C++ function that can throw a (possibly user-defined) exception, and a code fragment that calls the function. Give a stack frame layout with two return addresses: the normal return address and the address of the handler in the caller. Show the assembly language for the caller and the function itself.
  23. Show the target code that is generated for the source statement X := Y; where X and Y are both one step down the static chain from the current subprogram, by a code generator which emits access code for the two values independently. Assume X is at offset –8 and Y is at offset –12. How many registers are used? Then generate code for this statement by hand, intelligently.

  24. Discuss advantages and disadvantages of a subprogram call implementation in which (a) the calling subprogram saves registers and (b) the called subprogram saves registers. Explain why the x86's C calling convention is a nice compromise.
  25. Some languages do not require the parameters to a subprogram call to be evaluated in any particular order. Is it possible that different evaluation orders can lead to different arguments being passed? If so, give an example to illustrate this point, and if not, prove that no such event could occur.

  26. Give three examples of how aliasing can occur (you can use examples from several different languages). How does aliasing make copy propagation difficult? When, if ever, can an algorithm determine that a semantic object cannot possibly be aliased?
  27. Ada allows subprograms to be objects, as in the following code fragment:
        type Real_To_Real is access function (Real) return Real;
        type Foo is access procedure (Integer; in out Boolean);
        Sine, Cosine: Real_To_Real;
        P: Foo;
        Q: Real_To_Real;
        function Integrate (F: Real_To_Real; A, B: Real) return Real;
        ...
        function Square (X: Real) return Real is
        begin
            return X * X;
        end;
        ...
        Put (Integrate(Square'Access, 3.0, 10.0));
        Q := Cosine;
        if Q(Pi)> X then ...
    

    Describe the semantic rules relating to this facility in Ada, and how you would enforce them in a compiler.

  28. Suppose the variable A was declared in an Ada program with type array (21..38) of String(1..10), and happened to have offset –42 in the frame of the subprogram in which it was declared. Suppose further that the variable J was declared in the same subprogram and had offset –26.

    1. Show the target code that loads the value of A(J-1) into register eax that would be generated naïvely. Do not forget to show the bounds checking!
    2. Show target code to load the value of A(J-1) into register eax in which the "-1" computation is "folded in" to the computation of the base address of A. Note that the bounds checking code will look a little different than in part 1.

  29. Can loop unrolling ever be unsafe? Why or why not?
  30. Write an assembly language program that takes zero or more command line arguments, which should all be integers, and displays the average of the parameters to standard output.
  31. Occasionally a compiler may output a sequence such as
        mov    [ebp-8], eax
        mov    eax, [ebp-8]
    

    The second instruction might be able to be removed. But it is easy to see that whether we are able to remove this instruction is undecidable. Why, exactly?

  32. It is a well-known irritation that Ada does not allow you to write array aggregates for zero- or one-element arrays, e.g., A := (3) gives a static semantic error when A is a one-element array of Integer. Why is this so? Propose a (trivial) syntactic extension to Ada that would remove this irritation.
  33. In a language that supports recursion, there may be multiple activations of a subprogram on the dynamic chain, and hence stack allocations of frames are generally used. However, subprograms that do not themselves make calls need not use stack frames. More generally, any subprogram that can never appear twice on a dynamic chain does not require a stack frame. Describe how to compute the set of all such subprograms at compile time.
  34. What exactly must be the case for a subprogram to not need a static link in its stack frame? Think up as many cases as possible.
  35. Give an abstract syntax tree for the following Java code fragment:
        if (x > 2 || !String.matches(f(x))) {
          write(-3 * q);
        } else if (! here || there) {
          do {
            while (close) tryHarder();
            x = x >>> 3 & 2 * x;
          } while (false);
          q[4].g(6) = person.list[2];
        } else {
          throw up;
        }
    
  36. The reachability problem is to determine for a given instruction, whether or not it might be executed for some run of the program. To optimize a program for space, we need to solve the reachability problem and remove all unreachable instructions. Show that this is impossible by reducing the halting problem to the reachability problem.
  37. In Ada, the declarations
        X: Integer := X + 1;
        Foo: Foo;
        Bar: Real := Bar(Foo);
    

    (where global declarations of X, Foo and Bar are visible) are all illegal, since a declaration of an identifier hides global declarations of the same name immediately at the point it appears in the text, but the identifier may not be used until its declaration is complete. Give an alternate interpretation under which these declarations would be legal and explain the advantages and disadvantages of it from both the programmer's and the compiler writer's perspectives.

  38. In C++ it is not permitted to have two functions that differ only in return type overload each other. In Ada it is allowed. What is the reason for this situation? Even though Ada does allow this flexibility in overloading, the compiler needs some sophistication. What exactly is involved? Be very precise in your explanation and illustrate it with code fragments.
  39. Many programming languages require that in order to have mutually recursive functions, the programmer first define one header (name, return types, parameters and parameter types), then the entire second function, then the entire first function. For example, in C++:
        int f(int x, char y);
        void g(int x) {if (x < 0) f(2, 'c');}
        int f(int x, char y) {g(randomInteger());}
    

    In C++, when f is finally declared, the names of the formal parameters don't have to be repeated exactly as they appeared in the incomplete specification. But in Ada they do. Explain why the Ada rule makes life much easier for the compiler writer.

  40. Here is a cool little functional language:
        PROGRAM →  (DECL ';')* EXPR
        DECL    →  'val' ID '=' EXPR
                | 'fun' ID '(' PARAMS? ')' '=' EXPR
        EXPR    →  NUMLIT | ID | UOP EXPR | EXPR BOP EXPR
                | EXPR '?' EXPR ':' EXPR |  ID '(' ARGS? ')' | '(' EXPR ')'
        PARAMS  →  ID (',' ID)*
        ARGS    →  EXPR (',' EXPR)*
        UOP     →  '-' | 'abs' | 'not'
        BOP     →  '+' | '-' | '*' | '/' | 'mod' | 'and' | 'or' | '==' | '<'
    
    1. Why is this called a functional language?
    2. Is the grammar ambiguous? Why or why not?
    3. Give a semantic object hierarchy for this language.
    4. Write a Greatest Common Divisor function in this language.
    5. Give three examples of syntax errors and three examples of static semantic errors in this language. Make sure to write down all your assumptions; I did not give you any semantics so you will have to make up something reasonable.

  41. Many languages have a syntax rule
        DESIGNATOR  →  DESIGNATOR  "."  ID
    

    for specifying variables made up from a record and a field of the record. But sometimes it can have the additional interpretation that the DESIGNATOR to the left of the dot was the name of a (visible) subprogram and the ID was an object declared immediately inside that subprogram. Show how to rearchitect the semantic object class hierarchy to support this.

  42. The x86 has an enter instruction which automatically makes a display. Research this instruction. Suppose a Carlos program had the following structure (indentation determines nesting):
        function f, parameters: [x,y], locals: [a]
            function g, parameters: [c], locals: [p,q,r,s]
                function h, parameters: [a], locals: []
            function k, parameters: [], locals: [z]
    
    1. Show what the runtime stack looks like from the call sequence f→g→k→g→h→h→f→k
    2. What does the generated assembly language look like when trying to access the value of f.x from h?
    3. Which parts of the Carlos compiler need to be rewritten to use this instruction?
  43. The ENTER instruction is rarely used because it is slow. Show how slow it is by doing the following. Prepare a table with four columns. The left column will be:
        enter n, 0
        enter n, 1
        enter n, 2
        enter n, 3
        ...
    

    and so on. The second column will be the number of clock cycles required on a Pentium for the particular ENTER instruction. The third column will be code equivalent to the ENTER instruction. For example, ENTER n, 1 is equivalent to:

        push ebp
        mov  ebp, esp
        push ebp
        sub  esp, n
    

    The fourth column will be the number of clocks for the code in column 3.

  44. Write JavaCC (TOKEN spec and parsing functions only) for
         A -> Ac | d
         B -> (a|b)*A | bba*c
    
  45. Show x86 code for the expression
        x / y > (3 * x) || z || x < 3
    

    where the "||" operator is short-circuit, and the variables x, y, and z are all integer variables. Put the value of the expression in eax. Write the best possible code you can for the Pentium 4 processor.

  46. In Ada, C, and C++ arrays and records (structs) can be allocated on the stack, not just on the heap. When making assignments of aggregates to variables, compilers usually generate code to deposit the values in temporary storage. Why is this necessary in general? After all, in
        Weekdays := Day_Set(False, True, True, True, True, True, False);
    

    we could construct the aggregate directly in the variable Weekdays. Give an example of an assignment statement that illustrates the necessity of constructing an aggregate in temporary storage (before copying to the target variable).

  47. Write an assembly language function to compute sin(log(x))/(y-7) where x and y are two double (64-bit float) parameters. Use the x86 C calling convention. Also write a C program that calls the function and displays the result.
  48. Write an assembly language function to compute the log base a of b, where a and b are two double (64-bit float) parameters. Use the x86 C calling convention. Write a C program for the unit tester (with at least 10 assert statements).
  49. Describe three techniques that can be used to make a symbol table capable of handling overloading of subprograms.
  50. Give highly optimized x86 code for the following:
        for j := 5 to y do
            y := j * 7 + c;
            printInteger(y - 4);
        end loop;
    

    where y and c are local variables in the current procedure at offsets -12 and +16 respectively. (Remember that the range is evaluated only once, the whole loop is skipped on the empty range, etc.) Make sure you respect the overflow semantics! Identify any induction expressions and explain how you optimized them. Compare your hand-written code with that generated by a real compiler.

  51. Suppose we wanted to enhance Hana to support separate compilation; that is, top-level procedures and functions could be defined in separate files, compiled separately and linked together. From a target-code perspective this addition is rather trivial since we have PUBLIC and EXTERN directives in assembly language. The hard work is in enforcing the static semantics of calls across separate files; this requires that subprogram specifications (or "prototypes") be part of the language, and perhaps other devices for accessing types and variables defined in other files as well. Describe an approach to this enhancement, detailing grammar changes, changes to the parser specification, and changes to the entity classes and to the semantic analyzer.
  52. Write an x86 assembly language program that sets every third byte of the three megabyte section of memory starting at address b. Use the MMX registers.
  53. Why do language designers put functions like sqrt into a standard library?
  54. Write an assembly language version of the following, using an LEA instruction for the 3n+1 computation:
        int C(int n) {
            int count = 0;
            while (n != 1) {
                n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
                count++;
            }
            return count;
        }
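
A reference implementation in a high-level language can help when checking the assembly version. The sketch below assumes the function is meant to return the number of loop iterations (that is, that `count` is incremented once per step); the name `collatz_steps` is made up for illustration.

```python
def collatz_steps(n: int) -> int:
    """Count the steps of the 3n+1 process needed to reach 1 from n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    count = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        count += 1
    return count


if __name__ == "__main__":
    for start in (1, 6, 7, 27):
        print(start, collatz_steps(start))
```

Unlike the C version, this sketch uses unbounded integers, so it does not model 32-bit overflow of 3n+1.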
    
  55. Generate code for the following basic block, using register tracking:
        y := x * 4 + z;
        z := p * y;
        y := z;
        x := z / y << x;
    
  56. Explain why including the return type of a function in the criteria for distinguishing functions for the purpose of overloading would greatly increase the complexity of a Hana compiler.
  57. Describe, in English, the languages expressed by these regular expressions.
    1. [01]*(10111[01] | 11[01][01][01][01])[01]*
    2. ([bc]*a[bc]*a[bc]*)*
    3. 0*1 | 0*10
    4. c*a[ac]*b[abc]*
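
One way to build intuition before writing the English descriptions is to try sample strings against each expression. This is a minimal sketch using Python's `re` module; it assumes the spaces around `|` in the expressions above are not significant, and the names `PATTERNS` and `matches` are made up for illustration.

```python
import re

# The four expressions in Python syntax, keyed by sub-question number.
PATTERNS = {
    1: r"[01]*(10111[01]|11[01][01][01][01])[01]*",
    2: r"([bc]*a[bc]*a[bc]*)*",
    3: r"0*1|0*10",
    4: r"c*a[ac]*b[abc]*",
}


def matches(number: int, s: str) -> bool:
    """Return True if the whole string s is matched by expression `number`."""
    return re.fullmatch(PATTERNS[number], s) is not None


if __name__ == "__main__":
    for number, sample in [(3, "0001"), (3, "11"), (4, "cab"), (2, "bab")]:
        print(number, repr(sample), matches(number, sample))
```

`re.fullmatch` is used rather than `re.search` so that the entire string must belong to the language, not just some substring of it.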
  58. Show both naive and optimized intermediate code (entity graph), and both naive and optimized assembly language for:
        if (x % 4096 == 0) {printf("Don't say \66;\6f;\6f;!");}
    

    Hint: you need strength reduction, too.

  59. One kind of strength reduction is replacing division by a power of two with an arithmetic right shift, for example
        sar eax, 10          to divide by 1024
        sar eax, 8           to divide by 256
    

    This optimization is not safe. Explain why. Show how to make it safe, and explain both why your optimization works and why it is safe.
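
A quick experiment can guide the investigation. The sketch below compares an arithmetic right shift with truncating division on a few operands. It assumes the division being replaced truncates toward zero, as signed integer division does in C and with the x86 idiv instruction; the helper name `trunc_div` is made up for illustration. Python's `>>` on integers is an arithmetic shift, so it stands in for `sar` here.

```python
def trunc_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero (assumed C-style semantics)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


if __name__ == "__main__":
    print("x, x >> 10, trunc_div(x, 1024)")
    for x in (2048, 1023, -1, -1024, -1025):
        print(x, x >> 10, trunc_div(x, 1024))
```

Looking at which rows agree and which do not is a starting point for both parts of the question.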

  60. Write a NASM function that takes in four doubles and returns the product of the largest and the smallest argument. Assume the function will be called from a C program built under gcc running on a Pentium II or above. Note that you need to respect the calling convention. Do not use conditional jumps in your code.
  61. Here is a small expression language
        EXP     →  EXP  EXP  OP  | INTLIT
        OP      →  +  |  -  |  *  |  /
    
    1. What language is this?
    2. Is the grammar ambiguous? Why or why not?
    3. Is it LL(k) for any k? If so, for which k? If not, why not?
    4. Give a class hierarchy of entities for this language.
    5. Give an attribute grammar for this language that can be used to evaluate expressions.
  62. Write an assembly language function that returns the dot product of two single-precision floating point arrays using the XMM registers. Implement a unit tester in C.
  63. Which of the following expressions are legal in Java (assuming x and y are integer variables)? State why they are legal or why they are not.
    1. x---y
    2. x-----y
  64. Draw the abstract syntax tree for the following Java compilation unit. (Make sure it is fairly abstract):
    package p;
    class C implements A {
        public static A x = new   t[3];
        Socket s () {
            while (x - 6>p |    e || q +- p) {
                this.x[3] = !v+++t;
            }
        }
        {System.out.println("ooh");}
    }
    
  65. Classify the following as a syntax error, semantic error, or "not a compile time error at all". In the case where code is given, assume all identifiers are properly declared and in scope. All items refer to the Java language.

    1. x+++-y
    2. x---+y
    3. incrementing a read-only variable
    4. accessing a private field in another class
    5. Using an uninitialized variable
    6. Dereferencing a null reference
    7. null instanceof C
    8. !!x
  66. EBNF generally uses

    Suppose I wanted to add a new one:

    Show how to write A # B # C using only the conventional EBNF markup.

  67. Write the following in assembly language (use the C calling convention). It is supposed to compute a*log10(b). Use the fyl2x and fldl2t instructions.
        double f(double a, double b);
    
  68. Write a small C++ function or Java method that sums up an array of ints and throws an exception if the sum is odd. If the sum is even return true. In the catch clause for the exception, return false. Run timing studies to show the real cost of the exception. For example, call the method a million times for each possible return value, and report the aggregate CPU time for all the runs returning false and for those returning true.
  69. Write regular expressions for
    1. Octal constants in C
    2. Unsigned binary numbers divisible by 8
    3. Hexadecimal numerals divisible by 8 (signed or unsigned!)
    4. Floating point constants that are not allowed to have an empty fractional part and can have no more than three digits in the exponent part
    5. the set of all character strings that contains neither the substring "exit" nor "exec"
  70. Optimize the following. "Show your work" (that is, show a few intermediate steps toward your final solution, recording the optimizations you performed). You can abbreviate CP=copy propagation, CF=constant folding, DCE=dead code elimination. You'll want to use more than just these three techniques.
        L1:
            r0 := x
            z := 6
            r1 := 4 - r0
            r2 := 3 >= r1
            if r2 == 0 goto L2
            r3 := y + 4
            r4 := *r3
            z := r4
        L2:
    
  71. Write a NASM assembly language function that returns the sum of the reciprocals of all the elements in an array of doubles. Use the C calling convention, so the function accepts the array and a length. You should have a copy of my function for summing up an array, so this should be pretty easy.
  72. Classify the following as (a) lexical error, (b) syntax error, (c) static semantic error, (d) dynamic semantic error, or (e) no error.

    1. A function call with no matching signature in Hana.
    2. A function call with no matching signature in C.
    3. x < y < z in Hana, where x and y are ints and z is a boolean.
    4. x < y < z in C, where x and y and z are all ints.
    5. 3[a] in Hana, where a is an array variable.
    6. 3[a] in C, where a is an array variable.
    7. char x = '\a'; in Hana.
    8. char x = '\a'; in C.
    9. Value returning function without a return statement, in Hana.
    10. Value returning function without a return statement, in C.
    11. Semicolon after a block, in Hana.
    12. Semicolon after a block, in C.
  73. Show a Hana abstract syntax tree for:
    struct s {boolean x; s y;}
    s a(int p, ...) {
        s[] x = new s[]{null, null};
        print($s.p[a] + substring("ff", #s << p|-2));
    }
    
  74. Suppose we added to Hana a simple exception facility, like that of C++, in which anything can be thrown or caught. This facility doesn't have a finally clause, like Java does, but it does make use of the "..." symbol for a catch-all clause which must be the last catch clause, if it exists. Show changes to the Hana lexical specification, grammar, and semantic rules to support this facility.
  75. Write JavaCC specs for a parser that recognizes the language of this grammar:
        G -> (S s G)?
        S -> V q e | i f E g | V x
        V -> i | V d i | V a E a
        E -> n | V
    
  76. Suppose we changed the definition of Hana so that no identifier could be three characters long and end with "oo" (or "oO" or "Oo" or "OO").
    1. Write a regex for alphanumeric strings beginning with a letter that are not three characters long ending case-insensitively with "oo".
    2. Show how to modify the Hana JavaCC specification to disallow these offensive strings as identifiers, without changing the token specification for ID.
    3. Show how to modify the Hana JavaCC specification to disallow these offensive strings as identifiers, by only changing the token specification for ID.
  77. Write a Java method or Perl subroutine that returns whether its input string is a three character alphanumeric string ending, case insensitively, in "oo". Do this by matching against a regular expression.
  78. Write, by hand, a super-efficient Squid fragment for the following Hana fragment:
        struct s {int x; int y; string s;}
        s a = new s {
            codepoint(getChar()), codepoint(getChar()), getString()};
        while (a.x++ < a.y) {print($a.s[1]);}
    
  79. What does this code do? For what ranges of n does it make sense?

        mov eax, n
        shl eax, 23
        add eax, 3f800000h
        mov [esp-4], eax
        fld dword [esp-4]
    
  80. Is this grammar an LL grammar?
        A -> B C
        B -> a | b?c?
        C -> c | BA
    

    (If you are having trouble with this, and you have time, you can write a JavaCC specification and "try it out".) If you find that this grammar is not LL, make one that is (that defines the same language of course).

  81. We've seen that one way to deal with ugly code in curly brace languages is to require blocks in compound statements; for example:
        IFSTMT -> 'if' '(' EXP ')' BLOCK
                  ('else' 'if' '(' EXP ')' BLOCK)*
                  ('else' BLOCK)?
        BLOCK -> '{' STMT* '}'
    
    What if we tried the same approach in a language with a syntax like Ruby (or Fortran or Modula — languages using a terminating end)? We might get a grammar like this:
        IFSTMT -> 'if' EXP 'then' STMT+
                  ('else' 'if'  EXP 'then' STMT+)*
                  ('else' STMT+)?
                  'end'
    
    Is this grammar left recursive? Is it LL(k)? Why or why not?