- One of the freshmen tried, and failed,
to write a grammar for the language
L = {w ∈ {a,b}* | w has exactly twice as many a's as b's} and
came up with:
S → aab | aba | baa | aaSb | abSa | baSa | aSab | aSba | bSaa | SS
- Prove that aaabbbbaaaaa is not in the language generated
by this grammar.
- Give a correct context-free grammar for L (and don't forget, as
the freshman did, that the empty string belongs to L, too).
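When experimenting with candidate grammars for L, a direct membership test for L itself can be handy. The following is a minimal Python sketch, not part of the assignment; the function name in_L is made up for illustration.

```python
def in_L(w: str) -> bool:
    """True when w is a string over {a, b} with exactly twice as many a's as b's."""
    if any(ch not in "ab" for ch in w):
        return False  # symbols outside the alphabet are rejected
    return w.count("a") == 2 * w.count("b")


if __name__ == "__main__":
    print(in_L(""))              # True: zero a's, zero b's
    print(in_L("aab"))           # True: 2 a's, 1 b
    print(in_L("aaabbbbaaaaa"))  # True: 8 a's, 4 b's
    print(in_L("ab"))            # False: 1 a, 1 b
```

This only tests membership in L. It says nothing about which strings a particular grammar can derive, so the proof asked for above is still needed.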
- We've seen the EBNF form A ^ B which
denotes A | ABA | ABABA | ....
Such a form makes it convenient to write rules involving
separators, such as
IDLIST → ID ^ ","
This form can also be used to model a construct representing
"one or more" A's, rather than using AA* or
A*A. Show how to do this.
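To see the separator form in action, here is a small Python sketch that expands A ^ B into the equivalent regular expression A (B A)* and applies it to IDLIST. The helper names (sep_form, is_idlist) and the identifier pattern are assumptions made for illustration, and whitespace between tokens is not handled.

```python
import re


def sep_form(a: str, b: str) -> str:
    """Regex equivalent of the EBNF form  a ^ b,  that is,  a (b a)*."""
    return f"(?:{a})(?:(?:{b})(?:{a}))*"


# Assumed shape of an ID token: a letter or underscore, then letters, digits, underscores.
ID = r"[A-Za-z_][A-Za-z0-9_]*"
IDLIST = re.compile(sep_form(ID, ","))


def is_idlist(s: str) -> bool:
    """True when the whole string s matches IDLIST -> ID ^ ","."""
    return IDLIST.fullmatch(s) is not None


if __name__ == "__main__":
    for text in ["x", "x,y,z", "x,", ",x", ""]:
        print(repr(text), is_idlist(text))
```

The point of the form is visible in the rejected cases: a leading or trailing separator never matches, because every B must sit between two A's.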
- Consider an extension of Iki
that features if and while statements,
variable declarations, and a large set of arithmetic, relational
and logical operators. Variables in this language may only have
type integer, but expressions in
if and while statements
are to be Boolean-valued. Give a syntax for this
language that enforces typing. (In other words, type errors are to
be syntactic, rather than static semantic, errors.)
- The Ada language has slightly unusual precedence rules;
unary negation is placed at the level of the additive operators,
a couple of levels lower than other unary operators. The
grammar fragment is:
EXP → EXP1 ('and' EXP1)* | EXP1 ('or' EXP1)*
EXP1 → EXP2 (RELOP EXP2)?
EXP2 → '-'? EXP3 (ADDOP EXP3)*
EXP3 → EXP4 (MULOP EXP4)*
EXP4 → EXP5 ('**' EXP5)? | 'not' EXP5 | 'abs' EXP5
- Show the abstract syntax tree for the expression -8 * 5
- Suppose the designers of Ada decided to remove the optional
unary negation operator from EXP2 and instead add a clause to
the EXP4 rule that said
'-' EXP5
Draw the abstract syntax tree for -8 * 5 under this
supposition.
- Do you think one reason that the Ada designers crafted
their unusual syntax had anything to do with the fact that
Ada comments start with "--" and go to the end of the
line? In other words, would the rule EXP4 → '-' EXP5
possibly confuse programmers attempting to write double negatives?
- Rewrite the syntactic clauses EXP, EXP1, EXP2, EXP3 and so on
through EXP10 of Hana in the classic LR grammar style.
- Suppose a new computer called the X1234 has just come out and it
doesn't have an Ada compiler. But you want to make a resident Ada
compiler on that machine. Fortunately you have a resident Ada compiler
that runs on a MIPS machine. Describe exactly how you can construct
the desired resident Ada compiler for the X1234 using the one for
the MIPS.
- C does not allow structures (i.e., non-atomic objects)
to be tested for equality. Ada does. Maybe the designers of
C wanted to keep things simple. How exactly would
equality operations for structures complicate a C
compiler or the runtime system?
- If possible, write a program in Modula 3 that makes a variable point
to itself. That is, for some designator X, make it so that
X^ = X. If this is not possible, state why it is not possible.
- If possible, write a program in Ada that makes a variable point
to itself. That is, for some designator X, make it so
that X.all = X. If this is not possible, state why it is not
possible.
- If possible, show how to make an ML variable x of
type x such that x.x = x, or state why this is
impossible.
- If possible, show how to make a Hana variable x of
type x such that x.x == x, or state why this is
impossible.
- In C++ you can say (x += 7) *= z but you can't say this
in C. Explain the reason why, using precise, technical terminology.
See if this same phenomenon holds for conditional expressions, too.
What other languages behave like C++ in this respect?
- Identify the following errors as syntactic, static semantic, or
dynamic semantic (runtime). If no language is mentioned for a
particular case, it probably does not matter; assume either
C or Ada and state your assumption.
- Redeclaration of an identifier.
- Unbalanced parentheses.
- Applying an operator to an element of the wrong type.
- Array index out of bounds (in C, in Ada, ...).
- Division by zero.
- Semicolon after a block in C.
- Wrong number of arguments supplied to a call.
- Assignment of a variable of type T to a variable whose type is a subtype
of T, where the value of the first variable is outside the range of the second, in Ada.
- An unwanted infinite loop.
- Dereference of a null pointer.
- Application of the "." operator to an identifier that is not a field of the
record.
- Use of an uninitialized variable.
- for x (a) {printf("*"); x = x++;} in Hana.
- Draw the abstract syntax tree for the following C fragment:
for (int i = x-3; q<=4&m.z[r |- 4]&2-8*r>- 5/~x;) {
    while (a) {
        y;
        2,y;
    }
}
- Write an assembly language program that displays a multiplication
table of size 12 × 12.
- Consider the continue statement of C.
- What kind of static semantic checks are required for this
statement?
- Give an example piece of C code that has a continue statement in it,
and show the intermediate and target code for it.
- Under what circumstances can you safely replace the x86 code fragment
je L6
jmp L4
L6:
with the single instruction jne L4?
- Show that the addressing modes immediate, absolute memory, and
register indirect can be simulated by register and register-offset
alone.
- Give grammars for the languages
- {w in {a,b,c}* | w has at most one occurrence of any symbol}
- {a^m b^n c^(m+n) | m,n >= 1 }
- Palindromes over {a, b}
- {a^m b^n | m >= n }
- Strings of parentheses, brackets and braces, all properly balanced and nested.
- Semicolon terminated statements
- Comma separated expressions
- Strings over {a, b, c, d, e} containing at most one
occurrence of any symbol.
- Give grammars for the languages
- {a^n b^n c^n | n >= 0 }
- {a^i b^j c^k | i = j or j = k }
- {ww | w in {a,b}* }
- Give two reasons why compilers are often written in the language
they compile.
- One way to implement a runtime system for a language with exceptions
is to place two return addresses in an activation record. Sketch
a small Ada or C++ function that can throw a (possibly user-defined)
exception, and a code fragment that calls the function. Give a
stack frame layout with two return addresses: one is the normal
return address and the other is the address of the handler in the
caller. Show the assembly language for the caller and the function
itself.
- Show the target code that is generated for the source statement
X := Y; where X and Y are both one step
down the static chain from the current subprogram, by a code
generator which emits access code for the two values
independently. Assume X is at offset -8 and Y is at offset -12.
How many registers are used? Then generate code for this statement
by hand, intelligently.
- Discuss advantages and disadvantages of a subprogram
call implementation in which (a) the calling subprogram saves
registers and (b) the called subprogram saves registers.
Explain why the x86's C calling convention is a nice
compromise.
- Some languages do not require the parameters to a
subprogram call to be evaluated in any particular order. Is it
possible that different evaluation orders can lead to different
arguments being passed? If so, give an example to illustrate
this point, and if not, prove that no such event could occur.
- Give three examples of how aliasing can occur (you
can use examples from several different languages). How does
aliasing make copy propagation difficult?
When, if ever, can an algorithm determine that a semantic
object cannot possibly be aliased?
- Ada allows subprograms to be objects, as in the following
code fragment:
type Real_To_Real is access function (Real) return Real;
type Foo is access procedure (Integer; in out Boolean);
Sine, Cosine: Real_To_Real;
P: Foo;
Q: Real_To_Real;
function Integrate (F: Real_To_Real; A, B: Real) return Real;
...
function Square (X: Real) return Real is
begin
    return X * X;
end;
...
Put (Integrate(Square'Access, 3, 10));
Q := Cosine;
if Q(Pi) > X then ...
Describe the semantic rules relating to this facility in
Ada, and how you would enforce them in a compiler.
- Suppose the variable A was declared in an Ada program with
type array (21..38) of String(1..10), and happened to have
offset -42 in the frame of the subprogram in which it was declared.
Suppose further that the variable J was declared in the same subprogram
and had offset -26.
- Show the target code that loads the value of A(J-1) into
register eax that would be generated naïvely. Do not
forget to show the bounds checking!
- Show target code to load the value of A(J-1) into
register eax in which the "-1" computation is "folded in"
to the computation of the base address of A. Note that
the bounds checking code will look a little different than in
part (a).
- Can loop unrolling ever be unsafe? Why or why not?
- Write an assembly language program that takes zero or more command
line arguments, which should all be integers, and displays the average
of the parameters to standard output.
- Occasionally a compiler may output a sequence such as
mov [ebp-8], eax
mov eax, [ebp-8]
The second instruction can sometimes be removed, but deciding
whether it can be removed is, in general, undecidable.
Why, exactly?
- It is a well-known irritation that Ada does not allow you to write
array aggregates for zero- or one-element arrays, e.g., A := (3)
gives a static semantic error when A is a one-element array of
Integer. Why is this so? Propose a (trivial) syntactic extension to
Ada that would remove this irritation.
- In a language that supports recursion, there may be multiple
activations of a subprogram on the dynamic chain, and hence
stack allocations of frames are generally used. However,
subprograms that do not themselves make calls need not use stack
frames. More generally, any subprogram that can never appear twice
on a dynamic chain does not require a stack frame. Describe
how to compute the set of all such subprograms at compile time.
- What exactly must be the case for a subprogram to not need a
static link in its stack frame? Think up as many cases as possible.
- Give an abstract syntax tree for the following Java code fragment:
if (x > 2 || !String.matches(f(x))) {
    write(-3 * q);
} else if (! here || there) {
    do {
        while (close) tryHarder();
        x = x >>> 3 & 2 * x;
    } while (false);
    q[4].g(6) = person.list[2];
} else {
    throw up;
}
- The reachability problem is to determine for a given instruction,
whether or not it might be executed for some run of the program.
To optimize a program for space, we need to solve the reachability
problem and remove all unreachable instructions. Show that this
is impossible by reducing the halting problem to the
reachability problem.
- In Ada, the declarations
X: Integer := X + 1;
Foo: Foo;
Bar: Real := Bar(Foo);
(where global declarations of X, Foo and Bar are visible) are
all illegal, since a declaration of an identifier hides global
declarations of the same name immediately at the point it
appears in the text, but the identifier may not be used until its
declaration is complete. Give an alternate interpretation
under which these declarations would be legal and explain the
advantages and disadvantages of it from both the programmer's
and the compiler writer's perspectives.
- In C++, two functions that differ only in return type may not
overload each other. In Ada they may. What is the
reason for this situation? Even though Ada does allow this flexibility
in overloading, the compiler needs some sophistication. What exactly
is involved? Be very precise in your explanation and illustrate it with code
fragments.
- Many programming languages require that in order to have
mutually recursive functions, the programmer first define one
header (name, return types, parameters and parameter types), then
the entire second function, then the entire first function. For
example, in C++:
int f(int x, char y);
void g(int x) {if (x < 0) f(2, 'c');}
int f(int x, char y) {g(randomInteger());}
In C++, when f is finally declared, the names
of the formal parameters don't have to be repeated exactly as they
appeared in the incomplete specification. But in Ada they do.
Explain why the Ada rule makes life much easier for the compiler
writer.
- Here is a cool little functional language:
PROGRAM → (DECL ';')* EXPR
DECL → 'val' ID '=' EXPR
| 'fun' ID '(' PARAMS? ')' '=' EXPR
EXPR → NUMLIT | ID | UOP EXPR | EXPR BOP EXPR
| EXPR '?' EXPR ':' EXPR | ID '(' ARGS? ')' | '(' EXPR ')'
PARAMS → ID (',' ID)*
ARGS → EXPR (',' EXPR)*
UOP → '-' | 'abs' | 'not'
BOP → '+' | '-' | '*' | '/' | 'mod' | 'and' | 'or' | '==' | '<'
- Why is this called a functional language?
- Is the grammar ambiguous? Why or why not?
- Give a semantic object hierarchy for this language.
- Write a greatest common divisor function in this language.
- Give three examples of syntax errors and three examples of static semantic errors in this language. Make sure to write down all your assumptions; I did not give you any semantics so you will have to make up something reasonable.
- Many languages have a syntax rule
DESIGNATOR → DESIGNATOR "." ID
for specifying variables made up from a record and a field of
the record. But sometimes it can have the additional interpretation
that the DESIGNATOR
to the left of the dot is the name of a (visible) subprogram and
the ID is an object declared immediately inside that
subprogram. Show how to rearchitect the semantic object
class hierarchy to support this.
- The x86 has an enter instruction which automatically makes a
display. Research this instruction. Suppose a Carlos program
had the following structure (indentation determines nesting):
function f, parameters: [x,y], locals: [a]
function g, parameters: [c], locals: [p,q,r,s]
function h, parameters: [a], locals: []
function k, parameters: [], locals: [z]
- Show what the runtime stack looks like from the call sequence
f→g→k→g→h→h→f→k
- What does the generated assembly language look like when
trying to access the value of f.x from h?
- Which parts of the Carlos compiler need to be rewritten
to use this instruction?
- The ENTER instruction is rarely used because it is slow. Show how
slow it is by doing the following. Prepare a table with four
columns. The left column will be:
enter n, 0
enter n, 1
enter n, 2
enter n, 3
...
and so on. The second column will be the number of clock cycles required
on a Pentium for the particular ENTER instruction. The third column will
be code equivalent to the ENTER instruction. For example, ENTER n, 1 is
equivalent to:
push ebp
mov ebp, esp
push ebp
sub esp, n
The fourth column will be the number of clocks for the code
in column 3.
- Write JavaCC (TOKEN spec and parsing functions only) for
A -> Ac | d
B -> (a|b)*A | bba*c
- Show x86 code for the expression
x / y > (3 * x) || z || x < 3
where the "||" operator is short-circuit, and the variables x, y,
and z are all integer variables. Put the value of
the expression in eax. Write the best possible code you can for
the Pentium 4 processor.
- In Ada, C, and C++ arrays and records (structs) can be allocated
on the stack, not just on the heap. When making assignments of aggregates
to variables, compilers usually generate code to deposit the values
in temporary storage. Why is this necessary in general? After all, in
Weekdays := Day_Set(False, True, True, True, True, True, False);
we could construct the aggregate directly in the variable
Weekdays. Give an example of an assignment statement that
illustrates the necessity of constructing an aggregate in temporary
storage (before copying to the target variable).
- Write an assembly language function to compute sin(log(x))/(y-7)
where x and y are two double (64-bit float) parameters. Use the
x86 C calling convention. Also write a C program that
calls the function and displays the result.
- Write an assembly language function to compute the log base a of b,
where a and b are two double (64-bit float) parameters. Use the
x86 C calling convention. Write a C program for the unit tester
(with at least 10 assert statements).
- Describe three techniques that can be used to make a symbol table
capable of handling overloading of subprograms.
- Give highly optimized x86 code for the following:
for j := 5 to y do
    y := j * 7 + c;
    printInteger(y - 4);
end loop;
where y and c are local variables in the current
procedure at offsets
-12 and +16 respectively. (Remember that the range is evaluated only
once, the whole loop is skipped on the empty range, etc.)
Make sure you respect the overflow
semantics! Identify any induction expressions and explain how
you optimized them. Compare your hand-written code with
that generated by a real compiler.
- Suppose we wanted to enhance Hana to support separate
compilation; that is, top-level procedures and functions could be
defined in separate files, compiled separately and linked
together. From a target-code perspective this addition is
rather trivial since we have PUBLIC and EXTERN directives in
assembly language. The hard work is in enforcing the static semantics
of calls across separate files; this requires that subprogram
specifications (or "prototypes") be part of the language, and
perhaps other devices for accessing types and variables defined
in other files as well. Describe an approach to this enhancement,
detailing grammar changes, changes to the parser specification, and
changes to the entity classes and to the semantic analyzer.
- Write an x86 assembly language program that sets every third
byte of the three megabyte section of memory starting at
address b. Use the MMX registers.
- Why do language designers put functions like sqrt into a
standard library?
- Write an assembly language version of the following, using
an LEA instruction for the 3n+1 computation:
int C(int n) {
    int count = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        count++;
    }
    return count;
}
- Generate code for the following basic block, using register tracking:
y := x * 4 + z;
z := p * y;
y := z;
x := z / y << x;
- Explain why including the return type of a function in
the criteria for distinguishing functions for the purpose of
overloading would greatly increase the complexity of a Hana compiler.
- Describe, in English, the languages expressed by these
regular expressions.
- [01]*(10111[01] | 11[01][01][01][01])[01]*
- ([bc]*a[bc]*a[bc]*)*
- 0*1 | 0*10
- c*a[ac]*b[abc]*
- Show both naive and optimized intermediate code (entity graph),
and both naive and optimized assembly language for:
if (x % 4096 == 0) {printf("Don't say \66;\6f;\6f;!");}
Hint: you need strength reduction, too.
- One kind of strength reduction is replacing division
by a power of two with an arithmetic right shift, for example
sar eax, 10 to divide by 1024
sar eax, 8 to divide by 256
This optimization is not safe. Explain why. Show how to
make it safe, and explain both why your optimization works
and why it is safe.
- Write a NASM function that takes in four doubles
and returns the product of the largest and the smallest
argument. Assume the function will be called from
a C program built under gcc running on a Pentium II or above.
Note that you need to respect the calling convention.
Do not use conditional jumps in your code.
- Here is a small expression language
EXP → EXP EXP OP | INTLIT
OP → + | - | * | /
- What language is this?
- Is the grammar ambiguous? Why or why not?
- Is it LL(k) for any k? If so, for which k? If not, why not?
- Give a class hierarchy of entities for this language.
- Give an attribute grammar for this language that can be used to evaluate
expressions.
- Write an assembly language function that returns the dot product
of two single-precision floating point arrays using the XMM registers.
Implement a unit tester in C.
- Which of the following expressions are legal in Java
(assuming x and y are integer variables)? State why they
are legal or why they are not.
- x---y
- x-----y
- Draw the abstract syntax tree for the following Java compilation unit.
(Make sure it is fairly abstract):
package p;
class C implements A {
    public static A x = new t[3];
    Socket s () {
        while (x - 6>p | e || q +- p) {
            this.x[3] = !v+++t;
        }
    }
    {System.out.println("ooh");}
}
- Classify the following as a syntax error, semantic error,
or "not a compile time error at all". In the case where
code is given, assume all identifiers are properly declared
and in scope. All items refer to the Java language.
- x+++-y
- x---+y
- incrementing a read-only variable
- accessing a private field in another class
- Using an uninitialized variable
- Dereferencing a null reference
- null instanceof C
- !!x
- EBNF generally uses
- A B to mean exactly one A followed by exactly one B
- A? to mean zero or one A
- A* to mean zero or more As
- A | B to mean either exactly one A OR exactly one B
Suppose I wanted to add a new one:
- A_1 # A_2 # ... # A_n to
mean "a non-empty string in which each of the A_i's appears
zero or one times, but in any order."
Show how to write A # B # C using only the conventional EBNF
markup.
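Before writing the EBNF, it can help to pin down exactly which strings the proposed "#" operator admits. The following Python sketch checks the stated definition directly and enumerates the strings for a small case. It treats each A_i as a single distinct token, which is a simplifying assumption, and the function names are made up for illustration.

```python
from itertools import permutations


def matches_hash(seq, symbols):
    """True when seq is non-empty, uses only the given symbols, and repeats none."""
    return (
        len(seq) > 0
        and all(s in symbols for s in seq)
        and len(set(seq)) == len(seq)
    )


def hash_language(symbols):
    """Every string admitted by  A_1 # ... # A_n,  each as a list of tokens."""
    return [
        list(p)
        for k in range(1, len(symbols) + 1)  # k = number of symbols that appear
        for p in permutations(symbols, k)    # every ordering of every k-subset
    ]


if __name__ == "__main__":
    strings = hash_language(["A", "B", "C"])
    print(len(strings))  # 15 = 3 + 6 + 6
    for s in strings:
        print(" ".join(s))
```

Any conventional-EBNF answer for A # B # C can be checked against this list: it should accept all of these strings and nothing else.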
- Write the following in assembly language (use the C calling
convention). It is supposed to compute a*log10(b).
Use the
fyl2x and fldl2t instructions.
double f(double a, double b);
- Write a small C++ function or Java method that
sums up an array of ints and throws an exception if
the sum is odd. If the sum is even return true.
In the catch clause for the exception, return false.
Run timing studies to show the real cost of the exception.
For example, call the method a million times for
each possible return value, and report the aggregate CPU time
for all the runs returning false and for those returning
true.
- Write regular expressions for
- Octal constants in C
- Unsigned binary numbers divisible by 8
- Hexadecimal numerals divisible by 8 (signed or unsigned!)
- Floating point constants that are not allowed to have an empty
fractional part and can have no more than three digits in the exponent
part
- The set of all character strings that contain
neither the substring "exit" nor "exec"
- Optimize the following. "Show your work" (that is, show a few
intermediate steps toward your final solution, recording the optimizations
you performed). You can abbreviate CP=copy propagation, CF=constant folding,
DCE=dead code elimination. You'll want to use more than just these three
techniques.
L1:
r0 := x
z := 6
r1 := 4 - r0
r2 := 3 >= r1
if r2 == 0 goto L2
r3 := y + 4
r4 := *r3
z := r4
L2:
- Write a NASM assembly language function
that returns the sum of the reciprocals of all
the elements in an array of doubles. Use the C calling
convention -- so the function accepts the array and a length.
You should have a copy of my function for summing up an
array, so this should be pretty easy.
- Classify the following as (a) lexical error, (b) syntax error,
(c) static semantic error, (d) dynamic semantic error, or (e)
no error.
- A function call with no matching signature in Hana.
- A function call with no matching signature in C.
- x < y < z in Hana, where x and y are ints and z is a boolean.
- x < y < z in C, where x and y and z are all ints.
- 3[a] in Hana, where a is an array variable.
- 3[a] in C, where a is an array variable.
- char x = '\a'; in Hana.
- char x = '\a'; in C.
- Value returning function without a return statement, in Hana.
- Value returning function without a return statement, in C.
- Semicolon after a block, in Hana.
- Semicolon after a block, in C.
- Show a Hana abstract syntax tree for:
struct s {boolean x; s y;}
s a(int p, ...) {
    s[] x = new s[]{null, null};
    print($s.p[a] + substring("ff", #s << p|-2));
}
- Suppose we added to Hana a simple exception facility, like that
of C++, in which anything can be thrown or caught. This facility
doesn't have a finally clause, like Java does, but it does make
use of the "..." symbol for a catch-all clause which must be
the last catch clause, if it exists. Show changes to the Hana lexical
specification, grammar, and semantic rules to support this facility.
- Write JavaCC specs for a parser that recognizes the language
of this grammar:
G -> (S s G)?
S -> V q e | i f E g | V x
V -> i | V d i | V a E a
E -> n | V
- Suppose we changed the definition
of Hana so that no identifier could be three characters long and
end with "oo" (or "oO" or "Oo" or "OO").
- Write a regex for alphanumeric strings beginning with a letter
that are not three characters long ending case-insensitively
with "oo".
- Show how to modify the Hana JavaCC specification to disallow
these offensive strings as identifiers, without changing
the token specification for ID.
- Show how to modify the Hana JavaCC specification to disallow
these offensive strings as identifiers, by only changing
the token specification for ID.
- Write a Java method or Perl subroutine that returns whether its
input string is a three character alphanumeric string
ending, case insensitively, in "oo". Do this by matching against
a regular expression.
- Write, by hand, a super-efficient Squid fragment for the following
Hana fragment:
struct s {int x; int y; string s;}
s a = new s {
codepoint(getChar()), codepoint(getChar()), getString()};
while (a.x++ < a.y) {print($a.s[1]);}
- What does this code do? For what ranges of n does it
make sense?
mov eax, n
shl eax, 23
add eax, 3f800000h
mov [esp-4], eax
fld dword [esp-4]
- Is this grammar an LL grammar?
A -> B C
B -> a | b?c?
C -> c | BA
(If you are having trouble with this, and you have time,
you can write a JavaCC specification and "try it out".)
If you find that this grammar is not LL, make one that is (that defines
the same language of course).
- We've seen that one way to deal with ugly code in curly
brace languages is to require blocks in compound statements;
for example:
IFSTMT -> 'if' '(' EXP ')' BLOCK
('else' 'if' '(' EXP ')' BLOCK)*
('else' BLOCK)?
BLOCK -> '{' STMT* '}'
What if we tried the same approach in a language with
a syntax like Ruby (or Fortran or Modula — languages
using a terminating end)? We might get a grammar like
this:
IFSTMT -> 'if' EXP 'then' STMT+
('else' 'if' EXP 'then' STMT+)*
('else' STMT+)?
'end'
Is this grammar left recursive? Is it LL(k)? Why or why
not?