The syntax of a programming language is the set of rules that define which arrangements of symbols comprise structurally legal programs.
In the case of traditional textual languages, "arrangements of symbols" can be read as "strings of symbols."
We say structurally legal because we want to distinguish between (1) errors of well-formedness and (2) errors of meaning. For example, this Java program
class A {2 / == {{;
is not well-formed; it does not have the structure of a Java program. A compiler would report a syntax error. But this program
class A {int x = y;}
is structurally well formed. According to the official syntax of the Java language, this program is fine. The compilation unit consists of a type declaration which is a class declaration whose body consists of a field declaration with a type, a name, and an initializing expression which is an identifier. A compiler will not detect a syntax error; it will, however, detect a semantic error since the identifier y has not been declared.
The same goes for this C program:
/*
* This C program has no syntax errors, but it does have
* static semantic errors.
*/
int main() {
f(x)->p = sqrt(3.3);
return "dog";
}
Non-structural errors detectable by a compiler are called static semantic errors. They are almost always related to problems of context: identifiers used but not declared, identifiers redeclared, the wrong number of arguments passed to a subroutine, assignment to a read-only variable, etc. In practice, structure, or syntax, is context-free.
It is best to start by writing down examples of both legal programs (strings) and illegal ones, then try to figure out the general rules from these examples.
Example: Consider the language of integer arithmetic expressions over the operators {+, -, *, /}. The expressions may contain parentheses.
| Examples of strings in this language |
Non-Examples |
|---|---|
432 24*(31/899+3-0)/(54/2+4+2*3) (2) 8*(((3-6))) |
432) 24*(31////)/(5+---+)) [fwe]23re31124efr$#%^@ --2-- |
You can then start identifying the set of legal symbols in the language, the different kinds of structural components (digits, numbers, expressions), and the rules such as "one way to make a expression is to put a "*" between two other expressions. To write out the actual syntax, you can use a CFG, BNF, EBNF or Syntax Diagrams
Formally, a context free grammar, or CFG, is a quadruple (T, C, P, S), where
The language defined by a grammar (T, C, P, S) is {s ∈ T* | S ⇒P* s}.
A CFG for the example language is:
(
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, *, /, (, )},
{E, N, D},
{
(D, 0), (D, 1), (D, 2), (D, 3), (D, 4), (D, 5), (D, 6),
(D, 7), (D, 8), (D, 9), (N, D), (N, ND), (E, N), (E, E+E),
(E, E-E), (E, E*E), (E, E/E), (E, (E))
},
E
)
Here E stands for "expression", N for "number", and D for "digit". Also, notice that parentheses appear in both the little language we are describing (the object language) and the mathematical notation for the grammar (the meta language). These have to be kept distinct (using colors, fonts, or plain common sense).
We usually take C, T and S from context and write the production rules (P) in a more readable form:
E → N E → E + E E → E - E E → E * E E → E / E E → ( E ) N → D N → N D D → 0 D → 1 D → 2 D → 3 D → 4 D → 5 D → 6 D → 7 D → 8 D → 9
A little allowable notational shorthand allows us to combine rules with the same left hand sides (again, be careful if your object language has a pipe in it!):
E → N | E + E | E - E | E * E | E / E | ( E ) N → D | N D D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
This isn't the only possible grammar. Another uses additional categories T (for term) and F (for factor) to make a non-ambiguous grammar whose parse trees actually suggest an operator precedence:
E → T | E + T | E - T T → F | T * F | T / F F → N | ( E ) N → D | N D D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
BNF (Backus-Naur Form) is a style of writing grammars to make them look a little prettier. The original BNF looked something like this:
<expression> ::= <number> | <expression> + <expression>
| <expression> - <expression> | <expression> * <expression>
| <expression> / <expression> | ( <expression> )
<number> ::= <digit> | <number> <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
but eventually the concern about having to deal with the possibility of languages that actually contain characters like "<", ">", "|" and so on made people realize one should quote the tokens, not the categories, so modern BNF looks like:
expression ::= number | expression '+' expression
| expression '-' expression | expression '*' expression
| expression '/' expression | '(' expression ')'
number ::= digit | number digit
digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
EBNF (Extended BNF) adds even more syntactic sugar. There are lots and lots of variants of EBNF but generally the idea is that EBNF quotes tokens rather than categories and uses fancy markup on the right hand sides, such as:
The non-ambiguous example grammar from above looks like this in EBNF:
EXP → TERM (('+' | '-') TERM)*
TERM → FACTOR (('*' | '/') FACTOR)*
FACTOR → NUMBER | '(' EXP ')'
NUMBER → ('0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9')+
You might even see
Sometimes you'll see even see:
Our example unambiguous grammar is now a bit simpler:
EXP → TERM ^ [+-]
TERM → FACTOR ^ [*/]
FACTOR → NUMBER | '(' EXP ')'
NUMBER → [0-9]+
and the ambiguous one is really simple:
EXP → [0-9]+ | EXP [*/+-] EXP | '(' EXP ')'
There is an International Standard for EBNF if you are interested.
A graphical notation; easy to generate from the EBNF. Example:

Real languages have whitespace, comments, escape characters in string and character literals, multi-word identifiers and keywords. Specifying the set of legal strings with a single set of rules is rather difficult and error-prone. Here is an attempt to define the syntax of a C-like mini-language:
PROGRAM → S* BLOCK S*
BLOCK → (DEC S* ';' S*)* (STMT S* ';' S*)+
DEC → 'var' S+ ID
| 'fun' S+ ID S* '(' IDLIST? ')' S* '=' S* EXP
STMT → ID S* '=' S* EXP
| 'read' S+ IDLIST
| 'write' S+ EXPLIST
| 'while' S+ EXP S+ 'loop' S+ BLOCK 'end'
EXP → TERM (S* ADDOP S* TERM)*
TERM → FACTOR (S* MULOP S* FACTOR)*
FACTOR → INTLIT | ID | CALL | '(' S* EXP S* ')'
CALL → ID S* '(' S* EXPLIST? S* ')'
IDLIST → ID (S* ',' S* ID)*
EXPLIST → EXP (S* ',' S* EXP)*
ADDOP → '+' | '-'
MULOP → '*' | '/'
INTLIT → DIGIT+
DIGIT → [\p{Nd}]
LETTER → [\p{L}]
ID → LETTER (LETTER | DIGIT | '_')* - KEYWORD
KEYWORD → 'var' | 'fun' | 'read' | 'write' | 'while' | 'loop' | 'end'
S → [\x20\x09\x0A\x0D] | COMMENT
COMMENT → '--' [^\x0A\x0D]* [\x0A\x0D]
How did we do?
In other words, instead of worrying about the character string
-- This doesn't write hello world
var x ; var y;-- My first program
-- uggh: lousy indentation and SPACing
while y - 5 loop
var y;
read x ,y;
x = 2 * (3+y);
end; -- ends a loop, I think
write 5;
we should care about the token string
var ID(x) ; var ID(y) ; while ID(y) - INTLIT(5) loop var ID(y) ; read ID(x) , ID(y) ; ID(x) = INTLIT(2) * ( INTLIT(3) + ID(y) ) ; end ; write INTLIT(5) ;
Notice that some of the tokens (ID and INTLIT to be specific) contain attributes. In a real compiler we could maintain other attributes, like line and column number, say.
We generally define syntax in two parts
Thus we would define the syntax above by saying:
A string is a syntactically valid program if it greedily matches S*T0S*T1S*T2...S*TnS* where
S → [\x20\x09\x0A\x0D] | '--' [^\x0A\x0D]* [\x0A\x0D]
T → INTLIT | ID | KEYWORD | SYMBOL
SYMBOL → [+\-*/=,;()]
ID → LETTER (LETTER | DIGIT | '_')* - KEYWORD
KEYWORD → 'var' | 'fun' | 'read' | 'write' | 'while' | 'loop' | 'end'
INTLIT → DIGIT+
DIGIT → [\p{Nd}]
LETTER → [\p{L}]
and T0T1...Tn is derivable from
PROGRAM → BLOCK
BLOCK → (DEC ';')* (STMT ';')+
DEC → 'var' ID
| 'fun' ID '(' IDLIST? ')' '=' EXP
STMT → ID '=' EXP
| 'read' IDLIST
| 'write' EXPLIST
| 'while' EXP 'loop' BLOCK 'end'
IDLIST → ID (',' ID)*
EXPLIST → EXP (',' EXP)*
EXP → TERM ([+-] TERM)*
TERM → FACTOR ([*/] FACTOR)*
FACTOR → INTLIT | ID | CALL | '(' EXP ')'
CALL → ID '(' EXPLIST? ')'
Pulling out the skips from the macrosyntax makes the grammar much clearer. Derivations and parse trees need only contain tokens and categories, not characters, for example:

(Again, note how attributes are carried along for some of the tokens.)
You can go further and argue that certain punctuation like commas, semicolons, parentheses, are only artificial devices to make what should be a hierarchical program be a flat one. The direct specification of the higher level structure is called abstract syntax. Here is the abstract syntax tree corresponding to the parse tree we saw earlier:

What disappears from syntax when we move to abstract syntax?
Abstract syntax is normally expressed with a tree grammar (recall that a tree grammar produces trees, whereas the Chomsky grammars we saw above produce strings). For our example language above:
Program → Block Block → Dec* Stmt+ Var: Dec → id Fun: Dec → id id* Exp Assign: Stmt → Varref Exp Read: Stmt → id* Write: Stmt → Exp* While: Stmt → Exp Block Plus: Exp → Exp Exp Minus: Exp → Exp Exp Times: Exp → Exp Exp Divide: Exp → Exp Exp Varref: Exp → id Intlit: Exp → numeral Call: Exp → id Exp*
To show how ASTs are produced from source code, we generally augment the EBNF with attribute evaluation rules. One way to do this in our little language is this:
P → B
P.tree := (Program B.tree)
B → D1 ';' ... Dn ';' S1 ';' ... Sn ';'
B.tree := (Block D1.tree ... Dn.tree S1.tree ... Sn.tree)
D → var I
D.tree := (Var I.name)
D → fun ID '(' I1 ',' ... ',' In ')' '=' E
D.tree := (Fun I.name I1.name ... In.name E.tree)
S → I '=' E
S.tree := (Assign (Varref I.name) E.tree)
S → read I1 ',' ... ',' In
S.tree := (Read (Varref I1.name) ... (Varref In.name))
S → write E1 ',' ... ',' En
S.tree := (Write E1.tree ... En.tree)
S → while E loop B end
S.tree := (While E.tree B.tree)
E → T
E.tree := T.tree
E → E1 '+' T
E.tree := (Plus E1.tree T.tree)
E → E1 '-' T
E.tree := (Minus E1.tree T.tree)
T → F
T.tree := F.tree
T → T1 '*' F
T.tree := (Times T1.tree F.tree)
T → T1 '/' F
T.tree := (Divide T1.tree F.tree)
F → N
F.tree := (Intlit N.value)
F → I
F.tree := (Varref I.name)
F → C
F.tree := C.tree
F → '(' E ')'
F.tree := E.tree
C → I '(' E1 ',' ... ',' En ')'
C.tree := (Call I.name E1.tree ... En.tree)
This uses the (ROOT SUBTREE ... SUBTREE) notation for trees. Another approach is to treat the tree as an object graph with named fields: this is useful when actually coding! In this case the abstract syntax tree looks like

and the attribute grammar is written:
P → B
P.tree := Program(block=B.tree)
B → D1 ';' ... Dn ';' S1 ';' ... Sn ';'
B.tree := Block(decls=[D1.tree,...,Dn.tree], stmts=[S1.tree, ..., Sn.tree])
D → var I
D.tree := Var(name=I.name)
D → fun ID '(' I1 ',' ... ',' In ')' '=' E
D.tree := Fun(name=I.name, params=[I1.name, ..., In.name], body=E.tree)
S → I '=' E
S.tree := Assign(left=Varref(name=I.name), right=E.tree)
S → read I1 ',' ... ',' In
S.tree := Read(vars=[Varref(name=I1.name), ..., Varref(name=In.name)])
S → write E1 ',' ... ',' En
S.tree := Write(exps=[E1.tree, ..., En.tree])
S → while E loop B end
S.tree := While(condition=E.tree, body=B.tree)
E → T
E.tree := T.tree
E → E1 '+' T
E.tree := Plus(left=E1.tree, right=T.tree)
E → E1 '-' T
E.tree := Minus(left=E1.tree, right=T.tree)
T → F
T.tree := F.tree
T → T1 '*' F
T.tree := Times(left=T1.tree, right=F.tree)
T → T1 '/' F
T.tree := Divide(left=T1.tree, right=F.tree)
F → N
F.tree := Intlit(value=N.value)
F → I
F.tree := Varref(name=I.name)
F → C
F.tree := C.tree
F → '(' E ')'
F.tree := E.tree
C → I '(' E1 ',' ... ',' En ')'
C.tree := Call(name=I.name, args=[E1.tree, ..., En.tree])
It can be argued that abstract syntax captures the essence of a language's structure because multiple concrete syntaxes can be mapped to a single abstract syntax. Here are some example programs in various languages that all have the same abstract syntax tree above:
program
declare x, y
// In this language, indentation defines blocks
while (y - 5)
declare y
get(x)
get(y)
x := 2 * (3 + y)
put(5)
new x, y;
while (y - 5) {
new y;
STDIN -> x;
STDIN -> y;
x <- 2 * (3 + y);
}
STDOUT <- 5;
COMMENT THIS LOOKS LIKE OLD CODE
DECLARE INT X.
DECLARE INT Y.
WHILE Y MINUS 5 IS NOT ZERO LOOP:
DECLARE INT Y.
READ FROM STDIN INTO X.
READ FROM STDIN INTO Y.
MOVE PRODUCT OF 2 AND (SUM OF 3 AND Y) INTO X.
END LOOP.
WRITE 5 TO STDOUT.
(program
(define x y)
(while (- y 5)
(seq (define y) (read x y) (assign x (* 2 (+ 3 y))))
)
(write 5)
)
<program>
<declare vars="x y"/>
<while>
<minus><varref ref="y"/><intlit value="5"/></minus>
<declare vars="y"/>
<read vars="x y"/>
<assign varref="x">
<times>
<intlit value="2"/>
<plus>
<intlit value="3"/>
<varref ref="y"/>
</plus>
</times>
</assign>
</while>
<write>
<intlit value="5"/>
</write>
</program>
Can't we just write a context-sensitive grammar to define the set of legal programs? HAH!!! Context-sensitive grammars are
Try the following exercise if you're not convinced: