Language Theory Basics

To make a useful study of language translation and interpretation, it is useful to have a simple theory.

Strings

A symbol is a primitive unit of data representation.

An alphabet is a finite set of symbols.

Example: Σ = {a, b, c}.

A string is a finite sequence of symbols over some alphabet.

Examples: w = abbcbabb, x = b, y = aaa.

The length of a string is the number of symbol occurrences in the string.

Example: if w = abbac, then |w| = 5.

The empty string, denoted ε, is the (unique) string of length 0.

The concatenation of two strings x and y is the string obtained by appending y at the end of x. We define w0 = ε, w1 = w, w2 = ww, and wk+1 = wwk.

Example: if x = abb and y = ca then xy = abbca.

Example: wε = εw = w.

Example: (abc)4 = abcabcabcabc.

The reversal wR of a string w is the string formed by reversing w; formally εR = ε and (σw)R = wRσ for some symbol σ

x is a substring of y iff ∃v ∃w. y = vxw.

Languages

A language is a set of strings over some alphabet.

The concatenation of two languages L1 and L2 is L1L2 = {xy | x ∈ L1 ∧ y ∈ L2}. Also, we define L0 = {ε}, L1 = L, L2 = LL, ... Lk+1 = LLk.

The Kleene Closure of L is L* = L0 ∪ L1 ∪ L2 ∪ ... = {w1w2...wk | wi ∈ L ∧ k ≥ 0}.

The Positive Closure of L is L+ = L1 ∪ L2 ∪ L3 ∪ ... = {w1w2...wk | wi ∈ L ∧ k > 0}.

If Σ is an alphabet then Σ is also a language over Σ. So is Σ*. Σ* is important: it is the set of all strings over Σ. So for any language L over Σ, L ⊆ Σ*. Also important is that &Sigma* is countable.

Examples:

Σ = {a,b}
Σ* = {ε,a,b,aa,ab,ba,bb,aaa,aab,aba,...}
L1 = {ba, b}
L2 = {abb, bb}
L1* = {ε,ba,b,baba,bab,bba,bb,...}
L1+ = {ba,b,baba,bab,bba,bb,...}
L1L2 = {baabb, babb, bbb}

Some Language Theorems

Let Σ be an alphabet and A, B, C, and D be languages over Σ (i.e. subsets of &Sigma*).

A∅ = ∅A = ∅
A{ε} = {ε}A = A
(AB)C = A(BC)
(A ⊆ B) ∧ (C ⊆ D) ⇒ AC ⊆ BD
A(B ∪ C) = AB ∪ AC
(B ∪ C)A = BA ∪ CA
A(B ∩ C) ⊆ AB ∩ AC
(B ∩ C)A ⊆ BA ∩ CA
A* = A+ ∪ {ε}
An ⊆ A*, n ≥ 0
An ⊆ A+, n ≥ 1
A ⊆ AB*
A ⊆ B*A
A ⊆ B ⇒ A* ⊆ B*
A ⊆ B ⇒ A+ ⊆ B+
AA* = A*A = A+
{ε} ∈ A iff A+ = A*
(A*)* = A*A* = A*
(A*)+ = (A+)* = A*
A*A+ = A+A* = A*
(A*B*)* = (A ∪ B)* = (A* ∪ B*)*
(C = AC ∪ B) ⇒ (C = A*B)

Types of Languages

One way to characterize languages is by how much work is required to determine whether or not a given string is in the language.

The Chomsky Hierarchy is one such classification scheme:

chomsky.png

The main language types in this hiearchy are:

Language Type Description Grammar Recognizer
Finite Contains only a finite number of strings. Enumeration Loop-free FSM
Regular
(Type 3)
Strings can be recognized by processing them one symbol at a time, without backtracking, and without using additional memory. Regular FSM
Deterministic Context Free
(LR)
Strings can be recognized by processing them one symbol at a time, without backtracking, using only a stack for external storage. (Stacks are nice! Θ(1) access!) DCFG DPDA
Context Free
(Type 2)
Strings can be recognized using only a stack for external storage. (Backtracking may occur while scanning the input.) CFG PDA
Context-Sensitive
(Type 1)
Strings can be recognized using a memory bounded by some linear function of the string's length. CSG LBA
Recursive String membership is decidable. There are no predetermined memory bounds, and no bounds on the number of steps. ? TM that always halts
Recursively Enumerable
(Type 0, r.e.)
String membership is partially decidable: if a string is in the language, you can determine that for sure; if a string is not, you might not be able to determine this. Unrestricted Chomsky Grammar Turing Machine

There do exist non-recursively enumerable languages. In these languages string membership cannot even partially decided. Why? Because, for a given alphabet, the set of all r.e. languages is countable but the set of all languages is uncountable.

Among the non-r.e. languages, there are

The are hundreds of other classes of languages. Perhaps the two most famous are:

Note, by definition, P ⊆ NP, but no one has been able to prove whether or not P = NP. There is cash prize waiting for you if you can do it.

Exercise: Research and provide definitions for these other classes of languages: LL, LOGSPACE, PSPACE, EXPSPACE, EXPTIME, LALR, SLR, BPP, and LINEAR-SIZE.

If you're interested in this stuff, check out the Complexity Zoo.