How do we encode characters into bit strings? To answer this we need to know what a character is, what a character set is, and what a character encoding is.
A character is an abstract symbol. A character has a name. Examples:
PLUS SIGN
CYRILLIC SMALL LETTER TSE
BLACK CHESS KNIGHT
MUSICAL SYMBOL FERMATA BELOW
Do not confuse a character with a glyph, which is a picture of a character. Two or more characters can share the same glyph (e.g. LATIN CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA), and one character can have many glyphs (think fonts).
A character set has two parts: (1) a repertoire, which is a set of characters, and (2) a code position mapping, which is a function mapping non-negative integers to characters in the repertoire. When an integer i maps to a character c we say i is the codepoint of c.
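The two parts of a character set can be modelled directly. The following is a minimal sketch, not a real implementation: `toy_charset` and `codepoint_of` are hypothetical names invented for illustration, and the four entries are a tiny hand-picked subset.

```python
# A toy character set: the dict is the code position mapping
# (integer -> character name), and its values form the repertoire.
toy_charset = {
    0x2B: "PLUS SIGN",
    0x41: "LATIN CAPITAL LETTER A",
    0x391: "GREEK CAPITAL LETTER ALPHA",
    0x446: "CYRILLIC SMALL LETTER TSE",
}

repertoire = set(toy_charset.values())


def codepoint_of(name):
    """Return the integer that maps to the named character (hypothetical helper)."""
    for codepoint, char_name in toy_charset.items():
        if char_name == name:
            return codepoint
    raise KeyError(name)


# The two letters below may share a glyph, but they are distinct
# characters and therefore have distinct codepoints.
print(hex(codepoint_of("LATIN CAPITAL LETTER A")))
print(hex(codepoint_of("GREEK CAPITAL LETTER ALPHA")))
```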
An example of a character set is the Universal Character Set which happens to be identical to another character set called Unicode. Here is part of UCS:
25 PERCENT SIGN
2B PLUS SIGN
54 LATIN CAPITAL LETTER T
5D RIGHT SQUARE BRACKET
B0 DEGREE SIGN
C9 LATIN CAPITAL LETTER E WITH ACUTE
2AD LATIN LETTER BIDENTAL PERCUSSIVE
39B GREEK CAPITAL LETTER LAMDA
446 CYRILLIC SMALL LETTER TSE
543 ARMENIAN CAPITAL LETTER CHEH
5E6 HEBREW LETTER TSADI
635 ARABIC LETTER SAD
784 THAANA LETTER BAA
94A DEVANAGARI VOWEL SIGN SHORT O
9D7 BENGALI AU LENGTH MARK
BEF TAMIL DIGIT NINE
D93 SINHALA LETTER AIYANNA
F0A TIBETAN MARK BKA- SHOG YIG MGO
11C7 HANGUL JONGSEONG NIEUN-SIOS
1293 ETHIOPIC SYLLABLE NAA
13CB CHEROKEE LETTER QUV
2023 TRIANGULAR BULLET
20A4 LIRA SIGN
2105 CARE OF
213A ROTATED CAPITAL Q
21B7 CLOCKWISE TOP SEMICIRCLE ARROW
2226 NOT PARALLEL TO
2234 THEREFORE
265E BLACK CHESS KNIGHT
1D111 MUSICAL SYMBOL FERMATA BELOW
1D122 MUSICAL SYMBOL F CLEF
Unicode code points are traditionally written with U+ followed by four to six hex digits (e.g. U+00C9, U+1D122).
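As a quick check of the listing and the U+ notation, here is a small Python sketch. It assumes the standard `unicodedata` module, which reports the names from whatever Unicode version ships with the interpreter; `u_plus` is a hypothetical helper name.

```python
import unicodedata


def u_plus(codepoint):
    """Format a code point in U+ notation, padded to at least four hex digits."""
    return f"U+{codepoint:04X}"


# Look up a few of the code points from the listing above.
for cp in (0x2B, 0xC9, 0x446, 0x265E, 0x1D122):
    print(u_plus(cp), unicodedata.name(chr(cp)))
```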
The entire character set is described in the two files http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and http://www.unicode.org/Public/UNIDATA/Unihan.txt. The codepoints are not assigned haphazardly: see http://www.unicode.org/Public/UNIDATA/Blocks.txt.
ISO8859-1 is a character set that is exactly equivalent to the first 256 mappings of Unicode. Obviously it doesn't have enough characters.
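The equivalence with the first 256 Unicode mappings can be observed with Python's built-in codec, which it names `iso-8859-1`. This is only an illustrative check, not a proof about the standard itself.

```python
# Decode every possible byte value as ISO8859-1.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")

# Each byte value b comes back as the character with code point b.
same_as_unicode = [ord(ch) for ch in decoded] == list(range(256))
print(same_as_unicode)

# Characters outside the 256-character repertoire cannot be represented.
try:
    "\u0446".encode("iso-8859-1")  # CYRILLIC SMALL LETTER TSE
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)
```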
The other charsets in the ISO8859 family also have 256-character repertoires. They all share the same characters in the first 128 positions, but differ in the next 128. See http://www.unicode.org/Public/MAPPINGS/ISO8859/.
Windows-1252, also known as CP1252, is a character set with a repertoire of 256 characters. It can be found at http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT. It is very close to ISO8859-1. Be careful with this set! Users of Windows systems often unknowingly produce documents with this character set, then forget to specify it when making these documents available on the web or transporting them via other protocols that tend to default to Unicode. The reader then sees the wrong characters wherever the two character sets disagree. It's best to avoid this.
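To make the failure concrete, here is a hedged Python sketch. It assumes the receiving side decodes as UTF-8 (one common "Unicode default"; the text above does not name a specific encoding) and uses Python's built-in `cp1252` codec.

```python
# One position where CP1252 and ISO8859-1 disagree: byte 0x80.
as_cp1252 = b"\x80".decode("cp1252")       # EURO SIGN
as_latin1 = b"\x80".decode("iso-8859-1")   # a control character
print(f"U+{ord(as_cp1252):04X}", f"U+{ord(as_latin1):04X}")

# A document saved as CP1252 with curly quotes around the word "hi".
cp1252_bytes = "\u201chi\u201d".encode("cp1252")

# Read back as UTF-8 (assumed default), the quote bytes are invalid;
# errors="replace" substitutes U+FFFD REPLACEMENT CHARACTER for each.
misread = cp1252_bytes.decode("utf-8", errors="replace")
print(cp1252_bytes, [f"U+{ord(ch):04X}" for ch in misread])
```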
ASCII is a character set that is exactly equivalent to the first 128 mappings of Unicode. Obviously it doesn't have enough characters. However it is commonly used! It's a good "lowest common denominator" and many Internet protocols require it!
A character encoding specifies how a character (or character string) is encoded in a bit string. There are many, many encodings of Unicode. The most important are UTF-32, UTF-16 and UTF-8.
UTF-32 is the simplest: just encode each character in 32 bits. The encoding of a character is simply its code point. Couldn't be more straightforward. The catch is the cost: good luck convincing people to actually use four bytes per character.
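The 32-bit scheme (UTF-32) fits in one line of Python. This sketch assumes big-endian byte order, which the description above does not specify; `utf32_encode` is a hypothetical helper name.

```python
def utf32_encode(codepoint):
    """Encode one code point as four bytes, most significant byte first."""
    return codepoint.to_bytes(4, "big")


for cp in (0x5D, 0xC9, 0x1D122):
    print(f"U+{cp:04X}", utf32_encode(cp).hex())
```

Python's own `utf-32-be` codec produces the same bytes, which makes it a convenient cross-check.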
In UTF-16 some characters are encoded in 16 bits and some in 32 bits.
| Character Range | Bit Encoding |
|---|---|
| U+0000 ... U+FFFF | xxxxxxxx xxxxxxxx |
| U+10000 ... U+10FFFF | let y = X - 0x10000 in 110110yy yyyyyyyy 110111yy yyyyyyyy |
UTF-16 simply cannot encode codepoints beyond U+10FFFF. So far this is not a problem. Note also that the existence of UTF-16, and its blessing by the Unicode Consortium means that U+D800 through U+DFFF cannot be legal characters. Hack!?
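The two rows of the UTF-16 table translate into a short function. This is a sketch under assumptions: big-endian byte order, a hypothetical helper name `utf16_encode`, and a `ValueError` for the inputs that UTF-16 cannot represent.

```python
def utf16_encode(codepoint):
    """Encode one code point as big-endian UTF-16 (one or two 16-bit units)."""
    if 0xD800 <= codepoint <= 0xDFFF:
        # Reserved for surrogates, so these are not legal characters.
        raise ValueError("surrogate code point")
    if codepoint > 0x10FFFF:
        raise ValueError("UTF-16 cannot encode beyond U+10FFFF")
    if codepoint <= 0xFFFF:
        return codepoint.to_bytes(2, "big")
    y = codepoint - 0x10000        # 20 significant bits
    high = 0xD800 | (y >> 10)      # 110110yy yyyyyyyy (top 10 bits of y)
    low = 0xDC00 | (y & 0x3FF)     # 110111yy yyyyyyyy (bottom 10 bits of y)
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")


print(utf16_encode(0x13CB).hex(), utf16_encode(0x1D122).hex())
```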
UTF-8 is another variable-length encoding.
| Character Range | Bit Encoding | (Bits) |
|---|---|---|
| U+0000 ... U+007F | 0xxxxxxx | 7 |
| U+0080 ... U+07FF | 110xxxxx 10xxxxxx | 11 |
| U+0800 ... U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx | 16 |
| U+10000 ... U+1FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 21 |
| U+200000 ... U+3FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | 26 |
| U+4000000 ... U+7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | 31 |
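One consequence of the bit patterns in the table is that the first byte of a sequence announces its length, and every later byte starts with 10. The sketch below illustrates this; `sequence_length` and `count_characters` are hypothetical helper names, and the input is assumed to be well-formed.

```python
def sequence_length(lead_byte):
    """Number of bytes in the sequence that begins with lead_byte."""
    if lead_byte < 0x80:
        return 1                                   # 0xxxxxxx
    if lead_byte < 0xC0:
        raise ValueError("continuation byte")      # 10xxxxxx
    if lead_byte < 0xE0:
        return 2                                   # 110xxxxx
    if lead_byte < 0xF0:
        return 3                                   # 1110xxxx
    if lead_byte < 0xF8:
        return 4                                   # 11110xxx
    if lead_byte < 0xFC:
        return 5                                   # 111110xx
    if lead_byte < 0xFE:
        return 6                                   # 1111110x
    raise ValueError("not a lead byte")


def count_characters(data):
    """Count characters by counting the bytes that are not 10xxxxxx."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


sample = "\u00c9+\u0446".encode("utf-8")   # three characters, five bytes
print(len(sample), count_characters(sample))
```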
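UTF-8 rocks. The number of advantages it has is stunning. For example, ASCII text is already valid UTF-8: every code point up to U+007F is encoded as the single byte 0xxxxxxx, identical to its ASCII code.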
To stream out the UTF-8 bytes for a non-negative code point c, use:

    if (c <= 0x7F) {
        // 1 byte: 0xxxxxxx
        emitByte(c);
    } else if (c <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        emitByte(0xC0 | (c >> 6));
        emitByte(0x80 | (c & 0x3F));
    } else if (c <= 0xFFFF) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        emitByte(0xE0 | (c >> 12));
        emitByte(0x80 | ((c >> 6) & 0x3F));
        emitByte(0x80 | (c & 0x3F));
    } else if (c <= 0x1FFFFF) {
        // 4 bytes: 11110xxx followed by three continuation bytes
        emitByte(0xF0 | (c >> 18));
        emitByte(0x80 | ((c >> 12) & 0x3F));
        emitByte(0x80 | ((c >> 6) & 0x3F));
        emitByte(0x80 | (c & 0x3F));
    } else if (c <= 0x3FFFFFF) {
        // 5 bytes: 111110xx followed by four continuation bytes
        emitByte(0xF8 | (c >> 24));
        emitByte(0x80 | ((c >> 18) & 0x3F));
        emitByte(0x80 | ((c >> 12) & 0x3F));
        emitByte(0x80 | ((c >> 6) & 0x3F));
        emitByte(0x80 | (c & 0x3F));
    } else if (c <= 0x7FFFFFFF) {
        // 6 bytes: 1111110x followed by five continuation bytes
        emitByte(0xFC | (c >> 30));
        emitByte(0x80 | ((c >> 24) & 0x3F));
        emitByte(0x80 | ((c >> 18) & 0x3F));
        emitByte(0x80 | ((c >> 12) & 0x3F));
        emitByte(0x80 | ((c >> 6) & 0x3F));
        emitByte(0x80 | (c & 0x3F));
    }
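For readers who want something they can run, here is a compact Python sketch of the same idea. It returns the bytes instead of calling an `emitByte` routine, and it computes the lead-byte marker with a shift rather than spelling out each branch; `utf8_encode` is a hypothetical name. The upper limit of each form is inclusive, so U+007F, U+07FF and U+FFFF take the shorter encoding.

```python
def utf8_encode(c):
    """Return the UTF-8 bytes for code point c, using the six forms in the table."""
    if c < 0 or c > 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if c <= 0x7F:
        return bytes([c])                      # 0xxxxxxx
    # Pick the smallest byte count n whose inclusive upper limit covers c.
    limits = ((2, 0x7FF), (3, 0xFFFF), (4, 0x1FFFFF),
              (5, 0x3FFFFFF), (6, 0x7FFFFFFF))
    n = next(count for count, limit in limits if c <= limit)
    # Lead marker: n one-bits then a zero (0xC0, 0xE0, 0xF0, 0xF8, 0xFC).
    lead_marker = (0xFF00 >> n) & 0xFF
    out = [lead_marker | (c >> (6 * (n - 1)))]
    # Each continuation byte carries the next six bits: 10xxxxxx.
    for shift in range(6 * (n - 2), -1, -6):
        out.append(0x80 | ((c >> shift) & 0x3F))
    return bytes(out)


for cp in (0x5D, 0xC9, 0x13CB, 0x1D122):
    print(f"U+{cp:04X}", utf8_encode(cp).hex())
```

For code points that Python itself accepts, the result can be compared against `chr(cp).encode("utf-8")`.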
I won't describe any others here, but UTF-7 is worth mentioning. If you like the stuff on this page see the IANA Charsets Page. You may also want to check out the UTF page at czybrra.com, which is very complete and well-written (and from which I borrowed the list of UTF-8 advantages).
| Unicode Character | UTF-32 Encoding | UTF-16 Encoding | UTF-8 Encoding |
|---|---|---|---|
| RIGHT SQUARE BRACKET (U+005D) | 00 00 00 5D | 00 5D | 5D |
| LATIN CAPITAL LETTER E WITH ACUTE (U+00C9) | 00 00 00 C9 | 00 C9 | C3 89 |
| CHEROKEE LETTER QUV (U+13CB) | 00 00 13 CB | 13 CB | E1 8F 8B |
| MUSICAL SYMBOL F CLEF (U+1D122) | 00 01 D1 22 | D8 34 DD 22 | F0 9D 84 A2 |
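The rows of this table can be reproduced with Python's built-in codecs. This is just a cross-check sketch: it assumes big-endian byte order for UTF-32 and UTF-16 (matching the table), and `hex_bytes` is a hypothetical formatting helper.

```python
def hex_bytes(data):
    """Render bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


for cp in (0x5D, 0xC9, 0x13CB, 0x1D122):
    ch = chr(cp)
    print(
        f"U+{cp:04X}",
        hex_bytes(ch.encode("utf-32-be")),
        hex_bytes(ch.encode("utf-16-be")),
        hex_bytes(ch.encode("utf-8")),
        sep=" | ",
    )
```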