Character Encoding

How do we encode characters into bit strings? To answer this we need to know what a character is, what a character set is, and what a character encoding is.

Characters

A character is an abstract symbol. A character has a name. Examples:

   PLUS SIGN
   CYRILLIC SMALL LETTER TSE
   BLACK CHESS KNIGHT
   MUSICAL SYMBOL FERMATA BELOW

Do not confuse a character with a glyph, which is a picture of a character. Two or more characters can share the same glyph (e.g. LATIN CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA), and one character can have many glyphs (think fonts).
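The distinction can be made concrete with a small sketch. It uses Python's standard unicodedata module; the variable and function names are only illustrative. Two characters that most fonts draw identically still have different names and compare as unequal.

```python
import unicodedata

# Two distinct characters that most fonts draw with the same glyph.
latin_a = "\u0041"      # LATIN CAPITAL LETTER A
greek_alpha = "\u0391"  # GREEK CAPITAL LETTER ALPHA

def describe(ch):
    """Return the name of a one-character string."""
    return unicodedata.name(ch)

print(describe(latin_a))
print(describe(greek_alpha))
print(latin_a == greek_alpha)  # the characters differ even if the glyphs match

# Going the other way: find a character by its name.
knight = unicodedata.lookup("BLACK CHESS KNIGHT")
```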

Character Sets

A character set has two parts: (1) a repertoire, which is a set of characters, and (2) a code position mapping, which is a function mapping non-negative integers to characters in the repertoire. When an integer i maps to a character c we say i is the codepoint of c.

UCS and Unicode

An example of a character set is the Universal Character Set (UCS), which happens to be identical to another character set called Unicode. Here is part of UCS:

     25 PERCENT SIGN
     2B PLUS SIGN
     54 LATIN CAPITAL LETTER T
     5D RIGHT SQUARE BRACKET
     B0 DEGREE SIGN
     C9 LATIN CAPITAL LETTER E WITH ACUTE
    2AD LATIN LETTER BIDENTAL PERCUSSIVE
    39B GREEK CAPITAL LETTER LAMDA
    446 CYRILLIC SMALL LETTER TSE
    543 ARMENIAN CAPITAL LETTER CHEH
    5E6 HEBREW LETTER TSADI
    635 ARABIC LETTER SAD
    784 THAANA LETTER BAA
    94A DEVANAGARI VOWEL SIGN SHORT O
    9D7 BENGALI AU LENGTH MARK
    BEF TAMIL DIGIT NINE
    D93 SINHALA LETTER AIYANNA
    F0A TIBETAN MARK BKA- SHOG YIG MGO
   11C7 HANGUL JONGSEONG NIEUN-SIOS
   1293 ETHIOPIC SYLLABLE NAA
   13CB CHEROKEE LETTER QUV
   2023 TRIANGULAR BULLET
   20A4 LIRA SIGN
   2105 CARE OF
   213A ROTATED CAPITAL Q
   21B7 CLOCKWISE TOP SEMICIRCLE ARROW
   2226 NOT PARALLEL TO
   2234 THEREFORE
   265E BLACK CHESS KNIGHT
  1D111 MUSICAL SYMBOL FERMATA BELOW
  1D122 MUSICAL SYMBOL F CLEF

Unicode code points are traditionally written with U+ followed by four to six hex digits (e.g. U+00C9, U+1D122).
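As a rough illustration of the code position mapping and the U+ notation, the sketch below leans on Python's built-in ord and chr, which convert between a character and its Unicode codepoint. The helper names u_plus and name_at are made up for this example.

```python
import unicodedata

def u_plus(ch):
    """Format the codepoint of a one-character string as U+ and at least four hex digits."""
    return f"U+{ord(ch):04X}"

def name_at(codepoint):
    """Apply the code position mapping: integer -> character, then report its name."""
    return unicodedata.name(chr(codepoint))

print(u_plus("\u00c9"))       # four digits, zero padded
print(u_plus(chr(0x1D122)))   # five digits, no padding needed
print(name_at(0x446))
```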

The entire character set is described in the two files http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and http://www.unicode.org/Public/UNIDATA/Unihan.txt. The codepoints are not assigned haphazardly: see http://www.unicode.org/Public/UNIDATA/Blocks.txt.

ISO8859-1

ISO8859-1 is a character set that is exactly equivalent to the first 256 mappings of Unicode. Obviously it doesn't have enough characters.
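One way to check the "first 256 mappings" claim is to decode every possible byte value and compare against the Unicode codepoints. This is a sketch that assumes Python's "latin-1" codec is a faithful implementation of ISO8859-1; the function names are illustrative.

```python
def latin1_matches_unicode():
    """True if byte value b always decodes to the character with codepoint b."""
    all_bytes = bytes(range(256))
    decoded = all_bytes.decode("latin-1")
    return all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

def fits_in_latin1(text):
    """True if every character of text is in the ISO8859-1 repertoire."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

print(latin1_matches_unicode())
print(fits_in_latin1("\u00c9"))   # LATIN CAPITAL LETTER E WITH ACUTE
print(fits_in_latin1("\u0446"))   # CYRILLIC SMALL LETTER TSE
```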

ISO8859-2 through ISO8859-16

These 14 charsets (there is no ISO8859-12) also have 256-character repertoires. They all share the same characters in the first 128 positions, but differ in the next 128. See http://www.unicode.org/Public/MAPPINGS/ISO8859/.

windows-1252

This character set, also known as CP1252, has a repertoire of 256 characters and can be found at http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT. It is very close to ISO8859-1: the two differ only in positions 80 through 9F, where ISO8859-1 has control characters and windows-1252 has printable ones such as curly quotes and the euro sign. Be careful with this set! Users of Windows systems often unknowingly produce documents with this character set, then forget to specify it when making these documents available on the web or transporting them via other protocols that tend to default to Unicode. The result is annoying: those characters show up as garbage on the receiving end. It's best to avoid this character set.
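The failure mode is easy to reproduce. The sketch below assumes the receiver decodes as UTF-8 and substitutes a replacement character for invalid bytes; other receivers may behave differently. The sample text is made up.

```python
# The word "hi" wrapped in curly double quotes (U+201C and U+201D).
text = "\u201chi\u201d"

# What a windows-1252 system writes to disk: one byte per character.
raw = text.encode("cp1252")

# What a receiver sees if it assumes UTF-8 and replaces invalid bytes.
as_utf8 = raw.decode("utf-8", errors="replace")

# What a receiver sees if it assumes ISO8859-1: invisible control characters.
as_latin1 = raw.decode("latin-1")

print(raw)
print(as_utf8)
print(repr(as_latin1))
```

Only decoding with the character set that was actually used (here cp1252) recovers the original text.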

ASCII

ASCII is a character set that is exactly equivalent to the first 128 mappings of Unicode. Obviously it doesn't have enough characters. However it is commonly used! It's a good "lowest common denominator" and many Internet protocols require it!

Character Encodings

A character encoding specifies how a character (or character string) is encoded in a bit string. There are many, many encodings of Unicode. The most important are UTF-32, UTF-16 and UTF-8.

UTF-32

This is the simplest. Just encode each character in 32 bits. The encoding of a character is simply its code point written as a 32-bit integer! Couldn't be more straightforward. Of course, good luck convincing people to actually use four bytes per character.
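A minimal sketch of UTF-32, assuming big-endian byte order (the text above does not fix a byte order, so that choice is an assumption). The function name utf32_be is illustrative.

```python
import struct

def utf32_be(codepoint):
    """Encode one code point as four bytes, most significant byte first."""
    return struct.pack(">I", codepoint)

for cp in (0x5D, 0xC9, 0x13CB, 0x1D122):
    print(f"U+{cp:04X}", utf32_be(cp).hex())
```

The result can be compared with Python's own "utf-32-be" codec as a sanity check.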

UTF-16

In UTF-16 some characters are encoded in 16 bits and some in 32 bits.

  Character Range           Bit Encoding
  U+0000 ... U+FFFF         xxxxxxxx xxxxxxxx
  U+10000 ... U+10FFFF      let y = X - 0x10000 in
                            110110yy yyyyyyyy 110111yy yyyyyyyy

UTF-16 simply cannot encode codepoints beyond U+10FFFF. So far this is not a problem. Note also that the existence of UTF-16, and its blessing by the Unicode Consortium means that U+D800 through U+DFFF cannot be legal characters. Hack!?
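The table translates fairly directly into code. The following is a sketch, not a production encoder: the function names are made up, and the byte-level helper assumes big-endian order.

```python
def utf16_units(cp):
    """Return the list of 16-bit code units that encode one code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("UTF-16 cannot encode code points beyond U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("U+D800 through U+DFFF are reserved for UTF-16 itself")
    if cp <= 0xFFFF:
        return [cp]                      # one unit: xxxxxxxx xxxxxxxx
    y = cp - 0x10000                     # a 20-bit value
    return [0xD800 | (y >> 10),          # 110110yy yyyyyyyy (high 10 bits of y)
            0xDC00 | (y & 0x3FF)]        # 110111yy yyyyyyyy (low 10 bits of y)

def utf16_be_bytes(cp):
    """Serialize the code units as bytes, most significant byte first."""
    return b"".join(unit.to_bytes(2, "big") for unit in utf16_units(cp))

print([hex(u) for u in utf16_units(0x13CB)])
print([hex(u) for u in utf16_units(0x1D122)])
```

The range checks show why U+D800 through U+DFFF had to be given up: those 16-bit values are what mark a unit as half of a 32-bit pair.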

UTF-8

Here's another variable length encoding.

  Character Range             Bits   Bit Encoding
  U+0000 ... U+007F           7      0xxxxxxx
  U+0080 ... U+07FF           11     110xxxxx 10xxxxxx
  U+0800 ... U+FFFF           16     1110xxxx 10xxxxxx 10xxxxxx
  U+10000 ... U+1FFFFF        21     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  U+200000 ... U+3FFFFFF      26     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
  U+4000000 ... U+7FFFFFFF    31     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

UTF-8 rocks. The number of advantages it has is stunning.

To stream out the UTF-8 bytes from an int, use:

  // c is a code point in the range 0 ... 0x7FFFFFFF.
  // Each range check is inclusive, matching the upper bounds in the table.
  if (c <= 0x7F) {
      emitByte(c);                            // 0xxxxxxx
  } else if (c <= 0x7FF) {
      emitByte(0xC0 | (c >> 6));              // 110xxxxx
      emitByte(0x80 | (c & 0x3F));            // 10xxxxxx
  } else if (c <= 0xFFFF) {
      emitByte(0xE0 | (c >> 12));             // 1110xxxx
      emitByte(0x80 | ((c >> 6) & 0x3F));
      emitByte(0x80 | (c & 0x3F));
  } else if (c <= 0x1FFFFF) {
      emitByte(0xF0 | (c >> 18));             // 11110xxx
      emitByte(0x80 | ((c >> 12) & 0x3F));
      emitByte(0x80 | ((c >> 6) & 0x3F));
      emitByte(0x80 | (c & 0x3F));
  } else if (c <= 0x3FFFFFF) {
      emitByte(0xF8 | (c >> 24));             // 111110xx
      emitByte(0x80 | ((c >> 18) & 0x3F));
      emitByte(0x80 | ((c >> 12) & 0x3F));
      emitByte(0x80 | ((c >> 6) & 0x3F));
      emitByte(0x80 | (c & 0x3F));
  } else if (c <= 0x7FFFFFFF) {
      emitByte(0xFC | (c >> 30));             // 1111110x
      emitByte(0x80 | ((c >> 24) & 0x3F));
      emitByte(0x80 | ((c >> 18) & 0x3F));
      emitByte(0x80 | ((c >> 12) & 0x3F));
      emitByte(0x80 | ((c >> 6) & 0x3F));
      emitByte(0x80 | (c & 0x3F));
  }
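Going the other direction is a matter of reading the lead byte. The sketch below decodes one code point according to the bit patterns in the UTF-8 table. It is deliberately minimal: the function name is made up, and it does not reject overlong forms or code points that a strict decoder would refuse.

```python
def utf8_decode_one(data, i=0):
    """Decode one code point starting at data[i]; return (code_point, next_index)."""
    lead = data[i]
    if lead < 0x80:                      # 0xxxxxxx: a single byte
        return lead, i + 1
    # The number of leading 1 bits in the lead byte is the sequence length.
    n = 0
    while n < 8 and lead & (0x80 >> n):
        n += 1
    if n == 1 or n > 6:                  # 10xxxxxx, 11111110 and 11111111 cannot lead
        raise ValueError("invalid lead byte")
    if i + n > len(data):
        raise ValueError("truncated sequence")
    cp = lead & (0x7F >> n)              # payload bits of the lead byte
    for b in data[i + 1:i + n]:
        if b & 0xC0 != 0x80:             # every trailing byte must be 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, i + n

print(utf8_decode_one(bytes.fromhex("C389")))
print(utf8_decode_one(bytes.fromhex("F09D84A2")))
```

Because the lead byte alone says how long the sequence is, a decoder never has to look ahead or back to find character boundaries.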

Others

I won't describe any others here, but UTF-7 is worth mentioning. If you like the stuff on this page, see the IANA Charsets Page. You may also want to check out the UTF page at czyborra.com, which is very complete and well-written (and from which I borrowed the list of UTF-8 advantages).

Examples

  Unicode Character                            UTF-32 Encoding   UTF-16 Encoding   UTF-8 Encoding
  RIGHT SQUARE BRACKET (U+005D)                00 00 00 5D       00 5D             5D
  LATIN CAPITAL LETTER E WITH ACUTE (U+00C9)   00 00 00 C9       00 C9             C3 89
  CHEROKEE LETTER QUV (U+13CB)                 00 00 13 CB       13 CB             E1 8F 8B
  MUSICAL SYMBOL F CLEF (U+1D122)              00 01 D1 22       D8 34 DD 22       F0 9D 84 A2
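These rows can be reproduced with Python's built-in codecs. The sketch assumes big-endian byte order for UTF-32 and UTF-16, which is what the byte sequences above show; the names SAMPLES and hex_bytes are illustrative.

```python
import unicodedata

SAMPLES = [0x005D, 0x00C9, 0x13CB, 0x1D122]

def hex_bytes(codepoint, encoding):
    """Encode one code point and format the bytes as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in chr(codepoint).encode(encoding))

for cp in SAMPLES:
    print(unicodedata.name(chr(cp)), f"(U+{cp:04X})")
    for encoding in ("utf-32-be", "utf-16-be", "utf-8"):
        print("   ", encoding, hex_bytes(cp, encoding))
```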