Character Encodings: Unicode, UTF-16, ASCII, LATIN-1 and UTF-8

Quick Intro to the important character encodings

ASCII: 7-bit codes, normally stored one per 8-bit byte. ‘A’ = 0x41, ‘a’ = 0x61, space = 0x20, CR = 0x0d, etc.  Do “man ascii” on UNIX to see the whole table.  There are 128 codes, values 0 - 127, using only 7 of the byte’s 8 bits.  The highest code is 0x7f, for the DELETE character.

 

Unicode: 21-bit codes (code points run from 0 up to 0x10FFFF). Can express any language’s text, and many other special symbols.

 

UTF-16: 16-bit codes matching the Unicode code point directly for everything except the rarely used characters above 0xFFFF. Used for strings and chars inside Java.  Java uses a pair of 16-bit chars (a “surrogate pair”) for each character that doesn’t fit in one 16-bit code.
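
Here is a quick sketch of that in Java (you can paste it into jshell; the musical G clef, U+1D11E, is just an arbitrary character above 0xFFFF chosen for illustration):

    // One character above 0xFFFF takes two 16-bit chars (a surrogate pair) in Java.
    String clef = new String(Character.toChars(0x1D11E));      // musical G clef
    clef.length()                          // ==> 2  (two 16-bit code units)
    clef.codePointCount(0, clef.length())  // ==> 1  (one real character)
    String.format("%04x %04x", (int) clef.charAt(0), (int) clef.charAt(1))
                                           // ==> "d834 dd1e", the surrogate pair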

 

UTF-8: Can be thought of as compressed Unicode: usually 8 bits per character, but longer (2 to 4 bytes) for some chars. XML data is usually written in UTF-8, and UTF-8 is XML’s default character encoding.  It can encode any Unicode character, so it has the same power as Unicode to express any language.

 

LATIN-1, also known as ISO-8859-1 (its close variant ISO-8859-15 adds the euro sign): 8-bit codes. It uses the ASCII codes 9 (TAB), 10 (LF), 13 (CR), 32-127, plus more codes over 127: 160-255, or in hex, 0xa0-0xff.  Example non-ASCII chars: British pound sign £, 0xa3; copyright sign ©, 0xa9; plus/minus sign ±, 0xb1.  This is the traditional default encoding for HTML.
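
A quick way to see such codes from Java is to re-encode a string with getBytes and print the bytes in hex (a jshell sketch):

    import java.nio.charset.StandardCharsets;

    String s = "\u00a3\u00a9\u00b1";   // pound, copyright, plus/minus
    // LATIN-1: one byte per character, exactly the codes listed above.
    for (byte b : s.getBytes(StandardCharsets.ISO_8859_1)) System.out.printf("%02x ", b & 0xff);
    System.out.println();              // prints: a3 a9 b1
    // UTF-8: two bytes for each of these, since they are all above 0x7f.
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) System.out.printf("%02x ", b & 0xff);
    System.out.println();              // prints: c2 a3 c2 a9 c2 b1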

 

More details

In the spectrum of Unicode codes, the ASCII chars provide the first codes: ‘A’ = 0x0041, ‘a’ = 0x0061, space = 0x0020, CR = 0x000d, etc.  So the Java String “b a” is coded in 3 UTF-16 codes: 0062 0020 0061, a total of 6 bytes.  Above ASCII in the Unicode spectrum come the LATIN-1 codes: British pound sign = 0x00a3, copyright sign = 0x00a9, etc., then the rest of the codes, those over 0xff.
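
You can check the 6-byte figure in jshell (UTF-16BE is the big-endian form, used here so that no byte-order mark gets added):

    import java.nio.charset.StandardCharsets;

    byte[] utf16 = "b a".getBytes(StandardCharsets.UTF_16BE);
    utf16.length                                   // ==> 6
    for (byte b : utf16) System.out.printf("%02x ", b & 0xff);
    // prints: 00 62 00 20 00 61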

 

What does a Unicode code above 0xff stand for?  Answer: all sorts of characters used in other languages, plus some symbols used in English that don’t appear in ASCII, like ™ (0x2122), and so on.  To see them in MS Word, pull down the Insert menu, select Symbol, and a whole matrix of symbols shows up.  Click on one of them and the code appears at the bottom of the window.

 

It’s easy to convert ASCII to UTF-16: just add a first byte of 0x00.  Java does this for us in class InputStreamReader or Scanner.  System.out.println(s) uses class PrintStream to convert Unicode (the ASCII subset of it) back to ASCII by dropping that 0x00 byte.
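
Here is a small sketch of that round trip (the class name and the sample bytes are made up for illustration; InputStreamReader widens each ASCII byte to a 16-bit char, and the PrintStream behind System.out narrows it back):

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class AsciiRoundTrip {
        public static void main(String[] args) throws IOException {
            byte[] asciiBytes = { 0x62, 0x20, 0x61 };   // "b a" in ASCII

            // InputStreamReader: each ASCII byte becomes a 16-bit char (0x00 prepended).
            Reader in = new InputStreamReader(new ByteArrayInputStream(asciiBytes),
                                              StandardCharsets.US_ASCII);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) sb.append((char) c);
            System.out.printf("chars: %04x %04x %04x%n",
                              (int) sb.charAt(0), (int) sb.charAt(1), (int) sb.charAt(2));
            // prints: chars: 0062 0020 0061

            // System.out (a PrintStream) drops the 0x00 byte again on output,
            // assuming an ASCII-compatible platform encoding.
            System.out.println(sb);                     // prints: b a
        }
    }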

 

For XML, and some other documents, we want to use the flexibility of full Unicode, yet not pay the price of 16 bits/character in our files.  That’s where UTF-8 comes in.  It is really a kind of compression of Unicode, at least for ASCII-mostly documents.
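
For example (a jshell sketch with a made-up, mostly-ASCII XML snippet), the UTF-8 form is roughly half the size of the UTF-16 form:

    import java.nio.charset.StandardCharsets;

    String xml = "<note to=\"Ana\" from=\"Bob\">caf\u00e9 at 5 \u00b1 1 pm</note>";
    xml.getBytes(StandardCharsets.UTF_8).length     // ==> 51 (only the 2 non-ASCII chars take 2 bytes)
    xml.getBytes(StandardCharsets.UTF_16BE).length  // ==> 98 (2 bytes for every one of the 49 chars)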

 

From the standard, RFC 3629 (http://www.faqs.org/rfcs/rfc3629.html):


 

    Unicode value range         |        UTF-8 octet sequence
     hexadecimal range(bits)    |              (binary)
   -----------------------------|------------------------------------
   0000-007F (000000000xxxxxxx) | 0xxxxxxx
   0080-07FF (00000xxxxxxxxxxx) | 110xxxxx 10xxxxxx
   0800-FFFF (xxxxxxxxxxxxxxxx) | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF          | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

There is an additional rule (the last row of the table) for the codes over FFFF.  (Unicode code points actually run up to 0x10FFFF, i.e., 21 bits, but the codes above FFFF are not in common use.)
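
Here is a sketch of an encoder written straight from the table (illustrative only: in real Java code you would just call getBytes(StandardCharsets.UTF_8), and the method name toUtf8 is made up).  It can be pasted into jshell or into any class:

    // Encode one Unicode code point into UTF-8 bytes, following the table above.
    public static byte[] toUtf8(int cp) {
        if (cp <= 0x7F) {                      // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {              // 110xxxxx 10xxxxxx
            return new byte[] { (byte) (0xC0 | (cp >> 6)),
                                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp <= 0xFFFF) {             // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] { (byte) (0xE0 | (cp >> 12)),
                                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] { (byte) (0xF0 | (cp >> 18)),
                                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

For instance, toUtf8(0x2122) gives the bytes 0xE2 0x84 0xA2, matching the ™ example worked out below.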
 

By this, we see that ‘a’ = 0x0061 has UTF-8 code 0x61, same as its ASCII code, and similarly with any ASCII char.

 

a (0x0061)  is coded by the first rule: 0061 = 00000000 0|1100001, UTF-8 code = 01100001 (8 bits)

© (0x00a9) is coded by the second rule:  00a9 = 00000|000 10|101001, UTF-8 code = 11000010 10101001  (16 bits)

™ (0x2122) is coded by the third rule: 2122 = 0010|0001 00|100010, UTF-8 code =  11100010 10000100 10100010 (24 bits)
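
Java’s built-in encoder agrees with all three (a jshell check using getBytes):

    import java.nio.charset.StandardCharsets;

    for (String s : new String[] { "a", "\u00a9", "\u2122" }) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) System.out.printf("%02x ", b & 0xff);
        System.out.println();
    }
    // prints: 61
    //         c2 a9
    //         e2 84 a2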

 

Footnote: The additional rule covers the Unicode characters whose values lie above 0xffff, that is, overflow 16 bits in their representation.  UTF-16 handles these outliers by reserving some 16-bit codes for special use as prefixes, so that each such character is written as a pair of 16-bit codes (a “surrogate pair”).  Thus to convert properly from UTF-16 to UTF-8, you need to have special cases for these prefix characters.  See http://en.wikipedia.org/wiki/UTF-8 if you are interested.
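
For the curious, the special case is only a little arithmetic (a sketch; the method name combineSurrogates is made up, and Java already provides Character.toCodePoint(hi, lo) for this):

    // Combine a UTF-16 surrogate pair back into one code point.
    // High surrogates are 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF.
    public static int combineSurrogates(char hi, char lo) {
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
    // e.g. combineSurrogates('\uD834', '\uDD1E') ==> 0x1D11E, which the
    // 4-byte UTF-8 rule (11110xxx 10xxxxxx ...) then encodes.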

 

We can see that the different cases of UTF-8 codes are recognizable from the first few bits of their first byte.  Thus we can decode UTF-8 by examining the first byte, then following the patterns to reassemble the bits.
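
A decoding sketch along those lines (illustrative only: the method name decodeUtf8At is made up, and there is no error checking for malformed or truncated input):

    // Decode the UTF-8 sequence starting at bytes[i] and return its code point.
    // The first byte's leading bits pick the pattern; each continuation byte
    // contributes its low 6 bits.
    public static int decodeUtf8At(byte[] bytes, int i) {
        int b0 = bytes[i] & 0xff;
        if (b0 < 0x80) return b0;                                    // 0xxxxxxx
        if (b0 < 0xE0) return ((b0 & 0x1F) << 6)                     // 110xxxxx
                            | (bytes[i + 1] & 0x3F);
        if (b0 < 0xF0) return ((b0 & 0x0F) << 12)                    // 1110xxxx
                            | ((bytes[i + 1] & 0x3F) << 6)
                            | (bytes[i + 2] & 0x3F);
        return ((b0 & 0x07) << 18)                                   // 11110xxx
             | ((bytes[i + 1] & 0x3F) << 12)
             | ((bytes[i + 2] & 0x3F) << 6)
             | (bytes[i + 3] & 0x3F);
    }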

 

UTF-8 can recover from the loss of a byte of data, in the sense that the top two bits of each byte tell whether it is a continuation byte (10xxxxxx) or the first byte of a character, so you can pick out where the next character starts after the damaged data and decode from there.
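
In code, resynchronizing is just a matter of skipping continuation bytes (a sketch; the method name nextCharStart is made up):

    // After damage at position i, advance to the start of the next character:
    // continuation bytes all look like 10xxxxxx (top two bits are 10).
    public static int nextCharStart(byte[] bytes, int i) {
        while (i < bytes.length && (bytes[i] & 0xC0) == 0x80) i++;
        return i;   // index of the next first byte (or the end of the array)
    }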

 

UTF-8 preserves sort order across conversion to/from Unicode: if a set of strings is sorted by Unicode code point, the corresponding UTF-8 strings are sorted by their unsigned byte values, and vice versa.
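
A small jshell check of that claim, comparing two sample strings both by code point and by unsigned UTF-8 bytes (Arrays.compareUnsigned needs Java 9 or later):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    String s1 = "\u00a9";   // ©, code point 0x00a9, UTF-8 bytes c2 a9
    String s2 = "\u2122";   // ™, code point 0x2122, UTF-8 bytes e2 84 a2
    Integer.signum(s1.codePointAt(0) - s2.codePointAt(0))    // ==> -1
    Integer.signum(Arrays.compareUnsigned(s1.getBytes(StandardCharsets.UTF_8),
                                          s2.getBytes(StandardCharsets.UTF_8)))
                                                             // ==> -1
    // Both comparisons agree: © sorts before ™ either way.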

 

Pretty clever scheme!