Character Encodings: Unicode, UTF-16, ASCII, LATIN-1 and UTF-8

Quick Intro to the important character encodings

ASCII: 7-bit codes, normally stored one per 8-bit byte: ‘A’ = 0x41, ‘a’ = 0x61, space = 0x20, CR = 0x0d, etc.  Do “man ascii” on UNIX to see the whole table.  There are 128 codes, values 0 - 127, using 7 of the 8 bits.  The highest code is 0x7f, the DELETE character.

 

Unicode: 21-bit codes, values 0 through 0x10FFFF. Can express chars for any language text, and many other special symbols.

 

UTF-16: 16-bit codes matching real Unicode for all the commonly used characters; the rarer characters above 0xFFFF require a pair of 16-bit codes (a surrogate pair). Used for strings and chars inside Java.

 

UTF-8: Can be thought of as compressed Unicode, often 8 bits but longer for some chars. UTF-8 is XML's default char encoding.  UTF-8 can encode any Unicode character, so has the same power as Unicode to express any language.

 

LATIN-1, also known as ISO-8859-1: 8-bit codes. Uses the ASCII codes 9 (TAB), 10 (LF), 13 (CR), 32-127, plus more codes over 127: 160-255, or in hex, 0xa0-0xff.  Example non-ASCII chars: British pound sign £, 0xa3, copyright sign ©, 0xa9, plus/minus sign ±, 0xb1. This is the traditional default encoding for HTML. A more recent variant of LATIN-1, known as Latin-9 or ISO-8859-15, includes the euro sign as code 0xa4.  To be safe, spell out “euro” in important documents that need to be portable.
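
As a quick check of these codes from Java, we can compare the Latin-1 and UTF-8 encodings of a few of these characters (the class name Latin1Demo is made up for this sketch):

```java
import java.nio.charset.StandardCharsets;

public class Latin1Demo {
    public static void main(String[] args) {
        // pound sign, copyright sign, plus/minus sign
        String s = "\u00a3\u00a9\u00b1";
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);  // one byte per char
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);         // two bytes per char here
        System.out.println("Latin-1 length: " + latin1.length);   // 3
        System.out.println("UTF-8 length: " + utf8.length);       // 6
        // the Latin-1 bytes are just the low bytes of the Unicode values
        System.out.printf("%02x %02x %02x%n", latin1[0], latin1[1], latin1[2]);  // a3 a9 b1
    }
}
```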

 

From the standard http://www.faqs.org/rfcs/rfc3629.html

 

    Unicode value range         |        UTF-8 octet sequence
     hexadecimal range(bits)    |              (binary)
   -----------------------------|------------------------------------
   0000-007F (000000000xxxxxxx) | 0xxxxxxx
   0080-07FF (00000xxxxxxxxxxx) | 110xxxxx 10xxxxxx
   0800-FFFF (xxxxxxxxxxxxxxxx) | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF          | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The last rule is for the codes over FFFF. Unicode code points actually run up to 0x10FFFF (21 bits), but the codes above FFFF are not in common use.
 

By this, we see that ‘a’ = 0x0061 has UTF-8 code 0x61, same as its ASCII code, and similarly with any ASCII char.

 

a (0x0061) is coded by the first rule:   0061 = 00000000 0|1100001, UTF-8 code = 01100001 (8 bits)

© (0x00a9) is coded by the second rule:  00a9 = 00000|00010|101001, UTF-8 code = 11000010 10101001 (16 bits)

€ (0x20ac) is coded by the third rule:   20ac = 0010|000010|101100, UTF-8 code = 11100010 10000010 10101100 (24 bits)

™ (0x2122) is coded by the third rule:   2122 = 0010|000100|100010, UTF-8 code = 11100010 10000100 10100010 (24 bits)
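
These worked examples are easy to check from Java, since String.getBytes can produce the UTF-8 encoding directly (the class name UTF8Check is made up for this sketch):

```java
import java.nio.charset.StandardCharsets;

public class UTF8Check {
    // return the UTF-8 bytes of a string as a hex string, e.g. "e2 82 ac"
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("a      -> " + hex("a"));       // 61
        System.out.println("\u00a9 -> " + hex("\u00a9"));  // c2 a9
        System.out.println("\u20ac -> " + hex("\u20ac"));  // e2 82 ac
        System.out.println("\u2122 -> " + hex("\u2122"));  // e2 84 a2
    }
}
```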

 

We can see that the three cases of UTF-8 codes are recognizable from their first few bits.  Thus we can decode UTF-8 by matching the first byte, then following the patterns to reassemble the bits. 
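
A minimal decoding sketch for the first three rules, assuming well-formed input (the 4-byte rule is left out to keep it short; names are made up for this example):

```java
public class UTF8Decode {
    // Decode the UTF-8 sequence starting at b[i], following the table's patterns.
    static int decode(byte[] b, int i) {
        int b0 = b[i] & 0xff;
        if (b0 < 0x80)                 // 0xxxxxxx: one byte
            return b0;
        if ((b0 & 0xe0) == 0xc0)       // 110xxxxx 10xxxxxx: two bytes
            return ((b0 & 0x1f) << 6) | (b[i + 1] & 0x3f);
        if ((b0 & 0xf0) == 0xe0)       // 1110xxxx 10xxxxxx 10xxxxxx: three bytes
            return ((b0 & 0x0f) << 12) | ((b[i + 1] & 0x3f) << 6) | (b[i + 2] & 0x3f);
        throw new IllegalArgumentException("4-byte case not handled in this sketch");
    }

    public static void main(String[] args) {
        byte[] euro = { (byte) 0xe2, (byte) 0x82, (byte) 0xac };
        System.out.printf("%04x%n", decode(euro, 0));  // 20ac
    }
}
```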

 

UTF-8 can recover from the loss of a byte of data: the first two bits of each byte tell whether it is a continuation byte or the first byte of a code, so you can find where the next code starts after the damaged data and decode from there.
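
A sketch of that recovery idea: test the top two bits of each byte, and skip continuation bytes until a start byte appears (names are made up for this example):

```java
public class UTF8Resync {
    // a continuation byte has the form 10xxxxxx
    static boolean isContinuation(byte b) {
        return (b & 0xc0) == 0x80;
    }

    // skip forward from position i to the next byte that starts a code
    static int nextStart(byte[] b, int i) {
        while (i < b.length && isContinuation(b[i])) i++;
        return i;
    }

    public static void main(String[] args) {
        // the euro's bytes e2 82 ac with the first byte lost, followed by 'a'
        byte[] damaged = { (byte) 0x82, (byte) 0xac, 0x61 };
        System.out.println(nextStart(damaged, 0));  // 2: decoding can resume at 'a'
    }
}
```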

 

UTF-8 preserves sort order across conversion to/from Unicode. So if a set of Unicode strings is sorted, so is the corresponding set of UTF-8 strings, and vice versa.
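
We can spot-check this in Java for characters in the 16-bit range, comparing code-point order (String.compareTo) against unsigned byte-by-byte order of the UTF-8 encodings (the class name UTF8Order is made up for this sketch):

```java
import java.nio.charset.StandardCharsets;

public class UTF8Order {
    // compare byte arrays as unsigned values, byte by byte
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String s = "a\u00a9";  // a, copyright sign
        String t = "a\u20ac";  // a, euro sign: later in code order
        boolean sameOrder =
            Integer.signum(s.compareTo(t)) ==
            Integer.signum(compareBytes(s.getBytes(StandardCharsets.UTF_8),
                                        t.getBytes(StandardCharsets.UTF_8)));
        System.out.println(sameOrder);  // true: both orders agree
    }
}
```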

 

Pretty clever scheme!

 

 

 

 

More details

In the spectrum of Unicode codes, the ASCII chars provide the first codes: ‘A’ = 0x0041, ‘a’ = 0x0061, space = 0x0020, CR = 0x000d, etc.  So the Java String “b a” is coded in 3 UTF-16 codes: 0062 0020 0061, a total of 6 bytes.  Above the ASCII codes in the Unicode spectrum come the LATIN-1 codes: British pound sign = 0x00a3, copyright sign = 0x00a9, etc., then the rest of the codes, those over 0xff. The euro sign (not in the true LATIN-1) is Unicode 0x20ac.
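
A one-minute check of the “b a” claim, printing each 16-bit UTF-16 code unit (the class name is made up for this sketch):

```java
public class UTF16Units {
    public static void main(String[] args) {
        String s = "b a";
        System.out.println(s.length());  // 3 UTF-16 code units
        for (int i = 0; i < s.length(); i++)
            System.out.printf("%04x ", (int) s.charAt(i));  // 0062 0020 0061
        System.out.println();
    }
}
```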

 

What does a Unicode above 0xff stand for?  Answer: all sorts of characters used in other languages, plus some symbols used in English that don’t appear in ASCII, like ™ (0x2122), the euro, and so on.  To see them in MS Word, pull down the Insert menu, select Symbol, and a whole matrix of symbols shows up.  Click on one of them and see the code at the bottom of the window.

 

It’s easy to convert ASCII to UTF-16: just add a first byte of 0x00.  Java does this for us in classes InputStreamReader and Scanner.  System.out.println(s) uses class PrintStream to convert Unicode (ASCII subset) to ASCII by dropping the 0x00 byte.  Using the appropriate classes, we can output UTF-8 from a Java program, or Latin-1, as well as ASCII. See the program below. Also, we can get Java to read UTF-8 or Latin-1 or ASCII.  See the Javadocs for InputStreamReader, Scanner, and OutputStreamWriter.
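
Here is a sketch of the reading direction, a companion to the output program below; it writes a small UTF-8 file first so it is self-contained (the file name foo8.txt is made up for this example):

```java
import java.io.*;

public class TestUTF8Input {
    public static void main(String[] args) throws Exception {
        // write a small UTF-8 file first, so the sketch is self-contained
        Writer w = new OutputStreamWriter(new FileOutputStream("foo8.txt"), "UTF-8");
        w.write("abcd\u00a3\u2122\n");
        w.close();

        // read it back: InputStreamReader converts UTF-8 bytes to UTF-16 chars
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("foo8.txt"), "UTF-8"));
        String line = in.readLine();
        in.close();
        System.out.println(line.length());  // 6 chars: abcd + pound + TM
        System.out.printf("%04x %04x%n",
                (int) line.charAt(4), (int) line.charAt(5));  // 00a3 2122
    }
}
```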

 

For XML, and some other documents, we want to use the flexibility of full Unicode, yet not pay the price of 16 bits/character in our files.  That’s where UTF-8 comes in.  It is really a kind of compression of Unicode, at least for ASCII-mostly documents.

 

// Show how to output Unicode from Java to a file in UTF-8 encoding
import java.io.*;

public class TestUTF8Output
{
    public static void main (String[] args)
    {
        // set up test string with various chars
        // start with ASCII, then Brit. pound sign, then TM sign:
        String s = "abcd\u00a3\u2122\n";  // now in Unicode
        try {
            OutputStreamWriter out2 = new OutputStreamWriter(
                            new FileOutputStream("foo.txt"), "UTF-8");
            // make a PrintWriter, so we can use print, etc.
            PrintWriter out = new PrintWriter(out2);
            out.print(s);  // print s in UTF-8
            out.close();
            // remind user where file output is
            System.out.println("Look in file foo.txt for UTF-8 output.");
            System.out.println("Expect abcd each in one byte.");
            System.out.println("Then Unicode 00a3 in two bytes.");
            System.out.println("Then Unicode 2122 in three bytes.");
            System.out.println("Finally \\n in one byte.");
            System.out.println("On UNIX, see them with \"od -x foo.txt\"");
            System.out.println("On Windows, read it with Notepad or Word.");
        } catch (Exception e) {
            System.err.println("exception: " + e);
        }
    }
}

dbs2(79)% java TestUTF8Output
Look in file foo.txt for UTF-8 output.
Expect abcd each in one byte.
Then Unicode 00a3 in two bytes.
Then Unicode 2122 in three bytes.
Finally \n in one byte.
On UNIX, see them with "od -x foo.txt"
dbs2(80)% od -x foo.txt
0000000 6162 6364 c2a3 e284 a20a
0000012
dbs2(81)% od -c foo.txt
0000000   a   b   c   d 302 243 342 204 242  \n
0000012