ASCII: 7-bit codes, stored one per 8-bit byte: ‘A’ = 0x41, ‘a’ = 0x61, space = 0x20, CR = 0x0d, etc. Do “man ascii” on UNIX to see the whole table. There are 128 codes, values 0 - 127, using 7 of the 8 bits. The highest code is 0x7f, for the DELETE character.
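We can check a code from Java itself by casting a char to int (a small sketch, separate from the program below):

// Sketch: a char cast to int gives its code; printed here in hex.
System.out.printf("%x %x %x %x%n", (int)'A', (int)'a', (int)' ', (int)'\r');
// prints: 41 61 20 d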
Unicode: 21-bit codes, values up to 0x10FFFF. Can express chars for any language text, and many other special symbols.
UTF-16: 16-bit codes matching real Unicode except for a few special characters and some obscure languages, whose codes lie above 0xFFFF and so require a pair of 16-bit codes (a “surrogate pair”). Used for strings and chars inside Java.
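For instance, a code over 0xFFFF takes two Java chars (a sketch; the musical G clef, code 0x1D11E, is just one example of such a char):

// Sketch: a code over 0xFFFF needs a pair of Java chars (a surrogate pair).
String euro = "\u20ac";                                // one 16-bit code
String clef = new String(Character.toChars(0x1D11E));  // G clef, over 0xFFFF
System.out.println(euro.length());                     // prints 1
System.out.println(clef.length());                     // prints 2: surrogate pair
System.out.println(clef.codePointCount(0, clef.length())); // prints 1 code point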
UTF-8: Can be thought of as compressed Unicode, often 8 bits but longer for some chars. UTF-8 is XML's default char encoding. UTF-8 can encode any Unicode character, so it has the same power as Unicode to express any language.
LATIN-1, also known as ISO-8859-1: 8-bit codes. Uses the ASCII codes 9 (TAB), 10 (LF), 13 (CR), 32-127, plus more codes over 127: 160-255, or in hex, 0xa0-0xff. Example non-ASCII chars: British pound sign £, 0xa3, copyright sign ©, 0xa9, plus/minus sign ±, 0xb1. This is the traditional default encoding for HTML.
A more recent version of LATIN-1, known as Latin-9 or ISO-8859-15, includes the euro sign as code 0xa4. To be safe, spell out “euro” in important documents that need to be portable.
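A small sketch of Latin-1 output from Java; note what happens to the euro:

// Sketch: pound, copyright, and plus/minus each fit in one Latin-1 byte;
// the euro is not in true Latin-1, so Java's encoder substitutes '?' (0x3f).
byte[] b = "\u00a3\u00a9\u00b1\u20ac"
        .getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
// b = { (byte)0xa3, (byte)0xa9, (byte)0xb1, 0x3f }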
From the standard http://www.faqs.org/rfcs/rfc3629.html

Unicode value range           | UTF-8 octet sequence
hexadecimal range (bits)      | (binary)
------------------------------|------------------------------------
0000-007F (00000000 0xxxxxxx) | 0xxxxxxx
0080-07FF (00000xxx xxxxxxxx) | 110xxxxx 10xxxxxx
0800-FFFF (xxxxxxxx xxxxxxxx) | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF           | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The last rule is for the codes over FFFF. Unicode codes actually run up to 0x10FFFF, 21 bits, but the codes above FFFF are not in common use.
By this, we see that ‘a’ = 0x0061 has UTF-8 code 0x61, the same as its ASCII code, and similarly for any ASCII char.

a (0x0061) is coded by the first rule: 0061 = 1100001 (low 7 bits), UTF-8 code = 01100001 (8 bits)

© (0x00a9) is coded by the second rule: 00a9 = 00010|101001 (5+6 bits), UTF-8 code = 11000010 10101001 (16 bits)

€ (0x20ac) is coded by the third rule: 20ac = 0010|000010|101100 (4+6+6 bits), UTF-8 code = 11100010 10000010 10101100 (24 bits)

™ (0x2122) is coded by the third rule: 2122 = 0010|000100|100010 (4+6+6 bits), UTF-8 code = 11100010 10000100 10100010 (24 bits)
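We can check these worked examples against Java's own encoder (a sketch; the class name ShowUTF8 is made up for this note):

// Sketch: let String.getBytes apply the UTF-8 rules, and dump the bytes in hex.
public class ShowUTF8
{
    public static void main (String[] args) throws Exception
    {
        String[] samples = { "a", "\u00a9", "\u20ac", "\u2122" };
        for (String s : samples) {
            System.out.print(s + " =");
            for (byte b : s.getBytes("UTF-8"))
                System.out.printf(" %02x", b);
            System.out.println();
        }
        // prints: a = 61, © = c2 a9, € = e2 82 ac, ™ = e2 84 a2
    }
}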
We can see that the four cases of UTF-8 codes are recognizable from their first few bits (0, 110, 1110, 11110). Thus we can decode UTF-8 by matching the first byte, then following the patterns to reassemble the bits.
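Here is a sketch of that decoding step (the helper name decodeAt is made up; it assumes well-formed input, which a real decoder must check):

// Sketch: decode the UTF-8 code starting at byte i, following the table above.
// Assumes well-formed input; does not validate continuation bytes.
static int decodeAt(byte[] b, int i)
{
    int b0 = b[i] & 0xff;
    if (b0 < 0x80)                              // 0xxxxxxx: 1 byte
        return b0;
    if (b0 < 0xe0)                              // 110xxxxx: 2 bytes
        return ((b0 & 0x1f) << 6) | (b[i+1] & 0x3f);
    if (b0 < 0xf0)                              // 1110xxxx: 3 bytes
        return ((b0 & 0x0f) << 12) | ((b[i+1] & 0x3f) << 6) | (b[i+2] & 0x3f);
    return ((b0 & 0x07) << 18) | ((b[i+1] & 0x3f) << 12)    // 11110xxx: 4 bytes
         | ((b[i+2] & 0x3f) << 6) | (b[i+3] & 0x3f);
}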
UTF-8 can recover from loss of one byte of data, in the sense that the first two bits tell whether a particular byte is a continuation byte (10xxxxxx) or the first byte of a code, so you can pick out where a code starts after the damaged data and decode from there.
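A sketch of that resynchronization (the helper name nextCodeStart is made up for this note):

// Sketch: after damage, skip continuation bytes (10xxxxxx) so decoding
// can restart at the next byte that begins a code.
static int nextCodeStart(byte[] b, int i)
{
    while (i < b.length && (b[i] & 0xc0) == 0x80)
        i++;    // 10xxxxxx never starts a code
    return i;
}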
UTF-8 preserves sort order across conversion to/from Unicode. So if a set of Unicode strings is sorted, so is the corresponding set of UTF-8 strings, and vice versa.
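The comparison meant here is byte by byte, treating each byte as unsigned. A sketch (Java bytes are signed, so we mask with 0xff):

// Sketch: unsigned bytewise compare of UTF-8 data; its result matches
// comparing the original Unicode strings code by code.
static int compareUTF8(byte[] x, byte[] y)
{
    int n = Math.min(x.length, y.length);
    for (int i = 0; i < n; i++) {
        int diff = (x[i] & 0xff) - (y[i] & 0xff);   // unsigned compare
        if (diff != 0)
            return diff;
    }
    return x.length - y.length;    // a prefix sorts first
}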
Pretty clever scheme!
More details
In the spectrum of Unicode codes, the ASCII chars provide the first codes: ‘A’ = 0x0041, ‘a’ = 0x0061, space = 0x0020, CR = 0x000d, etc. So the Java String “b a” is coded in 3 UTF-16 codes: 0062 0020 0061, a total of 6 bytes. Above ASCII codes in the Unicode spectrum come the LATIN-1 codes: British pound sign = 0x00a3, copyright sign = 0x00a9, etc., then the rest of the codes, those over 0xff. The euro sign (not in the true LATIN-1) is Unicode 0x20ac.
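We can see those 6 bytes directly (a sketch; UTF-16BE is the big-endian form, with no byte-order mark):

// Sketch: "b a" in UTF-16 is 6 bytes: 00 62 00 20 00 61.
byte[] utf16 = "b a".getBytes(java.nio.charset.StandardCharsets.UTF_16BE);
System.out.println(utf16.length);    // prints 6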
What does a Unicode above 0xff stand for? Answer: all sorts of characters used in other languages, plus some symbols used in English that don’t appear in ASCII, like ™ (0x2122), the euro, and so on. To see them in MS Word, pull down the Insert menu, select Symbol, and a whole matrix of symbols shows up. Click on one of them and see the code at the bottom of the window.
It’s easy to convert ASCII to UTF-16: just add a first byte of 0x00. Java does this for us in classes InputStreamReader and Scanner. System.out.println(s) uses class PrintStream to convert Unicode (ASCII subset) to ASCII by dropping the 0x00 byte.
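In code, that widening is just a zero high byte (a sketch):

// Sketch: an ASCII byte becomes a UTF-16 char by adding a high byte of 0x00.
byte ascii = 0x41;                // 'A' in ASCII
char c = (char)(ascii & 0xff);    // UTF-16 0x0041
System.out.println(c);            // prints A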
Using the appropriate classes, we can output UTF-8 from a Java program, or Latin-1, as well as ASCII. See the program below. Also, we can get Java to read UTF-8 or Latin-1 or ASCII. See the Javadocs for InputStreamReader, Scanner, and OutputStreamWriter.
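For example, here is a sketch of reading foo.txt (written by the program below) back in; the class name TestUTF8Input is made up for this note:

// Sketch: read the UTF-8 file back; InputStreamReader decodes the bytes
// to UTF-16 chars inside Java.
import java.io.*;
public class TestUTF8Input
{
    public static void main (String[] args)
    {
        try {
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream("foo.txt"), "UTF-8"));
            String line = in.readLine();    // "abcd£™", 6 chars
            in.close();
            System.out.println("read " + line.length() + " chars: " + line);
        }
        catch (Exception e) {
            System.err.println("exception: " + e);
        }
    }
}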
For XML, and some other documents, we want to use the flexibility of full Unicode, yet not pay the price of 16 bits/character in our files. That’s where UTF-8 comes in. It is really a kind of compression of Unicode, at least for ASCII-mostly documents.
// Show how to output Unicode from Java to a file in UTF-8 encoding
import java.io.*;

public class TestUTF8Output
{
    public static void main (String[] args)
    {
        // set up test string with various chars:
        // start with ASCII, then Brit. pound sign, then TM sign
        String s = new String("abcd\u00a3\u2122\n");  // now in Unicode
        try {
            OutputStreamWriter out2 = new OutputStreamWriter(
                    new FileOutputStream("foo.txt"), "UTF-8");
            // make a PrintWriter, so we can use print, etc.
            PrintWriter out = new PrintWriter(out2);
            out.print(s);   // print s in UTF-8
            out.close();
            // remind user where the file output is
            System.out.println("Look in file foo.txt for UTF-8 output.");
            System.out.println("Expect abcd each in one byte.");
            System.out.println("Then Unicode 00a3 in two bytes.");
            System.out.println("Then Unicode 2122 in three bytes.");
            System.out.println("Finally \\n in one byte.");
            System.out.println("On UNIX, see them with \"od -x foo.txt\"");
            System.out.println("On Windows, read it with Notepad or Word.");
        }
        catch (Exception e) {
            System.err.println("exception: " + e);
        }
    }
}
dbs2(79)% java TestUTF8Output
Look in file foo.txt for UTF-8 output.
Expect abcd each in one byte.
Then Unicode 00a3 in two bytes.
Then Unicode 2122 in three bytes.
Finally \n in one byte.
On UNIX, see them with "od -x foo.txt"
dbs2(80)% od -x foo.txt
0000000 6162 6364 c2a3 e284 a20a
0000012
dbs2(81)% od -c foo.txt
0000000   a   b   c   d 302 243 342 204 242  \n
0000012