ASCII: 7-bit codes, normally stored one per 8-bit byte: ‘A’ = 0x41, ‘a’ =
0x61, space = 0x20, CR = 0x0d, etc. Do “man ascii” on UNIX to see the whole
table. There are 128 codes, values 0 - 127, using 7 of the 8
bits. The highest code is 0x7f, for the DELETE character.
Unicode: 21-bit codes (code points 0 through 0x10FFFF).
Can express any language text, and many other special symbols.
UTF-16: 16-bit codes
matching the real Unicode code points except for a few special characters and some obscure
languages. Used for strings and chars inside Java. Java uses a pair of
16-bit chars (a surrogate pair) for the characters that don’t fit in a single 16-bit code.
UTF-8: Can be thought
of as compressed Unicode, often 8 bits but longer for some chars. XML data is
usually written in UTF-8, and UTF-8 is XML's default char encoding. Can
encode any Unicode character, so has the same power as Unicode to express any
language.
LATIN-1, also known as ISO-8859-1 (or
its variant ISO-8859-15, which adds the euro sign): 8-bit codes. Uses
the ASCII codes 9 (TAB), 10 (LF), 13 (CR), 32-127, plus more codes over 127:
160-255, or in hex, 0xa0-0xff. Example non-ASCII chars: British pound
sign £, 0xa3, copyright sign ©, 0xa9, plus/minus sign ±, 0xb1. This is the traditional
default encoding for HTML.
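As a quick illustration (a minimal Java sketch; the class name and sample string are just for the example), the same three characters take a different number of bytes in each encoding:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "A \u00a9";  // 'A' (ASCII), space, © (0xa9: in LATIN-1 but not ASCII)
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1))); // [65, 32, -87]: 3 bytes
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));      // [65, 32, -62, -87]: 4 bytes
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE)));   // [0, 65, 0, 32, 0, -87]: 6 bytes
    }
}

(The negative numbers are just Java's signed bytes; -87 is the byte 0xa9.)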
More details
In the spectrum of Unicode codes, the
ASCII chars provide the first codes: ‘A’ = 0x0041, ‘a’ = 0x0061, space =
0x0020, CR = 0x000d, etc. So the Java String “b a” is coded in 3 UTF-16
codes: 0062 0020 0061, a total of 6 bytes. Above ASCII in the Unicode
spectrum come the LATIN-1 codes: British pound sign = 0x00a3,
copyright sign = 0x00a9, etc., then the rest of the codes, those over
0xff.
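A quick check in Java of the “b a” example above (a minimal sketch; UTF-16BE is used because getBytes with plain UTF-16 would also prepend a 2-byte byte-order mark):

import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    public static void main(String[] args) {
        byte[] b = "b a".getBytes(StandardCharsets.UTF_16BE);
        for (byte x : b) System.out.printf("%02x ", x);    // prints: 00 62 00 20 00 61
        System.out.println("(" + b.length + " bytes)");    // 6 bytes
    }
}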
What does a Unicode code above 0xff stand
for? Answer: all sorts of characters used in other languages, plus some
symbols used in English that don’t appear in ASCII, like ™ (0x2122), and so
on. To see them in MS Word, pull down the Insert menu, select Symbol, and
a whole matrix of symbols shows up. Click on one of them and see the code
at the bottom of the window.
It’s easy to convert ASCII to UTF-16:
just add a first byte of 0x00. Java does this for us in class
InputStreamReader or Scanner. System.out.println(s) uses class
PrintStream to convert Unicode (ASCII subset) to ASCII by dropping the byte of
0x00.
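For instance, a minimal sketch (the file names are made up; the charset arguments are what make the byte-to-char and char-to-byte conversions explicit):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class CharsetIO {
    public static void main(String[] args) throws IOException {
        // bytes from the file are decoded into Java's internal UTF-16 chars
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("in.txt"), StandardCharsets.UTF_8));
        String line = in.readLine();
        in.close();

        // the UTF-16 chars are encoded back to bytes on output
        PrintStream out = new PrintStream(new FileOutputStream("out.txt"), true, "UTF-8");
        out.println(line);
        out.close();
    }
}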
For XML, and some other documents, we
want to use the flexibility of full Unicode, yet not pay the price of 16
bits/character in our files. That’s where UTF-8 comes in. It is
really a kind of compression of Unicode, at least for ASCII-mostly documents.
From the standard, http://www.faqs.org/rfcs/rfc3629.html:
Unicode value range          | UTF-8 octet sequence
hexadecimal range (bits)     | (binary)
-----------------------------|------------------------------------
0000-007F (000000000xxxxxxx) | 0xxxxxxx
0080-07FF (00000xxxxxxxxxxx) | 110xxxxx 10xxxxxx
0800-FFFF (xxxxxxxxxxxxxxxx) | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF          | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
There is an
additional rule for the codes over FFFF (Unicode code points actually run up to
0x10FFFF, which takes 21 bits, but the codes above FFFF are not in common use).
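As an illustration of the first three rules, here is a minimal Java sketch (the method name is my own, not from the RFC) that encodes a single code point in the range 0000-FFFF:

public class Utf8Encode {
    // encode one code point 0x0000-0xFFFF into UTF-8 bytes, following the first three rules
    static byte[] encodeBmp(int cp) {
        if (cp <= 0x7F)                        // rule 1: 0xxxxxxx
            return new byte[] { (byte) cp };
        if (cp <= 0x7FF)                       // rule 2: 110xxxxx 10xxxxxx
            return new byte[] { (byte) (0xC0 | (cp >> 6)),
                                (byte) (0x80 | (cp & 0x3F)) };
        return new byte[] {                    // rule 3: 1110xxxx 10xxxxxx 10xxxxxx
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x61, 0xA9, 0x2122 }) {   // a, ©, ™
            for (byte b : encodeBmp(cp)) System.out.printf("%02x ", b & 0xFF);
            System.out.println();   // prints: 61 / c2 a9 / e2 84 a2
        }
    }
}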
Applying these rules, we see that ‘a’ = 0x0061 has UTF-8 code 0x61, the same as
its ASCII code, and similarly for any ASCII char.
a (0x0061) is coded by the first rule: 0061 = 00000000 0|1100001, UTF-8 code = 01100001 (8 bits)
© (0x00a9) is coded by the second rule: 00a9 = 00000|000 10|101001, UTF-8 code = 11000010 10101001 (16 bits)
™ (0x2122) is coded by the third rule: 2122 = 0010|0001 00|100010, UTF-8 code = 11100010 10000100 10100010 (24 bits)
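Java's built-in encoder gives the same bytes; a quick check (a minimal sketch; the \u escapes just keep the source file pure ASCII):

import java.nio.charset.StandardCharsets;

public class Utf8Check {
    public static void main(String[] args) {
        for (String s : new String[] { "a", "\u00a9", "\u2122" }) {   // a, ©, ™
            for (byte b : s.getBytes(StandardCharsets.UTF_8))
                System.out.printf("%02x ", b & 0xFF);
            System.out.println();   // prints: 61 / c2 a9 / e2 84 a2
        }
    }
}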
Footnote: The additional rule covers
the Unicode characters whose code points lie above 0xffff, that is, overflow
16 bits in their representation. UTF-16 handles these outliers with a pair of
16-bit codes (a surrogate pair), reserving some 16-bit codes for special use as
prefixes. Thus to convert properly from UTF-16 to UTF-8, you need to have
special cases for these prefix codes. See http://en.wikipedia.org/wiki/UTF-8
if you are interested.
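For example, a minimal sketch using U+1D11E (MUSICAL SYMBOL G CLEF), one of the code points above 0xffff:

import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1D11E));   // one character, stored as two Java chars
        System.out.println(s.length());                      // 2 -- the UTF-16 code units d834 dd1e
        System.out.println(s.codePointCount(0, s.length())); // 1 -- one Unicode character
        for (byte b : s.getBytes(StandardCharsets.UTF_8))
            System.out.printf("%02x ", b & 0xFF);            // f0 9d 84 9e -- the 4-byte UTF-8 form
        System.out.println();
    }
}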
We can see that the three cases of UTF-8 codes are recognizable from their first few bits. Thus we can decode UTF-8 by matching the first byte, then following the patterns to reassemble the bits.
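Here is a minimal sketch of that idea (the method name is my own), decoding one code point by matching the first byte, for the 1-, 2-, and 3-byte cases:

public class Utf8Decode {
    // decode the code point starting at position i; assumes b[i] starts a code (not a continuation byte)
    static int decodeAt(byte[] b, int i) {
        int b0 = b[i] & 0xFF;
        if (b0 < 0x80)                         // 0xxxxxxx
            return b0;
        if (b0 < 0xE0)                         // 110xxxxx 10xxxxxx
            return ((b0 & 0x1F) << 6) | (b[i + 1] & 0x3F);
        return ((b0 & 0x0F) << 12)             // 1110xxxx 10xxxxxx 10xxxxxx
             | ((b[i + 1] & 0x3F) << 6)
             |  (b[i + 2] & 0x3F);
    }

    public static void main(String[] args) {
        byte[] tm = { (byte) 0xE2, (byte) 0x84, (byte) 0xA2 };   // UTF-8 for ™
        System.out.printf("%04x%n", decodeAt(tm, 0));            // prints: 2122
    }
}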
UTF-8
can recover from the loss of one byte of data, in the sense that the first two bits
of each byte tell whether it is a continuation byte (10xxxxxx) or the start of a code,
so you can pick out where the next code starts after the damaged data and decode
from there.
UTF-8
preserves sort order across conversion to/from Unicode: comparing two UTF-8 strings
byte by byte (as unsigned values) gives the same order as comparing the corresponding
Unicode strings by code point. So if a set of Unicode strings is sorted, so is the
corresponding set of UTF-8 strings, and vice versa.
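A small check of that claim on the example characters (a minimal sketch; Arrays.compareUnsigned needs Java 9 or later):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SortOrderDemo {
    public static void main(String[] args) {
        // code points: 'a' = 0x61  <  © = 0xa9  <  ™ = 0x2122
        byte[] a  = "a".getBytes(StandardCharsets.UTF_8);       // 61
        byte[] c  = "\u00a9".getBytes(StandardCharsets.UTF_8);  // c2 a9
        byte[] tm = "\u2122".getBytes(StandardCharsets.UTF_8);  // e2 84 a2
        // unsigned byte-by-byte comparison gives the same order as the code points
        System.out.println(Arrays.compareUnsigned(a, c) < 0);   // true
        System.out.println(Arrays.compareUnsigned(c, tm) < 0);  // true
    }
}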
Pretty
clever scheme!