CS253: Software Development with C++

Spring 2018

Unicode

Unicode (ISO-10646)

U+ notation

By convention, Unicode code points are represented as U+ followed by at least four (more only if needed) upper-case hexadecimal digits.

U+005D   ]   RIGHT SQUARE BRACKET
U+00F1   ñ   LATIN SMALL LETTER N WITH TILDE
U+042F   Я   CYRILLIC CAPITAL LETTER YA
U+2622   ☢   RADIOACTIVE SIGN
U+1F3A9  🎩  TOP HAT

RIGHT SQUARE BRACKET is written U+005D, not U+5D. You could also call it Unicode character #93, but don’t.
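The notation is easy to produce from an integer. Here is a minimal sketch; the function name `u_plus` is made up for this illustration and is not from any library.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Format a code point in U+ notation: at least four upper-case
// hexadecimal digits, more only when the value needs them.
// (u_plus is a hypothetical helper name.)
std::string u_plus(std::uint32_t code_point) {
    assert(code_point <= 0x10FFFF);  // Unicode stops at U+10FFFF
    char buf[16];
    std::snprintf(buf, sizeof buf, "U+%04lX",
                  static_cast<unsigned long>(code_point));
    return buf;
}
```

With this sketch, `u_plus(93)` yields `U+005D` and `u_plus(0x1F3A9)` yields `U+1F3A9`.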

What’s in Unicode

ASCII: A-Z              Dingbats: ✈ ☞ ✌ ✔ ✰ ☺ ♥ ♦ ♣ ♠ •
Other Latin: ä ñ « »    Emoji: 🐱
Cyrillic: Я             Egyptian hieroglyphics: 𓁥
Hebrew: א               Mathematics: ∃ 𝒋 : 𝒋 ∉ ℝ
Chinese: ⿂              Musical notation: 𝄞 𝄵 𝆖 𝅘𝅥𝅮
Japanese: ア            no Klingon ☹

All Unicode “blocks”: http://unicode.org/Public/UNIDATA/Blocks.txt

Code Points

Initially, Unicode is all about mapping integers to characters:

Decimal  U+hex    Meaning                        Example
97       U+0061   LATIN SMALL LETTER A           a
9786     U+263A   WHITE SMILING FACE             ☺
66506    U+103CA  OLD PERSIAN SIGN AURAMAZDAAHA  𐏊

Now, do that for 128,000+ more characters.

Encoding

Fine, so we’ve defined this mapping. How do we actually represent those integers in a computer? That’s the job of an encoding. An encoding maps the bits of each integer to a sequence of bytes.

16-bit Encodings

UTF-16:

 ····J···· ····a···· ····c···· ····k····
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ 00 │ 4A │ 00 │ 61 │ 00 │ 63 │ 00 │ 6B │
└────┴────┴────┴────┴────┴────┴────┴────┘
  0    1    2    3    4    5    6    7

32-bit Encodings

UTF-32:

  ········J········   ········a········   ········c········   ········k········
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 00 │ 00 │ 00 │ 4A │ 00 │ 00 │ 00 │ 61 │ 00 │ 00 │ 00 │ 63 │ 00 │ 00 │ 00 │ 6B │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
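The UTF-32 layout in the diagram (most significant byte first, so big-endian) can be reproduced with a few shifts. This is only a sketch; `to_utf32be` is a name chosen for the example, not a standard function.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encode each code point as four bytes, most significant byte first.
// (to_utf32be is a hypothetical helper name.)
std::vector<std::uint8_t> to_utf32be(const std::u32string& text) {
    std::vector<std::uint8_t> bytes;
    for (char32_t cp : text) {
        assert(cp <= 0x10FFFF);  // Unicode stops at U+10FFFF
        bytes.push_back(static_cast<std::uint8_t>((cp >> 24) & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>((cp >> 16) & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>((cp >> 8) & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>(cp & 0xFF));
    }
    return bytes;
}
```

Calling `to_utf32be(U"Jack")` gives the sixteen bytes shown in the diagram: twelve zero bytes, with 4A, 61, 63, and 6B at offsets 3, 7, 11, and 15.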

False Positives

Hey, there’s a slash in this string! No, wait, there isn’t.

When using UTF-16 or UTF-32 encoding, a naïve byte-by-byte search will falsely detect a slash (oh, excuse me, a solidus) in one of the bytes of U+262F: its low byte, 2F, is the ASCII code for “/”.

Similarly, a C-string cannot hold a UTF-16 or UTF-32 string, because of the embedded zero bytes.
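To see the false positive concretely, here is a sketch that searches a UTF-16BE buffer two ways. Both function names are invented for this example, and the buffer is assumed to be big-endian UTF-16 with an even number of bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive: look for the single byte 0x2F ('/') anywhere in the buffer.
bool naive_has_slash(const std::vector<std::uint8_t>& bytes) {
    return std::find(bytes.begin(), bytes.end(), 0x2F) != bytes.end();
}

// Better for UTF-16BE: compare whole 16-bit code units against U+002F.
bool utf16be_has_slash(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % 2 == 0);  // assumption: complete code units only
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        unsigned unit = (static_cast<unsigned>(bytes[i]) << 8) | bytes[i + 1];
        if (unit == 0x002F) return true;
    }
    return false;
}
```

For the two bytes 26 2F (U+262F in UTF-16BE), the naive search reports a slash and the code-unit search does not.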

UTF-8 Variable-Length Encoding

Bits  Range             Byte 1    Byte 2    Byte 3    Byte 4
7     U+0000–U+007F     0xxxxxxx
11    U+0080–U+07FF     110xxxxx  10xxxxxx
16    U+0800–U+FFFF     1110xxxx  10xxxxxx  10xxxxxx
21    U+10000–U+10FFFF  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

The four-byte pattern has room for 21 bits (values up to 1FFFFF), but Unicode stops at U+10FFFF.

  J    a    c    k
┌────┬────┬────┬────┐
│ 4A │ 61 │ 63 │ 6B │
└────┴────┴────┴────┘
  0    1    2    3
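The leading bits of the first byte tell a decoder how long the sequence is. A rough sketch, with the made-up name `utf8_length`; it checks only the bit patterns from the table and does not reject every invalid lead byte.

```cpp
#include <cstddef>

// Length in bytes of the UTF-8 sequence that starts with `lead`.
// Returns 0 for a continuation byte (10xxxxxx) or any other byte
// that matches none of the four lead patterns.
// (utf8_length is a hypothetical helper name.)
std::size_t utf8_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}
```

Every byte of “Jack” starts with a 0 bit, so each is a one-byte sequence, which is why the UTF-8 diagram is identical to plain ASCII.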

Illustration of Various Encodings

U+       Char  Description         UTF-32BE  UTF-16BE   UTF-8
U+0041   A     A                   00000041  0041       41
U+03A9   Ω     Omega               000003A9  03A9       CE A9
U+4DCA   ䷊     Hexagram for peace  00004DCA  4DCA       E4 B7 8A
U+1F42E  🐮    Mooooooooo!         0001F42E  D83D DC2E  F0 9F 90 AE
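The UTF-16 column for U+1F42E holds two 16-bit units because the code point does not fit in 16 bits; UTF-16 splits such values into a surrogate pair. A minimal sketch of that rule, with `to_utf16` as an invented name:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one code point as one or two UTF-16 code units.
// (to_utf16 is a hypothetical helper name; this sketch does not
// reject code points in the surrogate range D800-DFFF.)
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000)
        return {static_cast<std::uint16_t>(cp)};
    std::uint32_t v = cp - 0x10000;  // 20 bits remain
    return {static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // high 10 bits
            static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))};  // low 10 bits
}
```

For U+1F42E this produces D83D DC2E, matching the table, and for U+03A9 it produces the single unit 03A9.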

Example

  • Consider U+1F42E 🐮
    • 1F42E₁₆ = 1 1111 0100 0010 1110₂ (17 bits)
    • Need 21 bits, add leading zeroes: 0 0001 1111 0100 0010 1110
    • Grouped properly: 000 011111 010000 101110
    • Byte #1: 11110xxx, use first three bits, 11110 000
    • Byte #2: 10xxxxxx, use the next six bits, 10 011111
    • Byte #3: 10xxxxxx, use the next six bits, 10 010000
    • Byte #4: 10xxxxxx, use the next six bits, 10 101110
    • All the bits:
      • 11110 000  10 011111  10 010000  10 101110
      • 11110000   10011111   10010000   10101110
      • 1111 0000  1001 1111  1001 0000  1010 1110
      • F0 9F 90 AE
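The same bit shuffling can be written as shifts and masks. The following is a sketch rather than a production encoder: `to_utf8` is a name made up here, and it does not reject the surrogate range D800–DFFF.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one code point as UTF-8, following the bit layout in the table.
// (to_utf8 is a hypothetical helper name.)
std::string to_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp <= 0x7F) {                  // 7 bits: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {          // 11 bits: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {         // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                           // 21 bits: 11110xxx + three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Feeding it 0x1F42E gives the four bytes F0 9F 90 AE worked out by hand above.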

Byte Order Mark

Often, files contain a “magic number”: initial bytes that indicate what sort of file it is.

Encoding  Bytes
UTF-32BE  00 00 FE FF
UTF-32LE  FF FE 00 00
UTF-16BE  FE FF
UTF-16LE  FF FE
UTF-8     EF BB BF

The character U+FEFF ZERO WIDTH NO-BREAK SPACE is also used as a Byte Order Mark, or BOM. When it appears as the first bytes of a data file, it indicates the encoding (assuming that you’re limited to Unicode).

If not processed as a BOM, then ZERO WIDTH NO-BREAK SPACE is mostly harmless.
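A program can sniff the encoding by comparing a file's first bytes against the table. This is a sketch under the same assumption as the text, that the data is some Unicode encoding; `detect_bom` is an invented name.

```cpp
#include <algorithm>
#include <cstdint>
#include <initializer_list>
#include <string>
#include <vector>

// Return the encoding named by a leading BOM, or "" if there is none.
// (detect_bom is a hypothetical helper name.)
std::string detect_bom(const std::vector<std::uint8_t>& b) {
    auto starts_with = [&](std::initializer_list<std::uint8_t> sig) {
        return b.size() >= sig.size() &&
               std::equal(sig.begin(), sig.end(), b.begin());
    };
    // Check the four-byte marks first: FF FE 00 00 also begins with FF FE.
    // (It could in principle be UTF-16LE text starting with U+0000;
    // this sketch assumes it is not.)
    if (starts_with({0x00, 0x00, 0xFE, 0xFF})) return "UTF-32BE";
    if (starts_with({0xFF, 0xFE, 0x00, 0x00})) return "UTF-32LE";
    if (starts_with({0xEF, 0xBB, 0xBF}))       return "UTF-8";
    if (starts_with({0xFE, 0xFF}))             return "UTF-16BE";
    if (starts_with({0xFF, 0xFE}))             return "UTF-16LE";
    return "";
}
```

The order of the tests matters, because the UTF-32LE mark has the UTF-16LE mark as a prefix.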

Programming

It’s all about bytes vs. characters. Too many languages have no byte type, so programmers use char instead. Trouble! The language has no idea whether you’re processing text, which should be treated as Unicode, or bytes of data, as when a program parses a JPEG file.

Linux Commands

echo -e: \u takes up to four hex digits; \U takes up to eight hex digits

    % echo -e '\uf1'
    ñ
    % echo -e '\U1f435'
    🐵

wc -c counts bytes; wc -m counts characters (both counts below include the newline that echo appends)

    % echo -e '\U1f435' | wc -c
    5
    % echo -e '\U1f435' | wc -m
    2
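The same distinction shows up in C++, where `std::string::size()` counts bytes. A small sketch of counting characters (code points) in UTF-8 text instead; `count_code_points` is a name invented here, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count code points in UTF-8 text: every byte except a continuation
// byte (10xxxxxx) starts a new code point.
// (count_code_points is a hypothetical helper name.)
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)
            ++count;
    return count;
}
```

For the string `"\xF0\x9F\x90\xB5\n"` (the monkey plus a newline), `size()` is 5 and `count_code_points` returns 2, the same numbers that wc reports.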

Viewing Files

View with xxd or od:

    % echo -e 'ABC' | xxd
    00000000: 4142 430a                                ABC.
    % echo -e '\U1f435' | xxd
    00000000: f09f 90b5 0a                             .....

    % echo -e 'ABC' | od -t x1
    0000000 41 42 43 0a
    0000004
    % echo -e '\U1f435' | od -t x1
    0000000 f0 9f 90 b5 0a
    0000005
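When neither tool is handy, a few lines of C++ give a similar byte-level view. A sketch only; `hex_bytes` is a made-up name, and the output format loosely imitates `od -t x1` without the offsets.

```cpp
#include <cstdio>
#include <string>

// Render each byte as two lower-case hex digits, separated by spaces.
// (hex_bytes is a hypothetical helper name.)
std::string hex_bytes(const std::string& data) {
    std::string out;
    char buf[4];
    for (unsigned char byte : data) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(byte));
        out += buf;
    }
    return out;
}
```

Given `"\xF0\x9F\x90\xB5\n"`, it returns `f0 9f 90 b5 0a`, the same bytes that xxd and od show.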

Modified: 2018-04-24T16:57

© 2018 Colorado State University