CS253 Unicode

Unicode (ISO-10646)

First published 1991
Version 9.0 published June 2016.
More than 128,000 characters
Incorporates ASCII as code points 0–127 without change.
Incorporates ISO-8859-1 (Latin-1) as code points 0–255 without change.
Code points, not encoding (patience)
- U+0041 A LATIN CAPITAL LETTER A
- U+2FC2 ⿂ KANGXI RADICAL FISH
- U+1F355 🍕 SLICE OF PIZZA
Meaning, not pictures

U+ notation

By convention, Unicode code points are represented as U+ followed by at least four (more only if needed) upper-case hexadecimal digits.

U+005D	]	RIGHT SQUARE BRACKET
U+00F1	ñ	LATIN SMALL LETTER N WITH TILDE
U+042F	Я	CYRILLIC CAPITAL LETTER YA
U+2622	☢	RADIOACTIVE SIGN
U+1F3A9	🎩	TOP HAT

RIGHT SQUARE BRACKET is written U+005D, not U+5D. You could also call it Unicode character #93, but don’t.
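
The convention above can be sketched in C++ as follows. The function name to_u_notation is hypothetical, chosen only for this illustration; it pads to four digits and lets larger values grow as needed.

```cpp
#include <cstdio>
#include <string>

// Format a code point in U+ notation: at least four upper-case
// hexadecimal digits, more only if the value needs them.
// (to_u_notation is an illustrative name, not a standard function.)
std::string to_u_notation(char32_t cp) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "U+%04lX", static_cast<unsigned long>(cp));
    return buf;
}
```

For example, to_u_notation(93) yields U+005D, and to_u_notation(0x1F3A9) yields U+1F3A9.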

What’s in Unicode

ASCII: A-Z	Dingbats: ✈ ☞ ✌ ✔ ✰ ☺ ♥ ♦ ♣ ♠ •
Other Latin: ä ñ « »	Emoji: 🐱
Cyrillic: Я	Egyptian hieroglyphics: 𓁥
Hebrew: א	Mathematics: ∃ 𝒋 : 𝒋 ∉ ℝ
Chinese: ⿂	Musical notation: 𝄞 𝄵 𝆖 𝅘𝅥𝅮
Japanese: ア	no Klingon ☹

All Unicode “blocks”: http://unicode.org/Public/UNIDATA/Blocks.txt

Code Points

Initially, Unicode is all about mapping integers to characters:

Decimal	U+hex	Meaning	Example
97	U+0061	LATIN SMALL LETTER A	a
9786	U+263A	WHITE SMILING FACE	☺
66506	U+103CA	OLD PERSIAN SIGN AURAMAZDAAHA	𐏊

Now, do that for 128,000+ more characters.

Encoding

Fine, so we’ve defined this mapping. How do we actually represent those code points in a computer? That’s the job of an encoding. An encoding is a mapping from a code point’s integer value to a sequence of bytes.

16-bit Encodings

UCS-2:
- Fixed-length 16-bit.
- Each character is two 8-bit bytes, whether in memory, or on a disk.
- Certainly is straightforward.
- Inadequate for modern Unicode, which has many more than 2¹⁶ characters. Can’t even represent U+1F554 🕔 CLOCK FACE FIVE OCLOCK.
- Unicode originally had a much more modest scope, only living languages, so that might have worked.


 ····J···· ····a···· ····c···· ····k····
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ 00 │ 4A │ 00 │ 61 │ 00 │ 63 │ 00 │ 6B │
└────┴────┴────┴────┴────┴────┴────┴────┘
  0    1    2    3    4    5    6    7
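
A minimal sketch of big-endian UCS-2, assuming the text is already available as a sequence of code points. The name ucs2be_encode is hypothetical. The function fails as soon as a code point does not fit in 16 bits, which is the inadequacy noted above.

```cpp
#include <string>
#include <vector>

// Encode code points as big-endian UCS-2: exactly two bytes per character.
// Returns false (leaving out partially filled) if a code point needs more
// than 16 bits. (ucs2be_encode is an illustrative name.)
bool ucs2be_encode(const std::u32string &text, std::vector<unsigned char> &out) {
    out.clear();
    for (char32_t cp : text) {
        if (cp > 0xFFFF)
            return false;  // e.g. U+1F554 cannot be represented
        out.push_back(static_cast<unsigned char>(cp >> 8));    // high byte first
        out.push_back(static_cast<unsigned char>(cp & 0xFF));  // then low byte
    }
    return true;
}
```

Encoding U"Jack" produces the eight bytes 00 4A 00 61 00 63 00 6B from the diagram; encoding U+1F554 returns false.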

16-bit Encodings

UTF-16:

Slightly variable-length: values ≤ U+FFFF take two bytes, other values take four bytes.
Consider U+203D ‽ INTERROBANG:
- UTF-16BE (big-endian): bytes are 20 3D
- UTF-16LE (little-endian): bytes are 3D 20
For values ≥ U+10000 and ≤ U+10FFFF:
- Subtract out 0x10000, leaving a 20-bit value.
- Emit U+D800 plus the top ten bits.
- Emit U+DC00 plus the lower ten bits.
- There are no valid code points U+D800…U+DFFF, so these surrogate values can’t be mistaken for characters.
100% overhead for ASCII text.
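
The UTF-16 rules can be sketched as below: values up to U+FFFF become one 16-bit unit, and larger values have 0x10000 subtracted and are split into a surrogate pair. The names utf16_units and utf16_bytes are hypothetical, and the code assumes its input is a valid code point (at most U+10FFFF, not itself a surrogate).

```cpp
#include <cstdint>
#include <vector>

// Split one code point into UTF-16 code units (one or two 16-bit values).
// Assumes cp <= U+10FFFF and cp is not in U+D800..U+DFFF.
std::vector<std::uint16_t> utf16_units(char32_t cp) {
    if (cp <= 0xFFFF)
        return {static_cast<std::uint16_t>(cp)};
    std::uint32_t v = cp - 0x10000;  // 20 bits remain
    return {static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // top ten bits
            static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))};  // lower ten bits
}

// Serialize the code units as bytes: UTF-16BE if big_endian, else UTF-16LE.
std::vector<unsigned char> utf16_bytes(char32_t cp, bool big_endian) {
    std::vector<unsigned char> bytes;
    for (std::uint16_t u : utf16_units(cp)) {
        unsigned char hi = static_cast<unsigned char>(u >> 8);
        unsigned char lo = static_cast<unsigned char>(u & 0xFF);
        bytes.push_back(big_endian ? hi : lo);
        bytes.push_back(big_endian ? lo : hi);
    }
    return bytes;
}
```

U+203D gives the single unit 203D (bytes 20 3D big-endian, 3D 20 little-endian); U+1F42E gives the pair D83D DC2E.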

32-bit Encodings

UTF-32:

Straightforward rendering of the code point in binary, with the same byte-order problem as UTF-16:
- UTF-32BE: big-endian version
- UTF-32LE: little-endian version
300% overhead for ASCII text.
- Sure, disk space is cheap, but, c’mon.


  ········J········   ········a········   ········c········   ········k········
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 00 │ 00 │ 00 │ 4A │ 00 │ 00 │ 00 │ 61 │ 00 │ 00 │ 00 │ 63 │ 00 │ 00 │ 00 │ 6B │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
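
UTF-32 is simple enough that a sketch is only a few lines. The function name utf32_bytes is hypothetical; the flag picks between the BE and LE byte orders.

```cpp
#include <vector>

// Write a code point as four bytes: UTF-32BE if big_endian, else UTF-32LE.
// (utf32_bytes is an illustrative name.)
std::vector<unsigned char> utf32_bytes(char32_t cp, bool big_endian) {
    std::vector<unsigned char> bytes(4);
    for (int i = 0; i < 4; ++i) {
        // Byte i of the value, counting from the least significant byte.
        unsigned char b = static_cast<unsigned char>((cp >> (8 * i)) & 0xFF);
        bytes[big_endian ? 3 - i : i] = b;
    }
    return bytes;
}
```

utf32_bytes(U'J', true) gives 00 00 00 4A, matching the first four cells of the diagram.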

False Positives

Hey, there’s a slash in this string! No, wait, there isn’t.

U+002F / SOLIDUS
U+262F ☯ YIN YANG

When using UTF-16 or UTF-32 encoding, a naïve algorithm will falsely detect a slash (oh, excuse me, a solidus) in one of the bytes of U+262F.

Similarly, a C-string cannot hold a UTF-16 or UTF-32 string, because of the embedded zero bytes.
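
Both problems can be demonstrated with a small sketch. The byte strings are U+262F and the text "Jack" written out by hand as raw bytes; all names here are hypothetical, chosen for the illustration.

```cpp
#include <cstring>
#include <string>

// Naive search: does any single byte of the buffer equal '/' (0x2F)?
bool has_slash_byte(const std::string &bytes) {
    return bytes.find('/') != std::string::npos;
}

// U+262F YIN YANG in three encodings, as raw bytes.
const std::string yin_yang_utf16be("\x26\x2F", 2);
const std::string yin_yang_utf32be("\x00\x00\x26\x2F", 4);
const std::string yin_yang_utf8("\xE2\x98\xAF", 3);

// "Jack" in UTF-16BE. The very first byte is zero, so C string
// functions treat the buffer as an empty string.
const char jack_utf16be[] = "\x00\x4A\x00\x61\x00\x63\x00\x6B";
```

has_slash_byte reports a slash in the UTF-16BE and UTF-32BE bytes, but not in the UTF-8 bytes, and std::strlen(jack_utf16be) is 0 even though the array holds eight bytes of text.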

UTF-8 Variable-Length Encoding

Bits	Range	Byte 1	Byte 2	Byte 3	Byte 4
7	U+0000–U+007F	0xxxxxxx
11	U+0080–U+07FF	110xxxxx	10xxxxxx
16	U+0800–U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
21	U+10000–U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

Features:
- ASCII never appears except as intended; NUL and slash never appear in other byte sequences.
- Therefore, no kernel changes.
- Find previous/next char are fast operations.
- Self-synchronizing: if some bytes are damaged, it’s easy to find the beginning of the next/previous character.
- 0% overhead for ASCII text.

  J    a    c    k
┌────┬────┬────┬────┐
│ 4A │ 61 │ 63 │ 6B │
└────┴────┴────┴────┘
  0    1    2    3
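
The find-previous/next and self-synchronizing features rest on one test: every continuation byte looks like 10xxxxxx, and no first byte does. A minimal sketch, assuming valid UTF-8 in a std::string and a position inside the string (function names are hypothetical):

```cpp
#include <cstddef>
#include <string>

// A continuation byte has the form 10xxxxxx.
bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Index of the first byte of the character containing position pos.
// Requires pos < s.size().
std::size_t char_start(const std::string &s, std::size_t pos) {
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos])))
        --pos;
    return pos;
}

// Index of the first byte of the next character after position pos,
// or s.size() if there is none. Requires pos < s.size().
std::size_t next_char_start(const std::string &s, std::size_t pos) {
    ++pos;
    while (pos < s.size() && is_continuation(static_cast<unsigned char>(s[pos])))
        ++pos;
    return pos;
}
```

Neither loop runs more than three steps on valid input, since a character is at most four bytes long.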

Illustration of Various Encodings

U+	Char	Description	UTF-32BE	UTF-16BE	UTF-8
U+0041	A	A	00000041	0041	41
U+03A9	Ω	Omega	000003A9	03A9	CE A9
U+4DCA	䷊	Hexagram for peace	00004DCA	4DCA	E4 B7 8A
U+1F42E	🐮	Mooooooooo!	0001F42E	D83D DC2E	F0 9F 90 AE

Example

Bits	Range	Byte 1	Byte 2	Byte 3	Byte 4
7	U+0000–U+007F	0xxxxxxx
11	U+0080–U+07FF	110xxxxx	10xxxxxx
16	U+0800–U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
21	U+10000–U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

Consider U+1F42E 🐮
- 1F42E₁₆ = 1 1111 0100 0010 1110₂ (17 bits)
- Need 21 bits, add leading zeroes: 0 0001 1111 0100 0010 1110
- Grouped properly: 000 011111 010000 101110
- Byte #1: 11110xxx, use first three bits, 11110 000
- Byte #2: 10xxxxxx, use the next six bits, 10 011111
- Byte #3: 10xxxxxx, use the next six bits, 10 010000
- Byte #4: 10xxxxxx, use the next six bits, 10 101110
- All the bits:
  - 11110 000 10 011111 10 010000 10 101110
  - 11110000 10011111 10010000 10101110
  - 1111 0000 1001 1111 1001 0000 1010 1110
  - F0 9F 90 AE
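
The same steps, written as a sketch of an encoder for a single code point. The name utf8_encode is hypothetical, and the code assumes a valid code point (at most U+10FFFF, not a surrogate) rather than checking for one.

```cpp
#include <string>

// Encode one code point as UTF-8, following the table row by row.
// Assumes cp <= U+10FFFF and cp is not in U+D800..U+DFFF.
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                      // 7 bits:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {              // 11 bits: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {             // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 21 bits: 11110xxx + three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

utf8_encode(0x1F42E) returns the four bytes F0 9F 90 AE, as worked out by hand above.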

Byte Order Mark

Often, files contain a “magic number”—initial bytes that indicate what sort of file it is.

Encoding	Bytes
UTF-32BE	00 00 FE FF
UTF-32LE	FF FE 00 00
UTF-16BE	FE FF
UTF-16LE	FF FE
UTF-8	EF BB BF

The character U+FEFF ZERO WIDTH NO-BREAK SPACE is also used as a Byte Order Mark, or BOM. When it appears as the first bytes of a data file, it indicates the encoding (assuming that you’re limited to Unicode).

If not processed as a BOM, then ZERO WIDTH NO-BREAK SPACE is mostly harmless.
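
A sketch of BOM sniffing based on the table above. The names has_prefix and detect_bom are hypothetical. The UTF-32LE check must come before UTF-16LE, because FF FE 00 00 also begins with FF FE; the result is a best guess, since those four bytes could in principle be a UTF-16LE BOM followed by U+0000.

```cpp
#include <string>

// True if data begins with the given byte sequence.
static bool has_prefix(const std::string &data, const std::string &prefix) {
    return data.compare(0, prefix.size(), prefix) == 0;
}

// Guess the encoding from a leading BOM; returns "" if no BOM is present.
// Four-byte marks are tested first so that UTF-32LE is not mistaken
// for UTF-16LE.
std::string detect_bom(const std::string &data) {
    if (has_prefix(data, std::string("\x00\x00\xFE\xFF", 4))) return "UTF-32BE";
    if (has_prefix(data, std::string("\xFF\xFE\x00\x00", 4))) return "UTF-32LE";
    if (has_prefix(data, "\xFE\xFF")) return "UTF-16BE";
    if (has_prefix(data, "\xFF\xFE")) return "UTF-16LE";
    if (has_prefix(data, "\xEF\xBB\xBF")) return "UTF-8";
    return "";
}
```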

Programming

It’s all about bytes vs. characters. Too many languages have no byte type, so programmers use char instead. Trouble! The language has no idea whether you’re processing text, which should be treated as Unicode, or bytes of data, which would happen if a program were parsing a JPEG file.

Linux Commands

echo -e: \u takes up to four hex digits; \U takes up to eight hex digits

    % echo -e '\uf1'
    ñ
    % echo -e '\U1f435'
    🐵

wc -c counts bytes; wc -m counts characters

    % echo -e '\U1f435' | wc -c
    5
    % echo -e '\U1f435' | wc -m
    2
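
The same two counts can be sketched in C++, assuming the input is valid UTF-8 held in a std::string. The function names are hypothetical. Counting characters only requires skipping continuation bytes (10xxxxxx).

```cpp
#include <cstddef>
#include <string>

// Bytes, as wc -c reports them: std::string counts bytes, not characters.
std::size_t count_bytes(const std::string &s) { return s.size(); }

// Code points, as wc -m reports them for UTF-8 input: every byte
// that is not a continuation byte begins a new character.
std::size_t count_code_points(const std::string &s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the monkey plus its trailing newline ("\xF0\x9F\x90\xB5\n"), the counts are 5 bytes and 2 characters, matching the wc output above.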

Viewing Files

View with xxd or od:

    % echo -e 'ABC' | xxd
    00000000: 4142 430a                                ABC.
    % echo -e '\U1f435' | xxd
    00000000: f09f 90b5 0a                             .....

    % echo -e 'ABC' | od -t x1
    0000000 41 42 43 0a
    0000004
    % echo -e '\U1f435' | od -t x1
    0000000 f0 9f 90 b5 0a
    0000005
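
Inside a C++ program, a rough equivalent of od -t x1 (without the offset column) might look like this sketch; hex_bytes is a hypothetical name.

```cpp
#include <cstdio>
#include <string>

// Render each byte as two lower-case hex digits, separated by spaces.
// (hex_bytes is an illustrative name.)
std::string hex_bytes(const std::string &data) {
    std::string out;
    char buf[4];
    for (unsigned char b : data) {
        if (!out.empty())
            out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}
```

hex_bytes("ABC\n") returns 41 42 43 0a, the same bytes od shows.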

CS253: Software Development with C++

Spring 2018
