Unicode

Overview

Introduction
Chaos
Order
Problems
Programming
Resources

Introduction

Jack Applin
I am an amateur with regard to Unicode. It fascinates me.
This talk was developed with the aid of Vicodin and Codeine, so it may wander a bit.

Chaos

The time before Unicode.

Pre-ASCII

It’s all about the mapping of bits to symbols. But what bits should we use to represent a given symbol? Is ‘A’ represented by 1, 1003, or 65? There were many opinions:

Baudot code, a 5-bit modal telegraph code
CDC display code, 6-bit (10 per 60-bit word)
EBCDIC (8-bit)
- ‘i’+1 ≠ ‘j’
Convergence was impeded by the usual bickering between organizations reluctant to abandon their proprietary solutions for a common standard.
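
The EBCDIC quirk above can still be checked today. This is a minimal sketch, assuming that Python's built-in cp037 codec (one of several EBCDIC variants) is a fair stand-in for the EBCDIC of the era; the helper name `code` is made up for illustration.

```python
# Compare how far apart 'i' and 'j' sit in ASCII and in EBCDIC.
# Assumption: code page 037 is used as a representative EBCDIC variant.
def code(ch, encoding):
    """Return the single byte value that `encoding` assigns to `ch`."""
    return ch.encode(encoding)[0]

ascii_gap = code('j', 'ascii') - code('i', 'ascii')
ebcdic_gap = code('j', 'cp037') - code('i', 'cp037')
print('ASCII gap:', ascii_gap)
print('EBCDIC gap:', ebcdic_gap)
```

In ASCII the letters are contiguous, so a loop from 'a' to 'z' visits only letters. In EBCDIC the alphabet is split into runs with gaps between them.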

ASCII (ISO-646)

US-ASCII, ISO-646
Published in 1963
Seven-bit code
- Why not just use all eight bits?
- Check your eight-bit privilege!
Eighth bit often used for parity
Lacks π æ ä ñ ç £ ¢ € ¥ ° • © ™ ≤ « » “ ” ‘ ’, not to mention Korean, Hebrew, etc.

National Use

Great, everything was standard! No, wait—the French still wanted their accents (à), the British wanted their pound sterling (£), etc.

A number of characters were designated as “National Use”, to be replaced by local characters. For example, @ was replaced by ‘à’ for French use, and ‘§’ for the Germans. Similarly, ‘\’ was replaced by ‘Ö’ for the Swedes, and ‘Ñ’ for Spain.

Swedish C programs looked like this:

    printf("Hello, world!Ön");

I’m told that one got used to it.

I’m Still Not Satisfied

Still, this was not good enough. Greeks, Russians, and Israelis needed entire alphabets of non-Latin characters, and the few characters reserved for national use were insufficient.

The character positions 128–255 were there for the taking, and so they got took. Many incompatible eight-bit extensions to ASCII were created.

Using the Eighth Bit

Character Set	Sponsor
ArmSCII	Armenia
ISCII	India
YUSCII	Yugoslavia
PETSCII	Commodore
WISCII	Wang Computers
Roman8	Hewlett-Packard
Latin-1 (a.k.a. ISO-8859-1)	ISO
Windows-1252 (appallingly a.k.a. “ANSI”)	Microsoft

Convergence was impeded by the usual bickering between organizations reluctant to abandon their proprietary solutions for a common standard.

ISO-8859-X

ISO-8859-1, Latin-1: W. European	ISO-8859-9, Latin-5: Turkish
ISO-8859-2, Latin-2: Cent. European	ISO-8859-10, Latin-6: Nordic
ISO-8859-3, Latin-3: S. European	ISO-8859-11, Latin/Thai
ISO-8859-4, Latin-4: N. European	There is no ISO-8859-12!
ISO-8859-5, Latin/Cyrillic	ISO-8859-13, Latin-7: Baltic Rim
ISO-8859-6, Latin/Arabic	ISO-8859-14, Latin-8: Celtic
ISO-8859-7, Latin/Greek	ISO-8859-15, Latin-9: Latin-1 with tweaks
ISO-8859-8, Latin/Hebrew	ISO-8859-16, Latin-10: SE European

Non-European Languages

Shift-JIS: Japanese
BIG5: Chinese character encoding used in Taiwan, Hong Kong, and Macau for traditional Chinese characters.
GB: Chinese character encoding used in the People’s Republic of China.

All of these encodings are variable-length: one byte for ASCII, two bytes for Japanese/Chinese.

Which Encoding?

How did you know how any particular data file was encoded?
Guesswork, usually.
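
To see why guessing was necessary, here is a small sketch that decodes one and the same byte under several of the eight-bit encodings mentioned earlier. The codec names are the ones Python ships with; the byte value 0xE4 is an arbitrary choice for illustration.

```python
# One byte, several plausible readings. Nothing in the byte says which is right.
mystery = bytes([0xE4])

for encoding in ('iso8859_1', 'cp1252', 'iso8859_5', 'iso8859_7'):
    print(f'{encoding:10} {mystery.decode(encoding)}')
```

A Western European reader sees a Latin letter, a Russian reader sees a Cyrillic one, and a Greek reader sees a Greek one, all from identical data.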

Order

Incorporation

One way to change something is through incorporation. You don’t try to change the existing thing—you just incorporate it into a bigger framework.

Physics: Relativity doesn’t invalidate Newtonian physics, at least not at human speeds.
Religion: Your god is just fine—you keep on worshipping him. However, you should know that he’s really just one god of many in our new pantheon. Welcome to our new theology, which encompasses your old one!
Character Sets: The national versions of ASCII, and the ISO-8859-X versions, were successful because plain ASCII still worked. They were supersets of ASCII.

Unicode (ISO-10646)

First published 1991
Version 11.0, published June 2018, has 137,374 characters.
Incorporates ASCII as code points 0–127 without change.
Incorporates ISO-8859-1 as code points 0–255 without change.
Code points, not encoding (patience)
- U+0041 A LATIN CAPITAL LETTER A
- U+2fc2 ⿂ KANGXI RADICAL FISH
- U+1f355 🍕 SLICE OF PIZZA
- Meaning, not pictures (glyphs)

What’s in Unicode

ASCII: A-Z	Dingbats: ✈☞✌✔✰☺♥♦♣♠•
Other Latin: äñ«»	Emoji: 🐱
Cyrillic: Я	Egyptian hieroglyphics: 𓁥
Hebrew: א	Mathematics: ∃x:x∉ℝ
Chinese: ⿂	Musical notation: 𝄞𝄵𝆖𝅘𝅥𝅮
Japanese: ア	no Klingon ☹

All Unicode “blocks”: http://unicode.org/Public/UNIDATA/Blocks.txt

Code Points

Initially, Unicode is all about mapping integers to characters:

Decimal	U+hex	Meaning	Example
97	U+0061	LATIN SMALL LETTER A	a
9786	U+263a	WHITE SMILING FACE	☺
66573	U+1040d	DESERET CAPITAL LETTER OW	𐐍

Now, do that for 110,000+ more characters.
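
The integer-to-character mapping in the table can be explored from most languages. A minimal sketch in Python, using the standard `unicodedata` module; the names it reports depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# The three rows of the table above, given as decimal integers.
for decimal in (97, 9786, 66573):
    ch = chr(decimal)                      # integer -> character
    assert ord(ch) == decimal              # character -> integer
    print(f'{decimal:6} U+{decimal:04x} {unicodedata.name(ch)} {ch}')
```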

Encoding

Fine, so we’ve defined this mapping. How do we actually represent those in a computer? That’s the job of an encoding. An encoding is a mapping of integers to bytes.

16-bit Encodings

UCS-2:

Fixed-length 16-bit.
Each character is two 8-bit bytes, whether in memory, or on a disk.
Certainly is straightforward.
Inadequate for modern Unicode, which has many more than 2¹⁶ characters. Can’t even represent U+1f554 🕔 CLOCK FACE FIVE OCLOCK.
Unicode originally had a much more modest scope—only living languages, so it might have worked for that.

16-bit Encodings

UTF-16:

Slightly variable-length: values ≤ U+FFFF take two bytes, other values take four bytes.

UTF-16BE (big-endian): U+203d ‽ INTERROBANG is 20 3d.
UTF-16LE (little-endian): U+203d ‽ INTERROBANG is 3d 20.
For values ≥ U+10000 and ≤ U+10ffff:
- Subtract out 0x10000
- Emit U+D800 plus the top ten bits.
- Emit U+DC00 plus the lower ten bits.
- There are no valid code points U+D800…U+DFFF.
100% overhead for ASCII text.
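
The surrogate recipe above is short enough to write out. This is only a sketch: the function names `utf16_units` and `utf16be_bytes` are invented for illustration, and real programs should simply use their language's built-in encoder.

```python
def utf16_units(cp):
    """Return the list of 16-bit code units that encode code point `cp`."""
    if 0xD800 <= cp <= 0xDFFF or not 0 <= cp <= 0x10FFFF:
        raise ValueError(f'not an encodable code point: {cp:#x}')
    if cp <= 0xFFFF:
        return [cp]                        # fits in a single unit as-is
    v = cp - 0x10000                       # what remains is a 20-bit number
    return [0xD800 + (v >> 10),            # first unit: top ten bits
            0xDC00 + (v & 0x3FF)]          # second unit: lower ten bits

def utf16be_bytes(cp):
    """Serialize the code units big-endian, two bytes each."""
    return b''.join(unit.to_bytes(2, 'big') for unit in utf16_units(cp))

print([f'{unit:04x}' for unit in utf16_units(0x1F42E)])
print(utf16be_bytes(0x203D).hex())
```

Because no real character lives in U+D800…U+DFFF, a decoder can always tell half of a pair from a complete two-byte character.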

32-bit Encodings

UTF-32:

straightforward rendering of the code point in binary, with the same problems about byte order:
- UTF-32BE: big-endian version
- UTF-32LE: little-endian version
300% overhead for ASCII text.
- Sure, disk space is cheap, but, c’mon.

False Positives

Hey, there’s a slash in this string! No, wait, there isn’t.

U+002f / SOLIDUS
U+262f ☯ YIN YANG

When using UTF-16 or UTF-32 encoding, a naïve algorithm will falsely detect a slash (oh, excuse me, a solidus) in one of the bytes of U+262f.

Similarly, a C-string cannot hold a UTF-16 or UTF-32 string, because of the embedded zero bytes.
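
Both problems are easy to reproduce. The sketch below does what a naïve byte-oriented program does: it searches the encoded bytes without regard to character boundaries. The helper name `raw_bytes_contain` is made up for this example.

```python
def raw_bytes_contain(text, encoding, byte_value):
    """Byte-level search that knows nothing about character boundaries."""
    return byte_value in text.encode(encoding)

SLASH, NUL = 0x2F, 0x00
yin_yang = '\u262f'                        # contains no slash character at all

for encoding in ('utf-16-be', 'utf-32-be', 'utf-8'):
    print(encoding,
          '| slash byte in yin-yang:', raw_bytes_contain(yin_yang, encoding, SLASH),
          '| zero byte in "A":', raw_bytes_contain('A', encoding, NUL))
```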

Morse Code

Consider the phrase “I ate lunch”, in Morse Code:

I = ••
ate = •− − •
lunch = •−•• ••− −• −•−• ••••

Nine characters encoded in 23 bits, not counting spaces between letters and words. That’s less than 2⅔ bits/character. How can this be?

etaoin shrdlu

Etaoin Shrdlu

Morse code is designed so that the most common English letters are represented by short sequences. E is a single •, T is a single −, whereas Q is − − • −. Q takes a long time to transmit, but the letter Q doesn’t occur that often, so that’s ok.
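
The tally for “I ate lunch” can be checked mechanically. A minimal sketch, counting each dot or dash as one bit as the slide does; the table holds only the letters needed here, plus Q for comparison.

```python
# International Morse code for the letters used in the phrase, plus Q.
MORSE = {'a': '.-', 'c': '-.-.', 'e': '.', 'h': '....', 'i': '..',
         'l': '.-..', 'n': '-.', 'q': '--.-', 't': '-', 'u': '..-'}

phrase = 'I ate lunch'
letters = [ch for ch in phrase.lower() if ch != ' ']
symbols = sum(len(MORSE[ch]) for ch in letters)

print(len(letters), 'letters,', symbols, 'dots and dashes')
print(round(symbols / len(letters), 2), 'per letter')
print('E costs', len(MORSE['e']), 'and Q costs', len(MORSE['q']))
```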

Similarly, the UTF-8 encoding is designed so that Unicode code points 0–127 (which ones are those, again?) take only a single byte, whereas code points represented by large numbers can take up to four bytes.

American imperialism or good engineering? You decide!

UTF-8 Variable-Length Encoding

Bits	Range	Byte 1	Byte 2	Byte 3	Byte 4
7	U+0000–U+007F	0xxxxxxx
11	U+0080–U+07FF	110xxxxx	10xxxxxx
16	U+0800–U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
21	U+10000–U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

ASCII never appears except as intended; NUL and slash never appear in other byte sequences.
Therefore, no kernel changes.
Find previous/next char are fast operations.
Self-synchronizing: if some bytes are damaged, it’s easy to find the beginning of the next/previous character.
0% overhead for ASCII text.
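
The table translates almost line for line into code. This is a sketch for illustration only, with invented names `utf8_encode` and `char_start`; it stops at U+10FFFF, the highest Unicode code point, and production code should use the built-in encoder.

```python
def utf8_encode(cp):
    """Encode one code point by filling in the x bits of the matching table row."""
    if 0xD800 <= cp <= 0xDFFF or not 0 <= cp <= 0x10FFFF:
        raise ValueError(f'not an encodable code point: {cp:#x}')
    if cp <= 0x7F:                         # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def char_start(data, i):
    """Back up from byte offset i to the first byte of the character containing it."""
    while i > 0 and (data[i] & 0xC0) == 0x80:   # 10xxxxxx marks a continuation byte
        i -= 1
    return i

print(utf8_encode(0x03A9).hex(' '))
print(char_start('a\U0001f42eb'.encode('utf-8'), 3))
```

The second function is the self-synchronizing property in action: continuation bytes always start with the bits 10, so finding a character boundary never requires scanning from the start of the string.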

Illustration of Various Encodings

U+0041 A LATIN CAPITAL LETTER A
U+03a9 Ω GREEK CAPITAL LETTER OMEGA
U+4dca ䷊ HEXAGRAM FOR PEACE
U+1f42e 🐮 COW FACE

Encoding	U+0041 A	U+03a9 Ω	U+4dca ䷊	U+1f42e 🐮
UTF-32BE	00000041	000003a9	00004dca	0001f42e
UTF-16BE	0041	03a9	4dca	d83d dc2e
UTF-8	41	ce a9	e4 b7 8a	f0 9f 90 ae
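
A table like this one can be regenerated with any language's built-in codecs. A short sketch in Python, using its standard codec names:

```python
# The four sample characters: A, omega, hexagram for peace, cow face.
samples = '\u0041\u03a9\u4dca\U0001f42e'

for encoding in ('utf-32-be', 'utf-16-be', 'utf-8'):
    cells = [ch.encode(encoding).hex() for ch in samples]
    print(f'{encoding:10}', *cells, sep='\t')
```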

Byte Order Mark

Often, files contain a “magic number”—initial bytes that indicate what sort of file it is.

Encoding	Bytes
UTF-32BE	00 00 FE FF
UTF-32LE	FF FE 00 00
UTF-16BE	FE FF
UTF-16LE	FF FE
UTF-8	EF BB BF

The character U+FEFF ZERO WIDTH NO-BREAK SPACE is also used as a Byte Order Mark, or BOM. When it appears as the first bytes of a data file, it indicates the encoding (assuming that you’re limited to Unicode).

If not processed as a BOM, then ZERO WIDTH NO-BREAK SPACE is mostly harmless.
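
Checking for a BOM is a simple prefix match. The sketch below uses the constants from Python's `codecs` module; the function name `sniff_bom` is made up, and a file with no BOM tells you nothing, so the guesswork from earlier still applies.

```python
import codecs

# Longest signatures first: the UTF-32LE mark begins with the UTF-16LE mark.
BOMS = [
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
]

def sniff_bom(data):
    """Return (encoding, data without the BOM), or (None, data) if no BOM is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data

encoding, payload = sniff_bom(b'\xff\xfe\x3d\x20')
print(encoding, payload.decode(encoding))
```

The ordering matters: testing for FF FE before FF FE 00 00 would misreport every UTF-32LE file as UTF-16LE.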

Email

Email used to be a real mess. MIME extensions came along to help:

From: Greg Redder <Greg.Redder@ColoState.EDU>
To: Jack Applin <Jack.Applin@colostate.edu>
Subject: Re: SNMP read only string
Date: Tue, 11 Oct 2016 22:25:56 +0000
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
MIME-Version: 1.0

SmFjaywNCg0KV2UgY2FuIHNldCB1cCByZWFkLW9ubHkgYWNjZXNzIHRvIHN3aXRjaGVzIGluIHRo
ZSBDUyBidWlsZGluZy4gICAgV2UnZCBuZWVkIHRvIG1ha2Ugc3VyZSB0aGF0IHRoZSB3aG9sZSBz
bm1wIHRyZWUgaXNuJ3QgcmV0cmlldmVkIGZyZXF1ZW50bHkgb3IgeW91IGNhbiBidXJ5IHRoZSBP
Uy4gICAgU28sIGRlcGVuZGluZyB1cG9uIGhvdyBzdXJlIHdlIGFyZSB0aGF0IHdvbid0IGhhcHBl
biBtaWdodCBkZXRlcm1pbmUgaG93IG1hbnkgc3dpdGNoZXMgd2UgcHJvdmlkZSBhY2Nlc3MgdG8g
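
The two headers describe two separate layers: base64 turns the text lines back into bytes, and the charset turns those bytes into characters. A minimal sketch that unwraps only the first body line shown above (a real mail reader would join all the lines first, since a multi-byte character could straddle a line break):

```python
import base64

# First line of the message body shown above.
first_line = ('SmFjaywNCg0KV2UgY2FuIHNldCB1cCByZWFkLW9ubHkgYWNjZXNz'
              'IHRvIHN3aXRjaGVzIGluIHRo')

raw = base64.b64decode(first_line)     # Content-Transfer-Encoding: base64 -> bytes
text = raw.decode('utf-8')             # charset="utf-8" -> characters
print(repr(text))
```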

Output

HTML5 defaults to UTF-8; hooray! However, it is still best to declare the encoding explicitly, using either of:
- <meta charset="UTF-8">
- <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
Many terminal emulators (xterm, gnome-terminal, PuTTY, etc.) still default to ISO-8859-1 (Latin-1). You have to change their configuration.
Many programs look at the environment variable LANG to determine whether to produce ASCII or UTF-8 output. Make sure that LANG=en_US.UTF-8

    ％echo $LANG
    en_US.UTF-8
    ％rm foo
    rm: cannot remove ‘foo’: No such file or directory
    ％unset LANG
    ％rm foo
    rm: cannot remove 'foo': No such file or directory

Input is hard!

Here’s the problem: there are now more characters available than there are keys on the keyboard. It’s not practical to have a keyboard with 137,000 keys: one with 1cm² keys would be about 3.7m × 3.7m.

Of course, people who write in Chinese and other languages with many characters have had to deal with this problem for quite some time.

Input: Copy & paste

Low-tech is sometimes the best. Create a file of your most-used Unicode chars:

½ ⅓ µ ♡ ° “ ” ‘ ’ … ☺ ☹ × ² ³ – —,

and copy & paste them as needed.

Input: Linux Scripts

I have tiny scripts named “onehalf”, “micro”, and so on, that display the corresponding character. Vim, for example, can read the output of a script into the file being edited: :r !onehalf return

    ％onehalf
    ½
    ％micro
    µ

My “u” program searches for Unicode characters by code point or regular expression, or can display them all:

    ％u active
    2622 ☢ RADIOACTIVE SIGN
    ％u tri.*fire
    2632 ☲ TRIGRAM FOR FIRE
    ％u u+2600 u+2603
    2600 ☀ BLACK SUN WITH RAYS
    2601 ☁ CLOUD
    2602 ☂ UMBRELLA
    2603 ☃ SNOWMAN
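
The “u” program itself is not shown here. The following is only a guess at how such a search could work, built on Python's `unicodedata` module; the function name `search_names` is invented, and the matches depend on the Unicode version bundled with the interpreter.

```python
import re
import sys
import unicodedata

def search_names(pattern):
    """Yield (code point, character, name) for names matching the regex `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, '')    # '' for code points that have no name
        if name and regex.search(name):
            yield cp, ch, name

for cp, ch, name in search_names(r'tri.*fire'):
    print(f'{cp:04x} {ch} {name}')
```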

Input Methods

Linux: control-shift-U hex-digits return
- It’s amazing how quickly one learns a few oft-used codes. I know that U+21d2 is ⇒, because I often enter Google appointments such as “Soda⇒CSU”.
Windows:
- Hold down Alt
- Press + on numeric pad
- Press digits from numeric pad, A–F from keyboard
- Release Alt
Vim: digraphs
- In Vim, control-K 1 2 produces ½. Do :dig return for all codes.
Emacs: God only knows

Problems

Life is still not quite perfect.

Fonts

Fonts be huge—137,000+ characters.
Some font formats allow only 256 characters.
Nearly all font formats allow only 65,536 characters.
I don’t believe that a complete Unicode 9.0 font exists.
Google’s Noto fonts.

Emojis

Unicode has, arguably, too many emojis (1791 at last count).

These are not established, long-used characters. They are recent inventions. This goes against the usual rules for Unicode.

Historical Mess

Unicode is one big collection of historical compromises. In a hundred years, will anybody care that these first three code points maintained compatibility with Latin-1?

U+00b2 ² SUPERSCRIPT TWO	U+2075 ⁵ SUPERSCRIPT FIVE
U+00b3 ³ SUPERSCRIPT THREE	U+2076 ⁶ SUPERSCRIPT SIX
U+00b9 ¹ SUPERSCRIPT ONE	U+2077 ⁷ SUPERSCRIPT SEVEN
U+2070 ⁰ SUPERSCRIPT ZERO	U+2078 ⁸ SUPERSCRIPT EIGHT
U+2074 ⁴ SUPERSCRIPT FOUR	U+2079 ⁹ SUPERSCRIPT NINE

Politics: the art of the possible.

More History

The double-struck alphabet, in code point order:

U+2102 ℂ	U+2124 ℤ	U+1d53d 𝔽	U+1d543 𝕃	U+1d54c 𝕌
U+210d ℍ	U+1d538 𝔸	U+1d53e 𝔾	U+1d544 𝕄	U+1d54d 𝕍
U+2115 ℕ	U+1d539 𝔹	U+1d540 𝕀	U+1d546 𝕆	U+1d54e 𝕎
U+2119 ℙ	U+1d53b 𝔻	U+1d541 𝕁	U+1d54a 𝕊	U+1d54f 𝕏
U+211a ℚ	U+1d53c 𝔼	U+1d542 𝕂	U+1d54b 𝕋	U+1d550 𝕐
U+211d ℝ

What’s wrong with this picture?

Spoofing

applin@example.org vs. aррlin@example.org?
- Sorry—the latter uses the Cyrillic letter er, which looks like the Latin letter p.
How about https://google.com vs. https://gοοgle.com?
- Oops—the second one uses the Greek omicron, not the Latin letter o. Too bad!
We can hardly forbid Russians the use of er, or Greeks the use of omicron. However, once google.com (with a Latin ‘o’) is in use, it seems unreasonable to allow the creation of gοοgle.com (with a Greek omicron).
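
One crude defense is to flag names whose letters come from more than one script. The sketch below guesses the script from the first word of each character's Unicode name, which is an approximation rather than the real Unicode Script property; the function names are made up for illustration.

```python
import unicodedata

def scripts_used(text):
    """Rough script guess: the first word of each letter's Unicode character name."""
    return {unicodedata.name(ch, 'UNKNOWN').split()[0]
            for ch in text if ch.isalpha()}

def looks_mixed(text):
    """True if the letters appear to come from more than one script."""
    return len(scripts_used(text)) > 1

print(sorted(scripts_used('google')))
print(sorted(scripts_used('g\u03bf\u03bfgle')))    # with two Greek omicrons
print(looks_mixed('a\u0440\u0440lin'))             # with two Cyrillic ers
```

This catches the two examples above, but not a name written entirely in look-alike letters from a single script.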

Canonical Form

Accented characters can be pre-baked, or created in two steps:

U+00f1 ñ LATIN SMALL LETTER N WITH TILDE
U+006e n LATIN SMALL LETTER N
U+0303 ◌̃ COMBINING TILDE
echo -e '\u00f1'; echo -e 'n\u0303'

Unicode decrees that U+00f1 and the sequence U+006e U+0303 be treated as the same. Comparing strings just got a lot harder.

Unicode calls this process normalization.
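
In practice this means normalizing both strings before comparing them. A minimal sketch using Python's `unicodedata.normalize`; the helper name `same_text` is made up, and NFC (the composed form) is just one reasonable choice of normalization form.

```python
import unicodedata

prebaked = '\u00f1'                        # LATIN SMALL LETTER N WITH TILDE
two_step = 'n\u0303'                       # n followed by COMBINING TILDE

def same_text(a, b):
    """Compare after normalizing both sides to the composed form (NFC)."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

print(prebaked == two_step)                # raw comparison of code points
print(same_text(prebaked, two_step))       # comparison after normalization
print(len(prebaked), len(two_step))        # lengths differ as well
```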

Programming

It’s all about bytes vs. characters. Too many languages have no byte type, so programmers use char instead. Trouble! The language has no idea whether you’re processing text, which should be treated as Unicode, or bytes of data, which would happen if a program were parsing a JPEG file.

Linux Commands

echo \u: up to four digits; \U: up to eight digits

    ％echo -e '\uf1'
    ñ
    ％echo -e '\U1f435'
    🐵

wc -c counts bytes; wc -m counts characters

    ％echo -e '\U1f435' | wc -c
    5
    ％echo -e '\U1f435' | wc -m
    2
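
The same distinction shows up inside a program as the difference between a byte string and a text string. A small sketch reproducing the two counts above, assuming UTF-8 as the encoding:

```python
monkey_line = '\U0001f435\n'               # the text that echo -e '\U1f435' writes

as_bytes = monkey_line.encode('utf-8')     # what actually travels through the pipe
print(len(as_bytes))                       # bytes, like wc -c
print(len(monkey_line))                    # characters, like wc -m
```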

C

strlen("€"): just bytes of data, or Unicode?
int ĵäçǩ; // fails for gcc before version 10
The datatype wchar_t (wide character type) is usually a typedef (synonym) for int on Linux, where it is large enough to hold any Unicode character. On Windows it is only 16 bits wide.
constants of wchar_t type: L'≥' and L"9÷3≤4"
functions such as wcslen() and wcscpy()

C Example

// Show the decimal value of each character read.

#include <locale.h>
#include <wchar.h>
#include <stdio.h>

int main() {
    setlocale(LC_ALL, "");			// Set locale per environment
    wchar_t buf[80];				// 80 wide characters

    printf("sizeof(buf)=%zu\n", sizeof(buf));	// byte size of 80 wchar_t?
    while (fgetws(buf, 80, stdin) != NULL) {	// duplication of information
	printf("%ls", buf);			// print the wide string
	for (size_t i=0; i<wcslen(buf); i++)	// O(N²) algorithm
	    printf("%d ", (int) buf[i]);
	puts("");
    }
}

C++ Example

// Show the decimal value of each character read.

#include <clocale>
#include <iostream>
#include <string>

using namespace std;

int main() {
    setlocale(LC_ALL, "");			// Set locale per environment
    wstring s;					// wstring, not string

    while (getline(wcin, s)) {			// wcin
	wcout << s << '\n';			// wcout
	for (size_t i=0; i<s.length(); i++)
	    wcout << int(s[i]) << ' ';
	wcout << '\n';
    }
}

Java Example

// Show the decimal value of each character read.

import java.util.*;

class prog {
    public static void main(String[] args) {
	Scanner scan = new Scanner(System.in);
	while (scan.hasNextLine()) {
	    String line = scan.nextLine();
	    System.out.println(line);
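	    for (int i=0; i<line.length(); ) {
		int cp = line.codePointAt(i);		// whole code point, even beyond U+FFFF
		System.out.print(cp+" ");
		i += Character.charCount(cp);		// step over both halves of a surrogate pair
	    }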
	    System.out.println("");
	}
    }
}

Perl Example

#! /usr/bin/perl

# Show the decimal value of each character read.

use 5.14.2;
use warnings;
use utf8;
use open qw(:std :encoding(utf8));

while (<>) {
    print;
    for my $β (/./g) {
	print ord($β), " ";
    }
    print "\n";
}

Python Example

#! /usr/bin/python3

# Show the decimal value of each character read.

import sys

for lïné in sys.stdin:
    print(lïné, end='')			# the line already ends with a newline
    for Φ in lïné:
        print(ord(Φ), end=' ')
    print()				# print is a function in Python 3

Resources

The gospel: http://unicode.org
http://unicode.org/Public/UNIDATA/
http://unicode.org/Public/UNIDATA/NamesList.txt
My Unicode script: ~applin/bin/u