
Jack Applin

Unicode

Overview

  • Introduction
  • Chaos
  • Order
  • Problems
  • Programming
  • Resources

Introduction

  • Jack Applin
  • I am an amateur with regard to Unicode. It fascinates me.
  • This talk was developed with the aid of Vicodin and Codeine, so it may wander a bit.

Chaos

The time before Unicode.

Pre-ASCII

It’s all about the mapping of bits to symbols. But what bits should we use to represent a given symbol? Is ‘A’ represented by 1, 1003, or 65? There were many opinions:

  • Baudot code, a 5-bit modal telegraph code
  • CDC display code, 6-bit (10 per 60-bit word)
  • EBCDIC (8-bit)
    • ‘i’+1 ≠ ‘j’
  • Convergence was impeded by the usual bickering between organizations reluctant to abandon their proprietary solutions for a common standard.

ASCII (ISO-646)

  • US-ASCII, ISO-646
  • Published in 1963
  • Seven-bit code
    • Why not just use all eight bits?
    • Check your eight-bit privilege!
  • Eighth bit often used for parity
  • Lacks π æ ä ñ ç £ ¢ € ¥ ° • © ™ ≤ « » “ ” ‘ ’, not to mention Korean, Hebrew, etc.

National Use

Great, everything was standard! No, wait—the French still wanted their accents (à), the British wanted their pound sterling (£), etc.

A number of characters were designated as “National Use”, to be replaced by local characters. For example, @ was replaced by ‘à’ for French use, and ‘§’ for the Germans. Similarly, ‘\’ was replaced by ‘Ö’ for the Swedes, and ‘Ñ’ for Spain.

Swedish C programs looked like this:

    printf("Hello, world!Ön");

I’m told that one got used to it.

I’m Still Not Satisfied

Still, this was not good enough. Greeks, Russians, and Israelis needed entire alphabets of non-Latin characters, and the few characters reserved for national use were insufficient.

The character positions 128–255 were there for the taking, and so they got took. Many incompatible eight-bit extensions to ASCII were created.

Using the Eighth Bit

Character Set                               Sponsor
ArmSCII                                     Armenia
ISCII                                       India
YUSCII                                      Yugoslavia
PETSCII                                     Commodore
WISCII                                      Wang Computers
Roman8                                      Hewlett-Packard
Latin-1 (a.k.a. ISO-8859-1)
Windows-1252 (appallingly a.k.a. “ANSI”)    Microsoft

Convergence was impeded by the usual bickering between organizations reluctant to abandon their proprietary solutions for a common standard.

ISO-8859-X

ISO-8859-1, Latin-1: W. European
ISO-8859-2, Latin-2: Cent. European
ISO-8859-3, Latin-3: S. European
ISO-8859-4, Latin-4: N. European
ISO-8859-5, Latin/Cyrillic
ISO-8859-6, Latin/Arabic
ISO-8859-7, Latin/Greek
ISO-8859-8, Latin/Hebrew
ISO-8859-9, Latin-5: Turkish
ISO-8859-10, Latin-6: Nordic
ISO-8859-11, Latin/Thai
(There is no ISO-8859-12!)
ISO-8859-13, Latin-7: Baltic Rim
ISO-8859-14, Latin-8: Celtic
ISO-8859-15, Latin-9: Latin-1 with tweaks
ISO-8859-16, Latin-10: SE European

Non-European Languages

  • Shift-JIS: Japanese
  • BIG5: Chinese character encoding used in Taiwan, Hong Kong, and Macau for traditional Chinese characters.
  • GB: Chinese character encoding used in the People’s Republic of China.

All of these encodings are variable-length: one byte for ASCII, two bytes for Japanese/Chinese.

Which Encoding?

  • How did you know how any particular data file was encoded?
  • Guesswork, usually.
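
For instance, the single byte 0xE9 means different things under different guesses (a small Python illustration):

    b = bytes([0xE9])
    print(b.decode("iso-8859-1"))   # é, if the file was Latin-1
    print(b.decode("iso-8859-7"))   # ι, if it was actually Greek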

Order

Incorporation

One way to change something is through incorporation. You don’t try to change the existing thing—you just incorporate it into a bigger framework.

Physics
Relativity doesn’t invalidate Newtonian physics, at least not at human speeds.
Religion
Your god is just fine—you keep on worshipping him. However, you should know that he’s really just one god of many in our new pantheon. Welcome to our new theology, which encompasses your old one!
Character Sets
The national versions of ASCII, and the ISO-8859-X versions, were successful because plain ASCII still worked. They were supersets of ASCII.

Unicode (ISO-10646)

  • First published 1991
  • Version 11.0, published June 2018, has 137,374 characters.
  • Incorporates ASCII as code points 0–127 without change.
  • Incorporates ISO-8859-1 as code points 0–255 without change.
  • Code points, not encoding (patience)
    • U+0041 A LATIN CAPITAL LETTER A
    • U+2fc2 ⿂ KANGXI RADICAL FISH
    • U+1f355 🍕 SLICE OF PIZZA
    • Meaning, not pictures (glyphs)

What’s in Unicode

  • ASCII: A-Z
  • Other Latin: äñ«»
  • Cyrillic: Я
  • Hebrew: א
  • Chinese: ⿂
  • Japanese: ア
  • Dingbats: ✈☞✌✔✰☺♥♦♣♠•
  • Emoji: 🐱
  • Egyptian hieroglyphics: 𓁥
  • Mathematics: ∃x:x∉ℝ
  • Musical notation: 𝄞𝄵𝆖𝅘𝅥𝅮
  • no Klingon ☹

All Unicode “blocks”: http://unicode.org/Public/UNIDATA/Blocks.txt

Code Points

Initially, Unicode is all about mapping integers to characters:

Decimal    U+hex       Meaning                      Example
97         U+0061      LATIN SMALL LETTER A         a
9786       U+263a      WHITE SMILING FACE           ☺
66573      U+1040d     DESERET CAPITAL LETTER OW    𐐍

Now, do that for 110,000+ more characters.

Encoding

Fine, so we’ve defined this mapping. How do we actually represent those in a computer? That’s the job of an encoding. An encoding is a mapping of integers to bytes.

16-bit Encodings

UCS-2:

  • Fixed-length 16-bit.
  • Each character is two 8-bit bytes, whether in memory, or on a disk.
  • Certainly is straightforward.
  • Inadequate for modern Unicode, which has many more than 2¹⁶ characters. Can’t even represent U+1f554 🕔 CLOCK FACE FIVE OCLOCK.
  • Unicode originally had a much more modest scope—only living languages, so it might have worked for that.

16-bit Encodings

UTF-16:

  • Slightly variable-length: values ≤ U+FFFF take two bytes, other values take four bytes.
  • UTF-16BE (big-endian): U+203d ‽ INTERROBANG is 20 3d.
  • UTF-16LE (little-endian): U+203d ‽ INTERROBANG is 3d 20.
  • For values ≥ U+10000 and ≤ U+10ffff (see the sketch after this list):
    • Subtract out 0x10000
    • Emit U+D800 plus the top ten bits.
    • Emit U+DC00 plus the lower ten bits.
    • There are no valid code points U+D800…U+DFFF.
  • 100% overhead for ASCII text.
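
The surrogate arithmetic is easy to check; here is a rough Python sketch (the function name is just illustrative):

    def to_surrogates(cp):
        v = cp - 0x10000                  # subtract out 0x10000
        high = 0xD800 + (v >> 10)         # 0xD800 plus the top ten bits
        low = 0xDC00 + (v & 0x3FF)        # 0xDC00 plus the lower ten bits
        return high, low

    high, low = to_surrogates(0x1F42E)    # U+1f42e COW FACE
    print(f"{high:04x} {low:04x}")        # d83d dc2e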

32-bit Encodings

UTF-32:

  • straightforward rendering of the code point in binary, with the same problems about byte order:
    • UTF-32BE: big-endian version
    • UTF-32LE: little-endian version
  • 300% overhead for ASCII text.
    • Sure, disk space is cheap, but, c’mon.

False Positives

Hey, there’s a slash in this string! No, wait, there isn’t.

  • U+002f / SOLIDUS
  • U+262f ☯ YIN YANG

When using UTF-16 or UTF-32 encoding, a naïve algorithm will falsely detect a slash (oh, excuse me, a solidus) in one of the bytes of U+262f.

Similarly, a C-string cannot hold a UTF-16 or UTF-32 string, because of the embedded zero bytes.
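
Both problems are easy to see in Python:

    print("\u262f".encode("utf-16-be"))   # b'&/' — the second byte is an ASCII slash
    print("A".encode("utf-32-be"))        # b'\x00\x00\x00A' — embedded zero bytes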

Morse Code

Consider the phrase “I ate lunch”, in Morse Code:

  • I = ••
  • ate = •− − •
  • lunch = •−•• ••− −• −•−• ••••

Nine characters encoded in 23 bits, not counting spaces between letters and words. That’s less than 2⅔ bits/character. How can this be?

etaoin shrdlu

Etaoin Shrdlu

Morse code is designed so that the most common English letters are represented by short sequences. E is a single •, T is a single −, whereas Q is − − • −. Q takes a long time to transmit, but the letter Q doesn’t occur that often, so that’s ok.

Similarly, the UTF-8 encoding is designed so that Unicode code points 0–127 (which ones are those, again?) take only a single byte, whereas code points represented by large numbers can take up to four bytes.

American imperialism or good engineering? You decide!

UTF-8 Variable-Length Encoding

Bits    Range               Byte 1      Byte 2      Byte 3      Byte 4
7       U+0000–U+007F       0xxxxxxx
11      U+0080–U+07FF       110xxxxx    10xxxxxx
16      U+0800–U+FFFF       1110xxxx    10xxxxxx    10xxxxxx
21      U+10000–U+1FFFFF    11110xxx    10xxxxxx    10xxxxxx    10xxxxxx

  • ASCII never appears except as intended; NUL and slash never appear in other byte sequences.
  • Therefore, no kernel changes.
  • Find previous/next char are fast operations.
  • Self-synchronizing: if some bytes are damaged, it’s easy to find the beginning of the next/previous character.
  • 0% overhead for ASCII text.
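
As a rough check of the table, here is the two-byte row worked by hand in Python:

    cp = 0x03A9                               # Ω, in the U+0080–U+07FF row
    byte1 = 0b11000000 | (cp >> 6)            # 110xxxxx: the top five bits
    byte2 = 0b10000000 | (cp & 0b00111111)    # 10xxxxxx: the low six bits
    print(bytes([byte1, byte2]))              # b'\xce\xa9'
    print("\u03a9".encode("utf-8"))           # b'\xce\xa9' — the built-in encoder agrees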

Illustration of Various Encodings

  • U+0041 A LATIN CAPITAL LETTER A
  • U+03a9 Ω GREEK CAPITAL LETTER OMEGA
  • U+4dca ䷊ HEXAGRAM FOR PEACE
  • U+1f42e 🐮 COW FACE

Encoding    U+0041 A    U+03a9 Ω    U+4dca ䷊     U+1f42e 🐮
UTF-32BE    00000041    000003a9    00004dca     0001f42e
UTF-16BE    0041        03a9        4dca         d83d dc2e
UTF-8       41          ce a9       e4 b7 8a     f0 9f 90 ae
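
Python's encode() reproduces the table:

    for ch in "A\u03a9\u4dca\U0001f42e":
        for enc in ("utf-32-be", "utf-16-be", "utf-8"):
            print(enc, ch.encode(enc).hex())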

Byte Order Mark

Often, files contain a “magic number”—initial bytes that indicate what sort of file it is.

Encoding    Bytes
UTF-32BE    00 00 FE FF
UTF-32LE    FF FE 00 00
UTF-16BE    FE FF
UTF-16LE    FF FE
UTF-8       EF BB BF

The character U+FEFF ZERO WIDTH NO BREAK SPACE is also used as a Byte Order Mark, or BOM. When it appears as the first bytes of a data file, it indicates the encoding (assuming that you’re limited to Unicode).

If not processed as a BOM, then ZERO WIDTH NO BREAK SPACE is mostly harmless.
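
Encoding U+FEFF itself yields exactly the byte sequences in the table:

    for enc in ("utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le", "utf-8"):
        print(enc, "\ufeff".encode(enc).hex())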

Email

Email used to be a real mess. MIME extensions came along to help:

From: Greg Redder <Greg.Redder@ColoState.EDU>
To: Jack Applin <Jack.Applin@colostate.edu>
Subject: Re: SNMP read only string
Date: Tue, 11 Oct 2016 22:25:56 +0000
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
MIME-Version: 1.0

SmFjaywNCg0KV2UgY2FuIHNldCB1cCByZWFkLW9ubHkgYWNjZXNzIHRvIHN3aXRjaGVzIGluIHRo
ZSBDUyBidWlsZGluZy4gICAgV2UnZCBuZWVkIHRvIG1ha2Ugc3VyZSB0aGF0IHRoZSB3aG9sZSBz
bm1wIHRyZWUgaXNuJ3QgcmV0cmlldmVkIGZyZXF1ZW50bHkgb3IgeW91IGNhbiBidXJ5IHRoZSBP
Uy4gICAgU28sIGRlcGVuZGluZyB1cG9uIGhvdyBzdXJlIHdlIGFyZSB0aGF0IHdvbid0IGhhcHBl
biBtaWdodCBkZXRlcm1pbmUgaG93IG1hbnkgc3dpdGNoZXMgd2UgcHJvdmlkZSBhY2Nlc3MgdG8g
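
Decoding such a body is just base64 plus UTF-8; for example, with a made-up message:

    import base64

    wire = base64.b64encode("Hälsningar från Örebro.".encode("utf-8"))
    print(wire.decode("ascii"))                    # what travels in the email
    print(base64.b64decode(wire).decode("utf-8"))  # what the reader sees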

Output

  • HTML5 defaults to UTF-8; hooray! However:
    • <meta charset="UTF-8">
    • <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
  • Many terminal emulators (xterm, gnome-terminal, PuTTY, etc.) still default to ISO-8859-1 (Latin-1). You have to change their configuration.
  • Many programs look at the environment variable LANG to determine whether to produce ASCII or UTF-8 output. Make sure that LANG=en_US.UTF-8
    %echo $LANG
    en_US.UTF-8
    %rm foo
    rm: cannot remove ‘foo’: No such file or directory
    %unset LANG
    %rm foo
    rm: cannot remove 'foo': No such file or directory

Input is hard!

Here’s the problem: there are now more characters available than there are keys on the keyboard. It’s not practical to have a keyboard with 137,000 keys: one with 1cm² keys would be 3½m × 3½m.

Of course, people who write in Chinese and other languages with many characters have had to deal with this problem for quite some time.

Input: Copy & paste

Low-tech is sometimes the best. Create a file of your most-used Unicode chars:

½ ⅓ µ ♡ ° “ ” ‘ ’ … ☺ ☹ × ² ³ – —,

and copy & paste them as needed.

Input: Linux Scripts

  • I have tiny scripts named “onehalf”, “micro”, and so on, that display the corresponding character. Vim, for example, can read the output of a script into the file being edited: :r !onehalf return
    %onehalf
    ½
    %micro
    µ
  • My “u” program searches for Unicode characters by code point or regular expression, or can display them all:
    %u active
    2622 ☢ RADIOACTIVE SIGN
    %u tri.*fire
    2632 ☲ TRIGRAM FOR FIRE
    %u u+2600 u+2603
    2600 ☀ BLACK SUN WITH RAYS
    2601 ☁ CLOUD
    2602 ☂ UMBRELLA
    2603 ☃ SNOWMAN
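
A minimal sketch of how such a lookup might work (not the actual “u” program; Python’s unicodedata module knows the names, and this handles only the regular-expression case):

    #! /usr/bin/python3
    import re, sys, unicodedata

    pattern = re.compile(sys.argv[1], re.IGNORECASE)
    for cp in range(0x110000):                  # every possible code point
        name = unicodedata.name(chr(cp), "")    # "" if the code point has no name
        if name and pattern.search(name):
            print(f"{cp:04x} {chr(cp)} {name}")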

Input Methods

  • Linux: control-shift-U hex-digits return
    • It’s amazing how quickly one learns a few oft-used codes. I know that U+21d2 is ⇒, because I often enter Google appointments such as “Soda⇒CSU”.
  • Windows:
    • Hold down Alt
    • Press + on numeric pad
    • Press digits from numeric pad, A–F from keyboard
    • Release Alt
  • Vim: digraphs
    • In Vim, control-K 1 2 produces ½. Do :dig return for all codes.
  • Emacs: God only knows

Problems

Life is still not quite perfect.

Fonts

  • Fonts be huge—137,000+ characters.
  • Some font formats allow only 256 characters.
  • Nearly all font formats allow only 65,536 characters.
  • I don’t believe that a complete Unicode 9.0 font exists.
  • Google’s Noto fonts (“no tofu”) aim to cover all of Unicode.

Emojis

Unicode has, arguably, too many emojis (1791 at last count).

These are not established, long-used characters. They are recent inventions. This goes against the usual rules for Unicode.

Historical Mess

Unicode is one big collection of historical compromises. In a hundred years, will anybody care that superscript one, two, and three kept their Latin-1 code points while the rest of the superscript digits were placed elsewhere?

U+00b2 ² SUPERSCRIPT TWO      U+2075 ⁵ SUPERSCRIPT FIVE
U+00b3 ³ SUPERSCRIPT THREE    U+2076 ⁶ SUPERSCRIPT SIX
U+00b9 ¹ SUPERSCRIPT ONE      U+2077 ⁷ SUPERSCRIPT SEVEN
U+2070 ⁰ SUPERSCRIPT ZERO     U+2078 ⁸ SUPERSCRIPT EIGHT
U+2074 ⁴ SUPERSCRIPT FOUR     U+2079 ⁹ SUPERSCRIPT NINE

Politics: the art of the possible.

More History

The double-struck alphabet, in code point order:

U+2102 ℂ    U+2124 ℤ     U+1d53d 𝔽    U+1d543 𝕃    U+1d54c 𝕌
U+210d ℍ    U+1d538 𝔸    U+1d53e 𝔾    U+1d544 𝕄    U+1d54d 𝕍
U+2115 ℕ    U+1d539 𝔹    U+1d540 𝕀    U+1d546 𝕆    U+1d54e 𝕎
U+2119 ℙ    U+1d53b 𝔻    U+1d541 𝕁    U+1d54a 𝕊    U+1d54f 𝕏
U+211a ℚ    U+1d53c 𝔼    U+1d542 𝕂    U+1d54b 𝕋    U+1d550 𝕐
U+211d ℝ

What’s wrong with this picture?

Spoofing

  • applin@example.org vs. aррlin@example.org?
    • Sorry—the latter uses the Cyrillic letter er, which looks like the Latin letter p.
  • How about https://google.com vs. https://gοοgle.com?
    • Oops—the second one uses the Greek omicron, not the Latin letter o. Too bad!
  • We can hardly forbid Russians the use of er, or Greeks the use of omicron. However, once google.com (with a Latin ‘o’) is in use, it seems unreasonable to allow the creation of gοοgle.com (with a Greek omicron).
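
The lookalikes are easy to demonstrate:

    print("google" == "g\u03bf\u03bfgle")   # False: Greek omicrons, not Latin o
    print("p" == "\u0440")                  # False: Cyrillic er, not Latin p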

Canonical Form

Accented characters can be pre-baked, or created in two steps:

  • U+00f1 ñ LATIN SMALL LETTER N WITH TILDE
  • U+006e n LATIN SMALL LETTER N
  • U+0303 ◌̃ COMBINING TILDE
  • echo -e '\u00f1'; echo -e 'n\u0303'

Unicode decrees that U+00f1 and the sequence U+006e U+0303 be treated as the same. Comparing strings just got a lot harder.

Unicode calls this process normalization.
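
Python exposes this through the unicodedata module:

    import unicodedata

    precomposed = "\u00f1"             # ñ as a single code point
    combining = "n\u0303"              # n followed by COMBINING TILDE
    print(precomposed == combining)    # False: different code points
    print(unicodedata.normalize("NFC", combining) == precomposed)    # True
    print(unicodedata.normalize("NFD", precomposed) == combining)    # True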

Programming

It’s all about bytes vs. characters. Too many languages have no byte type, so programmers use char instead. Trouble! The language has no idea whether you’re processing text, which should be treated as Unicode, or bytes of data, which would happen if a program were parsing a JPEG file.
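
Python 3, at least, keeps the two types separate, which makes the distinction visible:

    s = "€"                    # text: one character
    b = s.encode("utf-8")      # data: three bytes
    print(len(s), len(b))      # 1 3
    print(b)                   # b'\xe2\x82\xac'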

Linux Commands

echo \u: up to four digits; \U: up to eight digits

    %echo -e '\uf1'
    ñ
    %echo -e '\U1f435'
    🐵

wc -c counts bytes; wc -m counts characters

    %echo -e '\U1f435' | wc -c
    5
    %echo -e '\U1f435' | wc -m
    2

C

  • strlen("€"): just bytes of data, or Unicode?
  • int ĵäçǩ; // fails for gcc
  • The datatype wchar_t (wide character type) is usually a typedef (synonym) for int; it is large enough to hold a Unicode character.
  • constants of wchar_t type: L'≥' and L"9÷3≤4"
  • functions such as wcslen() and wcscpy()

C Example

// Show the decimal value of each character read.

#include <locale.h>
#include <wchar.h>
#include <stdio.h>

int main() {
    setlocale(LC_ALL, "");			// Set locale per environment
    wchar_t buf[80];				// 80 wide characters

    printf("sizeof(buf)=%zu\n", sizeof(buf));	// byte size of 80 wchar_t?
    while (fgetws(buf, 80, stdin) != NULL) {	// duplication of information
	printf("%ls", buf);			// print the wide string
	for (size_t i=0; i<wcslen(buf); i++)	// O(N²) algorithm
	    printf("%d ", (int) buf[i]);
	puts("");
    }
}

C++ Example

// Show the decimal value of each character read.

#include <clocale>
#include <iostream>
#include <string>

using namespace std;

int main() {
    setlocale(LC_ALL, "");			// Set locale per environment
    wstring s;					// wstring, not string

    while (getline(wcin, s)) {			// wcin
	wcout << s << '\n';			// wcout
	for (size_t i=0; i<s.length(); i++)
	    wcout << int(s[i]) << ' ';
	wcout << '\n';
    }
}

Java Example

// Show the decimal value of each character read.

import java.util.*;

class prog {
    public static void main(String[] args) {
	Scanner scan = new Scanner(System.in);
	while (scan.hasNextLine()) {
	    String line = scan.nextLine();
	    System.out.println(line);
	    for (int i=0; i<line.length(); ) {
		int cp = line.codePointAt(i);		// whole code point, even beyond the BMP
		System.out.print(cp+" ");
		i += Character.charCount(cp);		// surrogate pairs occupy two chars
	    }
	    System.out.println("");
	}
    }
}

Perl Example

#! /usr/bin/perl

# Show the decimal value of each character read.

use 5.14.2;
use warnings;
use utf8;
use open qw(:std :encoding(utf8));

while (<>) {
    print;
    for my $β (/./g) {
	print ord($β), " ";
    }
    print "\n";
}

Python Example

#! /usr/bin/python3

# Show the decimal value of each character read.

import sys

for lïné in sys.stdin:
    print(lïné)
    for Φ in lïné:
        print(ord(Φ), end=' ')
    print()

Resources