International Character Sets & Encodings

UNICODE

A computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. Since 1991, the Unicode Consortium and the ISO have developed The Unicode Standard and ISO/IEC 10646 in tandem. Published as The Unicode Standard, the latest version of Unicode contains a repertoire of more than 128,000 characters covering 135 modern and historic scripts, as well as multiple symbol sets. The repertoire, character names, and code points of Unicode Version 2.0 exactly match those of ISO/IEC 10646-1:1993.
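The character names and code points mentioned above can be inspected directly. The following is a small illustrative sketch using Python's standard unicodedata module; the helper name describe is made up for this example and is not part of any standard.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point (U+XXXX notation) and the Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Each character has exactly one code point and one formal name.
for ch in "A\u00e9\u20ac":
    print(describe(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+00E9 LATIN SMALL LETTER E WITH ACUTE
# U+20AC EURO SIGN
```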

ISO/IEC 10646-1:1993

The Universal Coded Character Set (UCS) is a standard set of characters defined by ISO/IEC 10646. The UCS is the basis of many character encodings and contains over 128,000 abstract characters, each identified by an unambiguous name and an integer called its code point. Characters (letters, numbers, symbols, ideograms, logograms, etc.) from the many languages, scripts, and traditions of the world are represented in the UCS with unique code points. The characters of the original ISO 10646 standard could be coded in one of three ways:

1. UCS-4, four bytes for every character, enabling the simple encoding of all characters;

2. UCS-2, two bytes for every character, enabling the encoding of the first plane, 0x20, the Basic Multilingual Plane, containing the first 36,864 code points, straightforwardly, and other planes and groups by switching to them with ISO 2022 escape sequences;

3. UTF-1, which encodes all the characters in sequences of bytes of varying length (1 to 5 bytes, none of which is a control code).

UTF-1

A way of transforming ISO 10646/Unicode into a stream of bytes. Due to the design, it is not possible to resynchronize if decoding starts in the middle of a character (this makes truncation hard, among other things) and simple byte-oriented search routines cannot be reliably used with it. UTF-1 is also fairly slow due to its use of division by a number which is not a power of 2. Due to these issues, UTF-1 never gained wide acceptance and has been replaced by UTF-8.

UTF-8

A character encoding scheme capable of encoding all possible characters, or code points, defined by Unicode. It was originally designed by Ken Thompson and Rob Pike. The encoding is variable-length and uses 8-bit code units. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in the alternative UTF-16 and UTF-32 encodings. The name is derived from: Universal Coded Character Set + Transformation Format, 8-bit.
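The variable length and the ASCII compatibility described above can be seen by encoding a few sample characters. This is only a sketch using Python's built-in utf-8 codec; the sample characters and the helper name utf8_hex are arbitrary choices for illustration.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a string as space-separated hex."""
    return ch.encode("utf-8").hex(" ")


# One to four 8-bit code units, depending on the code point.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
# U+0041 -> 41
# U+00E9 -> c3 a9
# U+20AC -> e2 82 ac
# U+1F600 -> f0 9f 98 80

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True
```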

UTF-8 is the dominant character encoding for the World Wide Web. As of July 2016 it accounted for 87.2% of all Web pages, with the East Asian encodings GB 2312 at 0.8% and Shift JIS at 1.1%. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8, and the W3C recommends UTF-8 as the default encoding in XML and HTML.

UTF-16

A character encoding scheme capable of encoding all 1,112,064 valid code points of Unicode. The encoding is variable-length: each code point is encoded with one or two 16-bit code units. UTF-16 developed from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set) once it became clear that a fixed-width 2-byte encoding could not encode enough characters to be truly universal.
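How a code point becomes one or two 16-bit code units can be sketched as follows. The function name to_utf16_units is hypothetical; the arithmetic is the standard surrogate-pair construction, and the result is cross-checked against Python's own utf-16-be codec.

```python
def to_utf16_units(cp: int) -> list:
    """Return the UTF-16 code units (as integers) for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp < 0x10000:
        return [cp]                      # one unit: the value itself
    v = cp - 0x10000                     # 20 bits remain
    return [0xD800 + (v >> 10),          # high surrogate: top 10 bits
            0xDC00 + (v & 0x3FF)]        # low surrogate: bottom 10 bits


print([f"{u:04X}" for u in to_utf16_units(0x20AC)])   # ['20AC']
print([f"{u:04X}" for u in to_utf16_units(0x1F600)])  # ['D83D', 'DE00']

# Cross-check against the built-in codec (big-endian, no byte order mark).
assert "\U0001F600".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDE, 0x00])

# The 1,112,064 figure: all code points minus the 2,048 surrogate values.
encodable = 0x110000 - (0xDFFF - 0xD800 + 1)
print(encodable)  # 1112064
```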

UTF-32

An encoding of Unicode code points that uses exactly 32 bits per code point. This makes UTF-32 a fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length encodings. The UTF-32 form of a code point is a direct representation of that code point's numerical value.

The main advantage of UTF-32 over variable-length encodings is that Unicode code points are directly indexable: examining the nth code point is a constant-time operation, whereas a variable-length encoding requires sequential access to find it. This makes UTF-32 a simple replacement in code that uses integers to index characters out of strings, as was commonly done for ASCII. The main disadvantage of UTF-32 is that it is space-inefficient, using four bytes for every code point. HTML5 states that authors should not use UTF-32, because the encoding detection algorithms described in that specification intentionally do not distinguish it from UTF-16.
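The indexing property and the space cost can both be illustrated with a short sketch. The helper code_point_at is a made-up name; it assumes a little-endian UTF-32 buffer without a byte order mark.

```python
import struct

text = "a\u00e9\u20ac\U0001F600"          # four code points
utf32 = text.encode("utf-32-le")          # explicit endianness, so no BOM
utf8 = text.encode("utf-8")


def code_point_at(buf: bytes, n: int) -> int:
    """Read the n-th code point from a UTF-32-LE buffer with one offset calculation."""
    return struct.unpack_from("<I", buf, 4 * n)[0]


print(hex(code_point_at(utf32, 3)))  # 0x1f600
print(len(utf32), len(utf8))         # 16 10
```

In the UTF-8 buffer the same lookup would require scanning from the start, because the preceding characters occupy 1, 2, and 3 bytes respectively.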

ISO/IEC 8859-1 (1998)

ISO/IEC 8859-1, "8-bit single-byte coded graphic character sets, Part 1: Latin alphabet No. 1", is part of the ISO/IEC 8859 series of ASCII-based standard character encodings and consists of 191 characters from the Latin script. It is generally intended for Western European languages and is the basis for most popular 8-bit character sets.

ISO/IEC 8859-2 (1999)

ISO/IEC 8859-2, "8-bit single-byte coded graphic character sets, Part 2: Latin alphabet No. 2", is part of the ISO/IEC 8859 series of ASCII-based standard character encodings and consists of 191 characters from the Latin script. It is generally intended for Central or "Eastern European" languages that are written in the Latin script. Note that ISO/IEC 8859-2 is very different from code page 852 (MS-DOS Latin 2, PC Latin 2), which is also referred to as "Latin-2" in Czech and Slovak regions.
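The difference between these single-byte encodings shows up when the same byte is decoded under each of them. The sketch below uses Python's built-in iso8859_1, iso8859_2, and cp852 codecs; the byte 0xB1 is an arbitrary example.

```python
BYTE = b"\xb1"

# The same byte value means a different character in each encoding.
decoded = {codec: BYTE.decode(codec) for codec in ("iso8859_1", "iso8859_2", "cp852")}
for codec, ch in decoded.items():
    print(f"{codec:10} 0xB1 -> U+{ord(ch):04X} {ch}")

# In ISO 8859-1, every byte value equals the Unicode code point it decodes to.
latin1_is_identity = all(bytes([b]).decode("iso8859_1") == chr(b) for b in range(256))
print(latin1_is_identity)  # True

# 95 ASCII graphic characters (0x20-0x7E) plus 96 positions in 0xA0-0xFF
# gives the 191 characters quoted for Part 1.
graphic = len(range(0x20, 0x7F)) + len(range(0xA0, 0x100))
print(graphic)  # 191
```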

WINDOWS-1252

This character encoding is a superset of ISO 8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range. Notable additional characters are curly quotation marks, the Euro sign, and all the printable characters that are in ISO 8859-15. It is known to Windows by the code page number 1252, and by the IANA-approved name "windows-1252".

Windows-1252 text is very commonly mislabeled with the charset label ISO-8859-1. A common result was that quotes and apostrophes (produced by "smart quotes" in word-processing software) were replaced with question marks or boxes on non-Windows operating systems, making text difficult to read. Most modern web browsers and e-mail clients therefore treat the MIME charset ISO-8859-1 as Windows-1252. This is now standard behavior in the HTML5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.
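The effect of such mislabeling can be reproduced with a short sketch. It assumes Python's cp1252 and iso8859_1 codecs as stand-ins for the two labels; the sample text is invented for illustration.

```python
original = "\u201csmart quotes\u201d and 5 \u20ac"   # curly quotes and a Euro sign
raw = original.encode("cp1252")

print(raw.hex(" "))
# The curly quotes become bytes 0x93 and 0x94, the Euro sign 0x80.

as_1252 = raw.decode("cp1252")        # correct label: text round-trips
as_latin1 = raw.decode("iso8859_1")   # wrong label: bytes become C1 control characters

print(as_1252 == original)  # True
print(repr(as_latin1))      # the quotes show up as '\x93' and '\x94', the Euro sign as '\x80'
```

A strict ISO-8859-1 decoder does not fail on these bytes; it silently yields invisible control characters, which is why the damage often went unnoticed until the text was displayed.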

Historically, the phrase "ANSI Code Page" (ACP) is used in Windows to refer to various code pages considered as native. The intention was that most of these would be ANSI standards such as ISO-8859-1. Even though Windows-1252 was the first and by far most popular code page named so in Microsoft Windows parlance, the code page has never been an ANSI standard. Microsoft explains, "The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community."

ISO/IEC 2022

An ISO standard specifying a technique for including multiple character sets in a single character encoding system, and a technique for representing these character sets in both 7-bit and 8-bit systems using the same encoding.

To represent multiple character sets, the ISO/IEC 2022 character encodings include escape sequences which indicate the character set for the characters that follow. The escape sequences are registered with ISO and follow the patterns defined within the standard. These character encodings require data to be processed sequentially in a forward direction, since the correct interpretation of the data depends on previously encountered escape sequences. Note that encodings built on ISO/IEC 2022, such as ISO-2022-JP, may impose additional conditions, for example that the current character set is reset to US-ASCII at the end of a line.
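The escape sequences and the need for forward, sequential processing can be observed with Python's built-in iso2022_jp codec. This is only a sketch; the sample string is arbitrary, and the slicing assumes the codec emits the usual three-byte sequences ESC $ B (switch to JIS X 0208) and ESC ( B (switch back to ASCII).

```python
text = "abc\u65e5\u672c\u8a9edef"        # ASCII, three kanji, ASCII
encoded = text.encode("iso2022_jp")
print(encoded)

TO_KANJI = b"\x1b$B"   # ESC $ B
TO_ASCII = b"\x1b(B"   # ESC ( B

# Everything is 7-bit; only the escape sequences say how to read the bytes.
all_seven_bit = max(encoded) < 0x80

# Cut out the kanji payload and decode it WITHOUT its leading escape sequence:
# the decoder starts in ASCII mode, so the same bytes are read as ASCII text.
start = encoded.index(TO_KANJI) + len(TO_KANJI)
end = encoded.index(TO_ASCII)
payload = encoded[start:end]
misread = payload.decode("iso2022_jp")

print(all_seven_bit)        # True
print(len(payload))         # 6 (two bytes per kanji)
print(misread == "\u65e5\u672c\u8a9e")  # False
```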

Shift-JIS

A character encoding for the Japanese language, originally developed by the Japanese company ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1. The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively. The single-byte characters from 0xA1 to 0xDF map to the half-width katakana characters found in JIS X 0201. The lead bytes for the double-byte characters are "shifted" around the 63 half-width katakana characters in the single-byte range 0xA1 to 0xDF.
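The mix of single-byte and double-byte characters can be sketched as follows. The function split_sjis is hypothetical, and the lead-byte ranges it uses (0x81 to 0x9F and 0xE0 to 0xEF) are an assumption taken from the commonly cited Shift-JIS layout rather than from the text above; vendor extensions use further lead bytes.

```python
def split_sjis(data: bytes) -> list:
    """Split Shift-JIS bytes into per-character byte strings (basic layout only)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # Lead bytes sit on either side of the katakana range 0xA1-0xDF.
        is_lead = 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF
        width = 2 if is_lead else 1
        out.append(data[i:i + width])
        i += width
    return out


# Half-width katakana A, hiragana A, and an ASCII letter.
text = "\uff71\u3042A"
data = text.encode("shift_jis")
pieces = split_sjis(data)
print([p.hex() for p in pieces])  # ['b1', '82a0', '41']
```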

GB 2312

The registered internet name for a key official character set of the People's Republic of China, used for simplified Chinese characters. GB abbreviates Guojia Biaozhun, which means national standard in Chinese.

Big5

A Chinese character encoding method used in Taiwan, Hong Kong, and Macau for Traditional Chinese characters.