International Character Sets & Encodings
UNICODE
A computing industry standard for the consistent encoding,
representation, and handling of text expressed in most of the world's writing
systems. Since 1991, the Unicode Consortium and the ISO have developed The
Unicode Standard and ISO/IEC 10646 in tandem. Published as The Unicode
Standard, the latest version of Unicode contains a repertoire of more than
128,000 characters covering 135 modern and historic scripts, as well as
multiple symbol sets. The repertoire, character names, and code points of
Unicode Version 2.0 exactly match those of ISO/IEC 10646-1:1993.
ISO/IEC 10646-1:1993
The Universal Coded Character Set (UCS) is a standard set of
characters defined by ISO/IEC 10646 and the basis of many character
encodings. It contains over 128,000 abstract characters, each identified by an
unambiguous name and an integer number called its code point. Characters
(letters, numbers, symbols, ideograms, logograms, etc.) from the many
languages, scripts, and traditions of the world are represented in the UCS with
unique code points. One could code the characters of this primordial ISO 10646
standard in one of three ways:
1. UCS-4, four bytes for every character, enabling the simple encoding of all
characters;
2. UCS-2, two bytes for every character, enabling the straightforward encoding
of the first plane, 0x20, the Basic Multilingual Plane, containing the first
36,864 code points, with other planes and groups reached by switching to them
with ISO 2022 escape sequences;
3. UTF-1, which encodes all the characters in sequences of bytes of varying
length (1 to 5 bytes, none of which is a control code).
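As a rough illustration of how each character pairs a name with a code point, and of what a fixed-width form looks like, the following Python sketch uses the standard unicodedata and struct modules. The helper name describe is hypothetical, and the four-byte and two-byte forms are only approximated here by packing the code point as a big-endian integer (the two-byte form is assumed to cover code points up to 0xFFFF, as in present-day usage).

```python
import struct
import unicodedata

def describe(ch):
    """Return (code point, name, 4-byte form, 2-byte form or None) for one character."""
    cp = ord(ch)                      # the integer code point
    name = unicodedata.name(ch)       # the unambiguous character name
    four_bytes = struct.pack(">I", cp)
    # Assumption: a two-byte form can only hold code points up to 0xFFFF.
    two_bytes = struct.pack(">H", cp) if cp <= 0xFFFF else None
    return cp, name, four_bytes, two_bytes

for ch in ["A", "\u20ac", "\U0001F600"]:
    cp, name, four_bytes, two_bytes = describe(ch)
    print(f"U+{cp:04X} {name}: {four_bytes.hex()} / {two_bytes.hex() if two_bytes else 'n/a'}")
```

The four-byte form can hold every code point directly, which is why it is the simplest of the three options; the two-byte form needs some additional mechanism for anything beyond its range.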
UTF-1
A way of transforming ISO 10646/Unicode into a stream of
bytes. Due to the design, it is not possible to resynchronize if decoding
starts in the middle of a character (this makes truncation hard, among other
things) and simple byte-oriented search routines cannot be reliably used with
it. UTF-1 is also fairly slow due to its use of division by a number which is
not a power of 2. Due to these issues, UTF-1 never gained wide acceptance and
has been replaced by UTF-8.
UTF-8
A character encoding scheme capable of encoding all possible
characters, or code points, defined by Unicode. It was originally designed by
Ken Thompson and Rob Pike. The encoding is variable-length
and uses 8-bit code units. It was designed for backward compatibility with
ASCII and to avoid the complications of endianness and byte order marks in the
alternative UTF-16 and UTF-32 encodings. The name is derived from: Universal Coded Character Set + Transformation Format – 8-bit.
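The variable length and the ASCII compatibility can be seen with Python's built-in codec. This is a minimal sketch; the helper name utf8_lengths is hypothetical and the sample characters are chosen only for illustration.

```python
def utf8_lengths(text):
    """Return the number of 8-bit code units UTF-8 uses for each character."""
    return [len(ch.encode("utf-8")) for ch in text]

# "A", e with acute accent, the Euro sign, and an emoji outside the first plane.
sample = "A\u00e9\u20ac\U0001F600"
print(utf8_lengths(sample))  # [1, 2, 3, 4]

# Text that is pure ASCII is byte-for-byte identical in UTF-8.
ascii_text = "plain ASCII"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))  # True
```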
UTF-8 is the dominant character encoding for the World Wide
Web. As of July 2016 it accounted for 87.2% of all Web pages, with the East
Asian encodings GB 2312 and Shift JIS at 0.8% and 1.1% respectively. The
Internet Mail Consortium (IMC) recommends that all e-mail programs be able to
display and create mail using UTF-8, and the W3C recommends UTF-8 as the
default encoding in XML and HTML.
UTF-16
A character encoding scheme capable of encoding all 1,112,064
valid code points in Unicode. The encoding is variable-length, as code points
are encoded with one or two 16-bit
code units. UTF-16 developed from an earlier fixed-width 16-bit encoding known
as UCS-2 (for 2-byte Universal Character Set) once it became clear that a
fixed-width 2-byte encoding could not encode enough characters to be truly
universal.
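The "one or two 16-bit code units" rule can be sketched as follows. The function name utf16_code_units is hypothetical; the arithmetic follows the usual surrogate-pair scheme, in which code points above 0xFFFF are split into a high and a low surrogate.

```python
def utf16_code_units(cp):
    """Return the list of 16-bit code units that encode one code point in UTF-16."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x10000:
        return [cp]                       # fits in a single code unit
    offset = cp - 0x10000                 # 20 bits remain to be split
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return [high, low]

print([hex(u) for u in utf16_code_units(0x41)])     # ['0x41']
print([hex(u) for u in utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']

# The count quoted above: all code points minus the 2,048 reserved for surrogates.
print(0x110000 - 0x800)  # 1112064
```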
UTF-32
An encoding of Unicode code points that uses
exactly 32 bits per code point. This makes UTF-32 a fixed-length encoding, in
contrast to all other Unicode transformation formats, which are variable-length
encodings.
The UTF-32 form of a code point is a direct representation of that code point's
numerical value.
The main advantage of UTF-32, versus variable-length
encodings, is that the Unicode code points are directly indexable. Examining
the nth code point is a constant-time operation. In contrast, a
variable-length code requires sequential access to find the nth code point.
This makes UTF-32 a simple replacement in code that uses integers to index
characters out of strings, as was commonly done for ASCII. The main
disadvantage of UTF-32 is that it is space inefficient. The HTML5 specification
states that authors should not use UTF-32, as its encoding detection algorithms
intentionally do not distinguish it from UTF-16.
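The direct indexing described above can be sketched in a few lines. The helper nth_code_point is a hypothetical name, and the sketch assumes big-endian UTF-32 data without a byte order mark.

```python
def nth_code_point(utf32_be, n):
    """Return the n-th code point (0-based) from big-endian UTF-32 bytes."""
    if n < 0:
        raise IndexError("negative index")
    start = 4 * n                          # every code point occupies 4 bytes
    chunk = utf32_be[start:start + 4]
    if len(chunk) != 4:
        raise IndexError("index out of range")
    return int.from_bytes(chunk, "big")    # the bytes are the code point's value

data = "a\u20ac\U0001F600".encode("utf-32-be")
print(len(data))                       # 12 bytes for 3 code points
print(hex(nth_code_point(data, 2)))    # 0x1f600
```

The same three characters take 8 bytes in UTF-8, which illustrates the space cost of the fixed width.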
ISO/IEC-8859-1 (1998)
Part 1 (Latin alphabet No. 1) of the standard "8-bit single-byte coded graphic
character sets". It is part of the ISO/IEC 8859 series of ASCII-based standard
character encodings and consists of 191 characters from the Latin script. It is
generally intended
for Western European languages and is the basis for most popular 8-bit
character sets.
ISO/IEC-8859-2 (1999)
Part 2 (Latin alphabet No. 2) of the standard "8-bit single-byte coded graphic
character sets". It is part of the ISO/IEC 8859 series of ASCII-based standard
character encodings and consists of 191 characters from the Latin script. It is
generally
intended for Central or "Eastern European" languages that are written
in the Latin script. Note that ISO/IEC 8859-2 is very different from code page
852 (MS-DOS Latin 2, PC Latin 2) which is also referred to as
"Latin-2" in Czech and Slovak regions.
WINDOWS-1252
This character encoding is a superset of ISO
8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1
by using displayable characters rather than control characters in the 80 to 9F
(hex) range. Notable additional characters are curly quotation marks, the Euro
sign, and all the printable characters that are in ISO 8859-15. It is known to
Windows by the code page number 1252, and by the IANA-approved name
"windows-1252".
It is very common to mislabel Windows-1252
text with the charset label ISO-8859-1. A common result is that all the quotes
and apostrophes (produced by "smart quotes" in word-processing
software) are replaced with question marks or boxes on non-Windows operating
systems, making text difficult to read. Most modern web browsers and e-mail
clients treat the MIME charset ISO-8859-1 as Windows-1252 to accommodate such
mislabeling. This is now standard behavior in the HTML5 specification, which
requires that documents advertised as ISO-8859-1 actually be parsed with the
Windows-1252 encoding.
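The effect of the mislabeling can be reproduced with Python's built-in codecs. In this sketch the helper name decode_both and the sample bytes are made up for illustration; "cp1252" and "latin-1" are Python's codec names for Windows-1252 and ISO-8859-1.

```python
def decode_both(raw):
    """Decode the same bytes as Windows-1252 and as ISO-8859-1."""
    return raw.decode("cp1252"), raw.decode("latin-1")

# Bytes 0x93 and 0x94 are curly quotes and 0x80 is the Euro sign in Windows-1252.
raw = b"\x93quoted\x94 \x80"
as_cp1252, as_latin1 = decode_both(raw)

print(as_cp1252)          # curly quotes and a Euro sign
print(repr(as_latin1))    # the same bytes become invisible C1 control characters
```

Both decodings agree on every byte outside the 80 to 9F range, which is why treating ISO-8859-1 labels as Windows-1252 is usually harmless.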
Historically, the phrase "ANSI Code
Page" (ACP) has been used in Windows to refer to various code pages considered
as native. The intention was that most of these would be ANSI standards such as
ISO-8859-1. Even though Windows-1252 was the first and by far most popular code
page named so in Microsoft Windows parlance, the code page has never been an
ANSI standard. Microsoft explains, "The term ANSI as used to signify
Windows code pages is a historical reference, but is nowadays a misnomer that
continues to persist in the Windows community."
ISO/IEC 2022
An ISO
standard specifying a technique for including multiple character sets in a
single character encoding system, and a technique for representing these
character sets in both 7 and 8 bit systems using the same encoding.
To represent
multiple character sets, the ISO/IEC 2022 character encodings include escape
sequences which indicate the character set for characters which follow. The
escape sequences are registered with ISO and follow the patterns defined within
the standard. These character encodings require data to be processed
sequentially in a forward direction since the correct interpretation of the
data depends on previously encountered escape sequences. Note that other
standards, such as ISO-2022-JP, may impose additional conditions, for example
that the current character set is reset to US-ASCII at the end of a line.
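The role of the escape sequences can be observed with Python's iso2022_jp codec. This is only a sketch: the sample string is arbitrary, and the specific escape sequences shown in the comments are those this codec is expected to emit for JIS X 0208 and ASCII.

```python
ESC = b"\x1b"

# Latin "A", two kanji, then Latin "B".
text = "A\u65e5\u672cB"
encoded = text.encode("iso2022_jp")
print(encoded)

# One escape switches to the two-byte Japanese set, another switches back to ASCII.
print(encoded.count(ESC))                 # 2
print(all(b < 0x80 for b in encoded))     # True: the stream is 7-bit clean

# Without the preceding escape sequence, the kanji bytes read as plain ASCII.
kanji_bytes = encoded[4:8]
print(kanji_bytes.decode("iso2022_jp"))
```

The last line shows why such data has to be processed sequentially: the same four bytes mean different things depending on which escape sequence came before them.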
Shift-JIS
A character
encoding for the Japanese language, originally developed by a Japanese company
called ASCII Corporation in conjunction with Microsoft and standardized as JIS
X 0208 Appendix 1. The single-byte characters 0x00 to 0x7F match the ASCII
encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at
0x7E in place of the ASCII character set's backslash and tilde respectively.
The single-byte characters from 0xA1 to 0xDF map to the half-width katakana
characters found in JIS X 0201. The lead bytes for the double-byte characters
are "shifted" around the 63 half-width katakana characters in the single-byte
range 0xA1 to 0xDF.
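The byte layout described above can be sketched as a small classifier. The function name sjis_byte_kind is hypothetical, and the lead-byte ranges 0x81 to 0x9F and 0xE0 to 0xEF are an assumption taken from the commonly documented layout rather than from the text above.

```python
def sjis_byte_kind(byte):
    """Classify the first byte of a Shift-JIS character."""
    if byte <= 0x7F:
        return "ascii"        # single byte, ASCII range
    if 0xA1 <= byte <= 0xDF:
        return "katakana"     # single-byte half-width katakana
    if 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xEF:
        return "lead"         # first byte of a double-byte character (assumed ranges)
    return "other"

half_width_a = "\uff71".encode("shift_jis")   # half-width katakana A
hiragana_a = "\u3042".encode("shift_jis")     # hiragana A

print(half_width_a, sjis_byte_kind(half_width_a[0]))
print(hiragana_a, sjis_byte_kind(hiragana_a[0]))
print(0xDF - 0xA1 + 1)   # 63 byte values in the half-width katakana range
```

The lead-byte ranges sit on either side of the katakana block, which is the "shift" the encoding is named for.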
GB 2312
The registered internet name for a key official character set
of the People's Republic of China, used for simplified Chinese characters. GB
abbreviates Guojia Biaozhun, which means national standard in Chinese.
Big5
A Chinese character encoding method used in Taiwan, Hong
Kong, and Macau for Traditional Chinese characters.