Unicode equivalence

From Wikipedia, the free encyclopedia


Unicode contains numerous characters to maintain compatibility with existing standards, some of which are functionally equivalent to other characters or sequences of characters. Because of this, Unicode defines some sequences of code points as equivalent. Unicode provides two notions of equivalence: canonical and compatibility, the former being a subset of the latter. For example, the character n followed by the combining character ~ is canonically (and hence compatibility) equivalent to the single Unicode character ñ, while the typographic ligature ff is only compatibility equivalent to the sequence of two f characters.

Unicode normalization is a form of text normalization that transforms equivalent sequences of characters into the same representation, called a normalization form in the Unicode standard, but which will be called simply normal form in this article. For each of the two equivalence notions, Unicode defines two normal forms, one fully composed and one fully decomposed, resulting in four normal forms, abbreviated NFC, NFD, NFKC, and NFKD, which are detailed in this article. Unicode normalization is important in Unicode text processing applications, because it affects the semantics of comparing, searching, and sorting Unicode sequences.

Equivalence Notions

Canonical Equivalence

Underlying Unicode's concept of canonical equivalence are the reciprocal notions of character composition and decomposition. Character composition is the process of combining simpler characters into fewer precomposed characters, such as the n character and the combining ~ character into the single ñ character. Decomposition is the opposite process, breaking precomposed characters back into their component pieces.

Canonical equivalence is a form of equivalence between sequences of code points that are visually and functionally identical. For example, a precomposed diacritic letter is considered canonically equivalent to the corresponding base letter followed by the combining diacritic mark. In other words, the precomposed character ‘ü’ is canonically equivalent to the sequence of ‘u’ followed by the combining diaeresis ‘¨’. Similarly, Unicode unifies several Greek diacritics and punctuation characters with other diacritics and punctuation marks that have the same appearance.
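Canonical equivalence can be observed directly with a Unicode-aware library; the following sketch uses Python's standard unicodedata module with the ‘ü’ example above:

```python
import unicodedata

# Precomposed 'ü' (U+00FC) versus 'u' + combining diaeresis (U+0308)
precomposed = "\u00fc"
decomposed = "u\u0308"

# The two sequences are canonically equivalent: decomposing the
# precomposed character yields the two-code-point sequence, and
# composing the sequence yields the precomposed character.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
```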

Compatibility Equivalence

Compatibility equivalence is broader in scope than canonical equivalence. Anything that is canonically equivalent is also compatibility equivalent, but the opposite is not necessarily true. The compatibility equivalence notion is more concerned with plain text equivalence and may lump together some semantically distinct forms.

For example, superscript and subscript numerals are compatibility equivalent to their core decimal digit counterparts, even though they are not canonically equivalent to them. The rationale is that superscript and subscript forms, through their visually distinct presentation, can convey a distinct meaning, but there may be valid applications in which to consider them equivalent. Superscripts and subscripts can be handled in a less cumbersome manner in rich text formats (see next section).

Full-width and half-width katakana characters are also compatibility equivalent but not canonically equivalent, as are ligatures and their component letter sequences. For these latter examples, there is usually only a visual and not a semantic distinction. In other words, an author does not typically declare the presence of ligatures or vertical text as meaning one thing and non-ligatures and horizontal text as meaning something entirely different. Rather these are strictly visual typographic design choices.
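These compatibility-only equivalences can be demonstrated with Python's unicodedata module: canonical normalization leaves the ligature and the half-width katakana untouched, while compatibility normalization decomposes them:

```python
import unicodedata

ligature = "\ufb03"      # ffi ligature (U+FB03)
halfwidth_ka = "\uff76"  # half-width katakana KA (U+FF76)

# Canonical normalization (here NFD) preserves both characters...
assert unicodedata.normalize("NFD", ligature) == ligature
assert unicodedata.normalize("NFD", halfwidth_ka) == halfwidth_ka

# ...while compatibility normalization maps them to their
# compatibility equivalents.
assert unicodedata.normalize("NFKD", ligature) == "ffi"
assert unicodedata.normalize("NFKC", halfwidth_ka) == "\u30ab"  # katakana KA
```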

Normalization

The implementation of Unicode string searches and comparisons in text processing software must take into account the presence of equivalent code points. In the absence of this feature, users searching for a particular code point sequence would be unable to find other visually indistinguishable glyphs that have a different, but canonically equivalent, code point representation.

Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criterion can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the representative element of an equivalence class, multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a canonical ordering on the code point sequence, which is necessary for the normal forms to be unique.

In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criterion can affect search results. For instance, some typographic ligatures like U+FB03 (ffi), Roman numerals like U+2168 (Ⅸ), and even subscripts and superscripts, e.g. U+2075 (⁵), have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into its constituent letters, so a search for U+0066 (f) as a substring would succeed in NFKC(U+FB03) but not in NFC(U+FB03), and analogously when searching for U+0049 (I) in U+2168. The superscript U+2075 is transformed to U+0035 (5) by the compatibility mapping.
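The effect on searching can be sketched with Python's unicodedata module, reproducing the substring examples above:

```python
import unicodedata

# Searching for 'f' (U+0066) inside the ffi ligature (U+FB03):
assert "f" not in unicodedata.normalize("NFC", "\ufb03")  # ligature kept
assert "f" in unicodedata.normalize("NFKC", "\ufb03")     # decomposed to 'ffi'

# Searching for 'I' (U+0049) inside the Roman numeral nine (U+2168):
assert "I" not in unicodedata.normalize("NFC", "\u2168")
assert "I" in unicodedata.normalize("NFKC", "\u2168")     # becomes 'IX'

# The superscript five (U+2075) maps to the plain digit under NFKC:
assert unicodedata.normalize("NFKC", "\u2075") == "5"
```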

Transforming superscripts into baseline equivalents may not be appropriate, however, for rich text software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains compatibility formatting tags that provide additional details on the compatibility transformation.[1] In the case of typographic ligatures, this tag is simply <compat>, while for the superscript it is <super>. Rich text standards like HTML take the compatibility tags into account. For instance, HTML uses its own markup to position a U+0035 in a superscript position.[2]

Normal forms

Unicode defines four normal forms. These and the algorithms (transformations) for obtaining them are listed in the table below. All these forms impose the canonical order on the resulting sequence to guarantee uniqueness of the result over the corresponding equivalence class. All these algorithms are idempotent transformations, but none of them are injective due to singletons (see example after the table). Also, none of the normal forms are closed under string concatenation, meaning that the concatenation of two strings in the same normal form is not guaranteed to be in that normal form; this is due to the canonical ordering (see the next section for details).
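The lack of closure under concatenation can be illustrated in Python (unicodedata.is_normalized requires Python 3.8 or later):

```python
import unicodedata

s1 = "e"        # already in NFC
s2 = "\u0301"   # lone combining acute accent, also in NFC
assert unicodedata.is_normalized("NFC", s1)
assert unicodedata.is_normalized("NFC", s2)

# Their concatenation is NOT in NFC: the pair composes to 'é' (U+00E9).
combined = s1 + s2
assert not unicodedata.is_normalized("NFC", combined)
assert unicodedata.normalize("NFC", combined) == "\u00e9"
```

Software that builds strings incrementally must therefore re-normalize after concatenation rather than rely on the normal forms of the inputs.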

NFD
Normalization Form Canonical Decomposition
Characters are decomposed by canonical equivalence.
NFC
Normalization Form Canonical Composition
Characters are decomposed and then recomposed by canonical equivalence. It is possible for the result to be a different sequence of characters than the original, in the case of singletons (see the example below the table).
NFKD
Normalization Form Compatibility Decomposition
Characters are decomposed by compatibility equivalence.
NFKC
Normalization Form Compatibility Composition
Characters are decomposed by compatibility equivalence, then recomposed by canonical equivalence.

Certain code points are irretrievably aliased to other code points by all of the normalization transformations in the table above. An alternative way to put it is to say that such singletons never appear in any normal form. An example is U+212B (Å), the Angstrom sign, which is always replaced by the visually identical U+00C5 (Å, Latin capital letter A with ring above) in NFC, which in turn is equivalent in NFD to the two-character sequence U+0041 (A) followed by U+030A (combining ring above). Thus, none of the normalization functions are injective.
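The singleton example can be verified with Python's unicodedata module; both code points collapse to the same normal forms, which shows the loss of injectivity:

```python
import unicodedata

angstrom = "\u212b"  # Å, the Angstrom sign (a singleton)
a_ring = "\u00c5"    # Å, Latin capital letter A with ring above

# NFC always replaces the singleton with U+00C5...
assert unicodedata.normalize("NFC", angstrom) == a_ring

# ...and NFD decomposes both to A (U+0041) + combining ring above (U+030A).
assert unicodedata.normalize("NFD", angstrom) == "A\u030a"
assert unicodedata.normalize("NFD", a_ring) == "A\u030a"

# Hence normalization is not injective: distinct inputs, identical output.
assert unicodedata.normalize("NFC", angstrom) == unicodedata.normalize("NFC", a_ring)
```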

In the Unicode character database, singletons are those characters that have a non-empty decomposition mapping but no compatibility formatting tag, which makes the mapping canonical.

Canonical ordering

The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. For the examples in this section we assume these characters to be diacritics, even though in general some diacritics are not combining characters, and some combining characters are not diacritics.

Unicode assigns each character a combining class, which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a stable sorting algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are not considered equivalent.

For example, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (combining circumflex accent) U+0301 (combining acute accent). The combining classes of the two accents are both 230; thus U+1EBF is not equivalent to U+0065 U+0301 U+0302.

Since not all combining sequences have a precomposed equivalent (the last sequence in the previous example can only be reduced to U+00E9 U+0302), even the normal form NFC is affected by the behavior of combining characters.
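Both the combining-class reordering and the partial composition can be checked with Python's unicodedata module:

```python
import unicodedata

# Combining classes: acute (U+0301) and circumflex (U+0302) are both 230,
# while dot below (U+0323) is 220.
assert unicodedata.combining("\u0301") == 230
assert unicodedata.combining("\u0302") == 230
assert unicodedata.combining("\u0323") == 220

# Marks with different classes are reordered (220 sorts before 230)...
assert unicodedata.normalize("NFD", "e\u0302\u0323") == "e\u0323\u0302"

# ...but marks with the same class keep their original order (stable sort),
# so this sequence is left as-is and stays distinct from e + U+0302 + U+0301.
assert unicodedata.normalize("NFD", "e\u0301\u0302") == "e\u0301\u0302"

# NFC can only compose partially here: e + acute + circumflex becomes
# é (U+00E9) followed by the bare combining circumflex.
assert unicodedata.normalize("NFC", "e\u0301\u0302") == "\u00e9\u0302"
```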
