fn:normalize-unicode
Returns the value of $arg
after applying Unicode normalization.
Signatures
fn:normalize-unicode($arg as xs:string?) as xs:string
fn:normalize-unicode(
$arg as xs:string?,
$normalizationForm as xs:string
) as xs:string
Properties
This function is deterministic, context-independent, and focus-independent.
Rules
If the value of $arg is the empty sequence, the function returns the zero-length string.
If the single-argument version of the function is used, the result is the same as calling the two-argument version with $normalizationForm set to the string "NFC".
Otherwise, the function returns the value of $arg normalized according to the rules of the normalization form identified by the value of $normalizationForm.
The effective value of $normalizationForm is the value of the expression fn:upper-case(fn:normalize-space($normalizationForm)).
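As a rough illustration only, the effective value can be mimicked outside XPath. The Python sketch below is not part of the specification: `effective_form` is a hypothetical helper name, the whitespace handling approximates fn:normalize-space (which treats only x20, x9, xA, and xD as whitespace), and `str.upper` approximates fn:upper-case.

```python
import re

# XML whitespace as recognised by fn:normalize-space: space, tab, LF, CR.
_XML_WS = " \t\n\r"


def effective_form(normalization_form: str) -> str:
    """Hypothetical helper: trim, collapse internal whitespace, upper-case."""
    trimmed = normalization_form.strip(_XML_WS)
    collapsed = re.sub(r"[ \t\n\r]+", " ", trimmed)
    return collapsed.upper()
```

Under this sketch, `" nfkd "` and `"NFKD"` select the same normalization form, and a whitespace-only argument has the zero-length string as its effective value.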
- If the effective value of $normalizationForm is NFC, then the function returns the value of $arg converted to Unicode Normalization Form C (NFC).
- If the effective value of $normalizationForm is NFD, then the function returns the value of $arg converted to Unicode Normalization Form D (NFD).
- If the effective value of $normalizationForm is NFKC, then the function returns the value of $arg converted to Unicode Normalization Form KC (NFKC).
- If the effective value of $normalizationForm is NFKD, then the function returns the value of $arg converted to Unicode Normalization Form KD (NFKD).
- If the effective value of $normalizationForm is FULLY-NORMALIZED, then the function returns the value of $arg converted to fully normalized form.
- If the effective value of $normalizationForm is the zero-length string, no normalization is performed and $arg is returned.
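The rules above can be sketched in Python using the standard `unicodedata` module. This is a minimal illustration under stated assumptions, not a conforming implementation: `normalize_unicode` and `UnsupportedNormalizationForm` are hypothetical names, the empty sequence is modelled as `None`, and this sketch chooses not to support FULLY-NORMALIZED, so it treats that form as unsupported.

```python
import re
import unicodedata
from typing import Optional

# Forms handled by this sketch; unicodedata.normalize accepts exactly these names.
_SUPPORTED = {"NFC", "NFD", "NFKC", "NFKD"}


class UnsupportedNormalizationForm(ValueError):
    """Hypothetical stand-in for the dynamic error raised for unsupported forms."""


def normalize_unicode(arg: Optional[str], normalization_form: str = "NFC") -> str:
    # The empty sequence (modelled as None) yields the zero-length string.
    if arg is None:
        return ""
    # Effective value: upper-case(normalize-space($normalizationForm)).
    trimmed = normalization_form.strip(" \t\n\r")
    form = re.sub(r"[ \t\n\r]+", " ", trimmed).upper()
    # A zero-length effective value means no normalization is performed.
    if form == "":
        return arg
    if form not in _SUPPORTED:
        raise UnsupportedNormalizationForm(form)
    return unicodedata.normalize(form, arg)
```

For instance, `normalize_unicode("e\u0301")` composes the letter and the combining acute accent into the single character U+00E9, while passing `""` as the form returns the input untouched.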
Normalization forms NFC, NFD, NFKC, and NFKD, and the algorithms to be used for converting a string to each of these forms, are defined in [UAX #15].
The motivation for normalization form FULLY-NORMALIZED is explained in [Character Model for the World Wide Web 1.0: Normalization]. However, as that specification did not progress beyond working draft status, the normative specification is as follows:
- A string is fully-normalized if (a) it is in normalization form NFC as defined in [UAX #15], and (b) it does not start with a composing character.
- A composing character is a character that is one or both of the following:
  - the second character in the canonical decomposition mapping of some character that is not listed in the Composition Exclusion Table defined in [UAX #15];
  - of non-zero canonical combining class (as defined in [The Unicode Standard]).
- A string is converted to FULLY-NORMALIZED form as follows:
  - if the first character in the string is a composing character, prepend a single space (x20);
  - convert the resulting string to normalization form NFC.
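The conversion steps above can be approximated in Python as follows. This is a sketch under assumptions, not a normative algorithm: the function names are hypothetical, and because `unicodedata` does not expose the Composition Exclusion Table directly, the code treats a character as "not excluded" when NFC recomposes its two-character canonical decomposition back into that character.

```python
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def _second_chars_of_composites() -> frozenset:
    """Collect second characters of two-character canonical decompositions.

    Assumption: a composite counts as 'not excluded' if NFC recomposes
    its decomposition mapping back into the composite itself.
    """
    found = set()
    for cp in range(0x110000):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        # Skip characters with no mapping or a compatibility mapping (<tag> ...).
        if not mapping or mapping.startswith("<"):
            continue
        parts = [chr(int(h, 16)) for h in mapping.split()]
        if len(parts) == 2 and unicodedata.normalize("NFC", "".join(parts)) == ch:
            found.add(parts[1])
    return frozenset(found)


def is_composing(ch: str) -> bool:
    # Either condition is sufficient: non-zero combining class, or
    # second character of a non-excluded canonical decomposition.
    return unicodedata.combining(ch) != 0 or ch in _second_chars_of_composites()


def to_fully_normalized(s: str) -> str:
    # Step 1: prepend a space if the string starts with a composing character.
    if s and is_composing(s[0]):
        s = " " + s
    # Step 2: convert the result to NFC.
    return unicodedata.normalize("NFC", s)
```

The table scan visits every code point once and is cached, so only the first call pays that cost. The results depend on the Unicode data bundled with the Python interpreter.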
Conforming implementations must support normalization form "NFC" and may support normalization forms "NFD", "NFKC", "NFKD", and "FULLY-NORMALIZED". They may also support other normalization forms with implementation-defined semantics.
It is implementation-defined which version of Unicode (and therefore, of the normalization algorithms and their underlying data) is supported by the implementation. See [UAX #15] for details of the stability policy regarding changes to the normalization rules in future versions of Unicode. If the input string contains codepoints that are unassigned in the relevant version of Unicode, or for which no normalization rules are defined, the fn:normalize-unicode function leaves such codepoints unchanged. If the implementation supports the requested normalization form then it must be able to handle every input string without raising an error.
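To see how one implementation exposes its Unicode version, the Python sketch below reports the version of the character data bundled with the interpreter and checks that a codepoint without normalization rules passes through unchanged. It illustrates Python's `unicodedata` module only, and it assumes that U+0378 is unassigned in the bundled Unicode version.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print("Unicode data version:", unicodedata.unidata_version)

# Assumption: U+0378 is unassigned in the bundled Unicode version.
sample = "a\u0378b"

# Every supported form should leave the sample exactly as it was.
unchanged = all(
    unicodedata.normalize(form, sample) == sample
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
print("Left unchanged:", unchanged)
```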
Error Conditions
A dynamic error is raised [err:FOCH0003] if the effective value of the $normalizationForm argument is not one of the values supported by the implementation.