fn:tokenize
Returns a sequence of strings constructed by splitting the input wherever a separator is found; the separator is any substring that matches a given regular expression.
Signatures
fn:tokenize($input as xs:string?) as xs:string*
fn:tokenize(
  $input as xs:string?,
  $pattern as xs:string
) as xs:string*
fn:tokenize(
  $input as xs:string?,
  $pattern as xs:string,
  $flags as xs:string
) as xs:string*
Properties
This function is deterministic, context-independent, and focus-independent.
Rules
The one-argument form of this function splits the supplied string at whitespace boundaries. More specifically, calling fn:tokenize($input)
is equivalent to calling fn:tokenize(fn:normalize-space($input), ' ') where the second argument
is a single space character (x20).
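The equivalence above can be illustrated outside XPath. The following is a minimal Python sketch, not part of the specification: the helper name tokenize_ws is hypothetical, None stands in for the empty sequence, and the whitespace set is assumed to be the four characters that fn:normalize-space treats as whitespace.

```python
import re

# Whitespace as treated by fn:normalize-space: space, tab, newline, carriage return.
XML_WS = " \t\n\r"

def tokenize_ws(text):
    """Approximate fn:tokenize($input); None stands in for the empty sequence."""
    if text is None:
        return []
    # Step 1: normalize-space, i.e. trim both ends and collapse runs to one space.
    normalized = re.sub(r"[ \t\n\r]+", " ", text.strip(XML_WS))
    # Step 2: split on a single space; a zero-length string yields no tokens.
    return normalized.split(" ") if normalized else []

print(tokenize_ws(" red green blue "))  # ['red', 'green', 'blue']
```

Because the input is normalized first, leading and trailing whitespace never produces zero-length tokens in this form.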
The effect of calling the two-argument form of this function (omitting the argument
$flags) is the same as the effect of calling the three-argument version with the
$flags argument set to a zero-length string. Flags are defined in
section Flags.
The following rules apply to the three-argument form of the function:
- The $flags argument is interpreted in the same way as for the fn:matches function.
- If $input is the empty sequence, or if $input is the zero-length string, the function returns the empty sequence.
- The function returns a sequence of strings formed by breaking the $input string into a sequence of strings, treating any substring that matches $pattern as a separator. The separators themselves are not returned.
- Except with the one-argument form of the function, if a separator occurs at the start of the $input string, the result sequence will start with a zero-length string. Similarly, zero-length strings will also occur in the result sequence if a separator occurs at the end of the $input string, or if two adjacent substrings match the supplied $pattern.
- If two alternatives within the supplied $pattern both match at the same position in the $input string, then the match that is chosen is the first. For example: fn:tokenize("abracadabra", "(ab)|(a)") returns ("", "r", "c", "d", "r", "").
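The splitting rules listed above can be sketched in Python as an illustration. This is an approximation under stated assumptions: the helper name tokenize is hypothetical, Python's re dialect differs from the XPath regular expression dialect, flags are passed as Python re constants rather than a flags string, and the pattern is assumed never to match a zero-length string.

```python
import re

def tokenize(text, pattern, flags=0):
    """Approximate fn:tokenize($input, $pattern, $flags) using Python's re module.

    None stands in for the empty sequence. The pattern is assumed not to
    match a zero-length string (XPath raises an error in that case).
    """
    if text is None or text == "":
        return []  # empty sequence or zero-length string: no tokens
    tokens, start = [], 0
    # finditer is used instead of re.split so that capture groups in the
    # pattern are not copied into the result.
    for match in re.finditer(pattern, text, flags):
        tokens.append(text[start:match.start()])  # text before the separator
        start = match.end()                       # the separator is dropped
    tokens.append(text[start:])                   # final token, possibly ""
    return tokens

print(tokenize("abracadabra", "(ab)|(a)"))  # ['', 'r', 'c', 'd', 'r', '']
print(tokenize("1,15,,24,50,", ","))        # ['1', '15', '', '24', '50', '']
```

At position 0 both alternatives could match, and the first one ("ab") is taken, which is why the second token is "r" rather than "br".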
Error Conditions
A dynamic error is raised [err:FORX0002] if the value of $pattern is invalid according to the rules described in section Regular expression syntax.
A dynamic error is raised [err:FORX0001] if the value of $flags is invalid according to the rules described in section Flags.
A dynamic error is raised [err:FORX0003] if the supplied $pattern matches a zero-length string, that is, if fn:matches("", $pattern, $flags) returns true.
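The three checks above could be mirrored in a Python analogue roughly as follows. This is a sketch only: check_arguments and FLAG_MAP are hypothetical names, the mapping from flag letters to Python re constants is an assumption, and the "x" and "q" flags are deliberately left unimplemented because their semantics do not map cleanly onto Python's re module.

```python
import re

# Assumed mapping from flag letters to Python re constants (illustrative only).
FLAG_MAP = {"s": re.DOTALL, "m": re.MULTILINE, "i": re.IGNORECASE}

def check_arguments(pattern, flags=""):
    """Return a compiled regex, or raise ValueError naming the analogous error code."""
    py_flags = 0
    for letter in flags:
        if letter in ("x", "q"):
            # Valid flag letters, but not modelled in this sketch.
            raise NotImplementedError("flag " + repr(letter) + " is not modelled here")
        if letter not in FLAG_MAP:
            raise ValueError("FORX0001: invalid flag " + repr(letter))
        py_flags |= FLAG_MAP[letter]
    try:
        regex = re.compile(pattern, py_flags)
    except re.error as exc:
        raise ValueError("FORX0002: invalid pattern: " + str(exc))
    # Analogue of fn:matches("", $pattern, $flags) returning true.
    if regex.search(""):
        raise ValueError("FORX0003: pattern matches a zero-length string")
    return regex
```

For instance, check_arguments(".?") is rejected because ".?" can match nothing at all, whereas check_arguments(",") returns a usable compiled pattern.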
Notes
If the input string is not zero length, and no separators are found in the input string, the result of the function is a single string identical to the input string.
The one-argument form of the function has a similar effect to the two-argument form with \s+ as the separator pattern, except that the one-argument form strips leading and trailing whitespace, whereas the two-argument form delivers an extra zero-length token if leading or trailing whitespace is present.
The function returns no information about the separators that were found
in the string. If this information is required, the fn:analyze-string function
can be used instead.
The separator used by the one-argument form of the function is any sequence of tab (x09), newline (x0A), carriage return (x0D) or space (x20) characters. This is the same as the separator recognized by list-valued attributes as defined in XSD. It is not the same as the separator recognized by list-valued attributes in HTML5, which also treats form-feed (x0C) as whitespace. If it is necessary to treat form-feed as a separator, an explicit separator pattern should be used.
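The difference between the two whitespace definitions can be seen in a small Python analogue. This is an illustration under assumptions, not XPath: the sample string and variable names are invented for the example, and empty tokens are filtered out to mimic the stripping done by the one-argument form.

```python
import re

text = "a\x0cb c"  # "a", form-feed, "b", space, "c"

# XSD-style whitespace only (tab, newline, carriage return, space):
# the form-feed is not a separator and stays inside the first token.
xsd_tokens = [t for t in re.split(r"[ \t\n\r]+", text) if t]

# Explicit separator class that also includes form-feed (x0C).
html_tokens = [t for t in re.split(r"[ \t\n\r\x0c]+", text) if t]

print(xsd_tokens)   # ['a\x0cb', 'c']
print(html_tokens)  # ['a', 'b', 'c']
```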
Examples
The expression fn:tokenize(" red green blue ") returns ("red", "green", "blue").
The expression fn:tokenize("The cat sat on the mat", "\s+") returns ("The", "cat", "sat", "on", "the", "mat").
The expression fn:tokenize(" red green blue ", "\s+") returns ("", "red", "green", "blue", "").
The expression fn:tokenize("1, 15, 24, 50", ",\s*") returns ("1", "15", "24", "50").
The expression fn:tokenize("1,15,,24,50,", ",") returns ("1", "15", "", "24", "50", "").
The expression fn:tokenize("abba", ".?") raises the dynamic error [err:FORX0003].
The expression fn:tokenize("Some unparsed <br> HTML <BR> text", "\s*<br>\s*", "i") returns ("Some unparsed", "HTML", "text").