fn:tokenize
Returns a sequence of strings constructed by splitting the input wherever a separator is found; the separator is any substring that matches a given regular expression.
Signatures
fn:tokenize($input as xs:string?) as xs:string*
fn:tokenize(
$input as xs:string?,
$pattern as xs:string
) as xs:string*
fn:tokenize(
$input as xs:string?,
$pattern as xs:string,
$flags as xs:string
) as xs:string*
Properties
This function is deterministic, context-independent, and focus-independent.
Rules
The one-argument form of this function splits the supplied string at whitespace boundaries. More specifically, calling fn:tokenize($input) is equivalent to calling fn:tokenize(fn:normalize-space($input), ' '), where the second argument is a single space character (x20).
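The equivalence above can be sketched in Python. This is an illustration only, not part of the specification: the helper names normalize_space and tokenize_one_arg are hypothetical, None stands in for the empty sequence, and whitespace is assumed to mean the four characters tab, newline, carriage return and space.

```python
import re

def normalize_space(s: str) -> str:
    # Collapse each run of tab/newline/CR/space to one space,
    # then strip the leading and trailing space (models fn:normalize-space).
    return re.sub(r"[ \t\n\r]+", " ", s).strip(" ")

def tokenize_one_arg(input_str):
    # None stands in for the empty sequence (an assumption of this sketch).
    if input_str is None:
        return []
    normalized = normalize_space(input_str)
    # A zero-length input yields the empty sequence, not a single empty token.
    if normalized == "":
        return []
    # After normalization every separator is exactly one space character.
    return normalized.split(" ")

print(tokenize_one_arg(" red green blue "))
```

Because normalization removes leading and trailing whitespace before splitting, this form never produces zero-length tokens.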
The effect of calling the two-argument form of this function (omitting the argument $flags) is the same as the effect of calling the three-argument version with the $flags argument set to a zero-length string. Flags are defined in section Flags.
The following rules apply to the three-argument form of the function:
- The $flags argument is interpreted in the same way as for the fn:matches function.
- If $input is the empty sequence, or if $input is the zero-length string, the function returns the empty sequence.
- The function returns a sequence of strings formed by breaking the $input string into a sequence of strings, treating any substring that matches $pattern as a separator. The separators themselves are not returned.
- Except with the one-argument form of the function, if a separator occurs at the start of the $input string, the result sequence will start with a zero-length string. Similarly, zero-length strings will also occur in the result sequence if a separator occurs at the end of the $input string, or if two adjacent substrings match the supplied $pattern.
- If two alternatives within the supplied $pattern both match at the same position in the $input string, then the match that is chosen is the first. For example: fn:tokenize("abracadabra", "(ab)|(a)") returns ("", "r", "c", "d", "r", "").
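The splitting rules above can be modelled with a short Python sketch. It is a hedged illustration rather than a conforming implementation: the function name tokenize is hypothetical, None stands in for the empty sequence, flags are omitted (the zero-length flags case), Python's re syntax is assumed to be close enough to the XPath regular expression syntax for these simple patterns, and the pattern is assumed not to match a zero-length string.

```python
import re

def tokenize(input_str, pattern):
    # None stands in for the empty sequence (an assumption of this sketch).
    if input_str is None or input_str == "":
        return []
    tokens = []
    start = 0
    # Walk the separators left to right; the text between them forms the tokens.
    for m in re.finditer(pattern, input_str):
        # A separator at position 0, or directly after another separator,
        # contributes a zero-length token here.
        tokens.append(input_str[start:m.start()])
        start = m.end()
    # Text after the last separator (zero-length if the input ends with one).
    tokens.append(input_str[start:])
    return tokens

print(tokenize("abracadabra", "(ab)|(a)"))
```

The loop uses re.finditer rather than re.split because re.split would also return the text captured by the parenthesized groups, whereas the rules above say that separators are not returned.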
Error Conditions
A dynamic error is raised [ERRFORX0002] if the value of $pattern is invalid according to the rules described in section Regular expression syntax.
A dynamic error is raised [ERRFORX0001] if the value of $flags is invalid according to the rules described in section Flags.
A dynamic error is raised [ERRFORX0003] if the supplied $pattern matches a zero-length string, that is, if fn:matches("", $pattern, $flags) returns true.
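The pattern checks can be approximated in Python as follows. This is a sketch under stated assumptions: check_pattern is a hypothetical helper, ValueError stands in for the dynamic errors, and validity is judged by Python's re module, whose syntax is not identical to the XPath regular expression syntax.

```python
import re

def check_pattern(pattern, flags=0):
    # Reject patterns that the regex engine cannot compile (models FORX0002).
    try:
        compiled = re.compile(pattern, flags)
    except re.error as exc:
        raise ValueError("FORX0002: invalid pattern") from exc
    # Reject patterns that can match the zero-length string (models FORX0003);
    # searching the empty string mirrors the fn:matches("", $pattern, $flags) test.
    if compiled.search("") is not None:
        raise ValueError("FORX0003: pattern matches a zero-length string")
    return compiled

for candidate in [",", ".?", "("]:
    try:
        check_pattern(candidate)
        print(candidate, "accepted")
    except ValueError as err:
        print(candidate, "rejected:", err)
```

Rejecting zero-length matches up front keeps the splitting step simple, since every separator then consumes at least one character.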
Notes
If the input string is not zero length, and no separators are found in the input string, the result of the function is a single string identical to the input string.
The one-argument form of the function has a similar effect to the two-argument form with \s+ as the separator pattern, except that the one-argument form strips leading and trailing whitespace, whereas the two-argument form delivers an extra zero-length token if leading or trailing whitespace is present.
The function returns no information about the separators that were found in the string. If this information is required, the fn:analyze-string function can be used instead.
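To illustrate what retaining the separators looks like, here is a minimal Python sketch. It is only an analogy: split_with_separators is a hypothetical helper, and its list of (kind, text) pairs is not the result format of fn:analyze-string.

```python
import re

def split_with_separators(input_str, pattern):
    # Return (kind, text) pairs, keeping both the separators ("match")
    # and the text between them ("non-match"), in document order.
    parts = []
    start = 0
    for m in re.finditer(pattern, input_str):
        if m.start() > start:
            parts.append(("non-match", input_str[start:m.start()]))
        parts.append(("match", m.group(0)))
        start = m.end()
    if start < len(input_str):
        parts.append(("non-match", input_str[start:]))
    return parts

print(split_with_separators("1, 15,24", r",\s*"))
```

Unlike the tokenized result, this output shows that the first separator included a trailing space and the second did not.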
The separator used by the one-argument form of the function is any sequence of tab (x09), newline (x0A), carriage return (x0D) or space (x20) characters. This is the same as the separator recognized by list-valued attributes as defined in XSD. It is not the same as the separator recognized by list-valued attributes in HTML5, which also treats form-feed (x0C) as whitespace. If it is necessary to treat form-feed as a separator, an explicit separator pattern should be used.
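The difference between the two whitespace definitions can be demonstrated with a small Python sketch. The names XSD_WHITESPACE, HTML5_WHITESPACE and split_list are hypothetical, and the character classes are written in Python's re syntax, not in the XPath regular expression syntax.

```python
import re

# Separator used by the one-argument form: tab, newline, carriage return, space.
XSD_WHITESPACE = r"[ \t\n\r]+"
# The same set with form-feed (x0C) added, as in HTML5 list-valued attributes.
HTML5_WHITESPACE = r"[ \t\n\r\f]+"

def split_list(value, separator_pattern):
    # Split on the given separator and drop the zero-length tokens that
    # leading or trailing separators would otherwise produce.
    return [t for t in re.split(separator_pattern, value) if t != ""]

value = "a\fb c"
print(split_list(value, XSD_WHITESPACE))    # form-feed stays inside the first token
print(split_list(value, HTML5_WHITESPACE))  # form-feed acts as a separator
```

With the first pattern the form-feed remains part of a token, which is the behavior described above for the one-argument form.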
Examples
The expression fn:tokenize(" red green blue ") returns ("red", "green", "blue").
The expression fn:tokenize("The cat sat on the mat", "\s+") returns ("The", "cat", "sat", "on", "the", "mat").
The expression fn:tokenize(" red green blue ", "\s+") returns ("", "red", "green", "blue", "").
The expression fn:tokenize("1, 15, 24, 50", ",\s*") returns ("1", "15", "24", "50").
The expression fn:tokenize("1,15,,24,50,", ",") returns ("1", "15", "", "24", "50", "").
fn:tokenize("abba", ".?") raises the dynamic error [ERRFORX0003].
The expression fn:tokenize("Some unparsed <br> HTML <BR> text", "\s*<br>\s*", "i") returns ("Some unparsed", "HTML", "text").