fn:analyze-string
Analyzes a string using a regular expression, returning an XML structure that identifies which parts of the input string matched or failed to match the regular expression, and in the case of matched substrings, which substrings matched each capturing group in the regular expression.
Signatures
fn:analyze-string(
$input as xs:string?,
$pattern as xs:string
) as element(fn:analyze-string-result)
fn:analyze-string(
$input as xs:string?,
$pattern as xs:string,
$flags as xs:string
) as element(fn:analyze-string-result)
Properties
This function is nondeterministic, context-independent, and focus-independent.
Rules
The effect of calling the first version of this function (omitting the argument $flags) is the same as the effect of calling the second version with the $flags argument set to a zero-length string. Flags are defined in section Flags.
The $flags argument is interpreted in the same way as for the fn:matches function.
If $input is the empty sequence, the function behaves as if $input were the zero-length string. In this situation the result will be an element node with no children.
The function returns an element node whose local name is analyze-string-result. This element and all its descendant elements have the namespace URI http://www.w3.org/2005/xpath-functions. The namespace prefix is implementation-dependent. The children of this element are a sequence of fn:match and fn:non-match elements. This sequence is formed by breaking the $input string into a sequence of strings, returning any substring that matches $pattern as the content of a match element, and any intervening substring as the content of a non-match element.
More specifically, the function starts at the beginning of the input string and attempts to find the first substring that matches the regular expression. If there are several matches, the first match is defined to be the one whose starting position comes first in the string. If several alternatives within the regular expression match at the same position in the input string, then the match that is chosen is the first alternative that matches. For example, if the input string is The quick brown fox jumps and the regular expression is jump|jumps, then the match that is chosen is jump.
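The first-alternative rule can be illustrated with a minimal sketch. It uses Python's re module as a stand-in, which is an assumption: Python's regular expression dialect is not the XPath dialect, but it also tries alternatives from left to right, so it happens to behave the same way for this pattern.

```python
import re

text = "The quick brown fox jumps"

# Alternatives are tried left to right, so "jump" wins even though
# "jumps" would also match at the same starting position.
first = re.search(r"jump|jumps", text)
print(first.group())   # jump

# Listing the longer alternative first changes which branch is chosen.
longer = re.search(r"jumps|jump", text)
print(longer.group())  # jumps
```

In the first case the trailing s is left over and would end up in a non-match element.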
Having found the first match, the function proceeds to find the second and subsequent matches by repeating the search, starting at the first character that was not included in the previous match.
The input string is thus partitioned into a sequence of substrings, some of which match the regular expression, others of which do not match it. Each substring will contain at least one character. This sequence is represented in the result by the sequence of fn:match and fn:non-match children of the returned element node; the string value of the fn:match or fn:non-match element will be the corresponding substring of $input, and the string value of the returned element node will therefore be the same as $input.
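The partitioning described above can be sketched as follows. This is an illustration under stated assumptions, not an implementation of the function: the helper name partition is hypothetical, Python's re module stands in for the XPath regular expression dialect, and the pattern is assumed not to match a zero-length string.

```python
import re

def partition(input_text, pattern):
    """Split input_text into ("match" | "non-match", substring) pairs.

    Assumes the pattern never matches a zero-length string.
    """
    parts = []
    pos = 0  # first character not yet assigned to any substring
    for m in re.finditer(pattern, input_text):
        if m.start() > pos:
            # Text between the previous match and this one.
            parts.append(("non-match", input_text[pos:m.start()]))
        parts.append(("match", m.group()))
        pos = m.end()
    if pos < len(input_text):
        # Trailing text after the last match.
        parts.append(("non-match", input_text[pos:]))
    return parts

parts = partition("The cat sat on the mat.", r"\w+")
for kind, substring in parts:
    print(kind, repr(substring))
```

Concatenating the substrings in order gives back the original input, and an empty input yields an empty list, which corresponds to a result element with no children.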
The content of an fn:non-match element is always a single text node.
The content of an fn:match element, however, is in general a sequence of text nodes and fn:group element children. An fn:group element with a nr attribute having the integer value N identifies the substring captured by the Nth parenthesized sub-expression in the regular expression. For each capturing subexpression there will be at most one corresponding fn:group element in each fn:match element in the result.
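To make the shape of an fn:match element concrete, the following sketch builds one from a Python re match object. Several things here are assumptions made for illustration: the helper name match_element is hypothetical, the capturing groups are assumed not to be nested, groups that did not participate in the match are simply omitted, and the namespace is left out for brevity.

```python
import re
import xml.etree.ElementTree as ET

def match_element(m):
    """Build a <match> element for one re.Match (non-nested groups only)."""
    elem = ET.Element("match")
    pos = m.start()
    last = None  # most recently added <group> child, if any
    for nr in range(1, m.re.groups + 1):
        if m.start(nr) == -1:
            continue  # group did not participate: emit no <group> element
        between = m.string[pos:m.start(nr)]
        if last is None:
            elem.text = (elem.text or "") + between
        else:
            last.tail = (last.tail or "") + between
        last = ET.SubElement(elem, "group", nr=str(nr))
        last.text = m.group(nr)
        pos = m.end(nr)
    # Text after the last group, up to the end of the match.
    rest = m.string[pos:m.end()]
    if last is None:
        elem.text = (elem.text or "") + rest
    else:
        last.tail = (last.tail or "") + rest
    return elem

m = re.fullmatch(r"(\d+)-(\d+)-(\d+)", "2008-12-03")
serialized = ET.tostring(match_element(m), encoding="unicode")
print(serialized)
```

The hyphens between the groups become text nodes that sit alongside the group elements, which is why the content of a match element is mixed.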
If the function is called twice with the same arguments, it is implementation-dependent whether the two calls return the same element node or distinct (but deep equal) element nodes. In this respect it is non-deterministic with respect to node identity.
The base URI of the element nodes in the result is implementation-dependent.
A schema is defined for the structure of the returned element: see Schema for the result of fn:analyze-string.
The result of the function will always be such that validation against this schema would succeed. However, it is implementation-defined whether the result is typed or untyped, that is, whether the elements and attributes in the returned tree have type annotations that reflect the result of validating against this schema.
Error Conditions
A dynamic error is raised [err:FORX0002] if the value of $pattern is invalid according to the rules described in section Regular expression syntax.
A dynamic error is raised [err:FORX0001] if the value of $flags is invalid according to the rules described in section Flags.
A dynamic error is raised [err:FORX0003] if the supplied $pattern matches a zero-length string, that is, if fn:matches("", $pattern, $flags) returns true.
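The zero-length test in the last rule can be approximated with a short sketch. The helper name matches_zero_length is hypothetical, and Python's re module and its integer flags stand in for the XPath regular expression dialect and the $flags string, so this only illustrates the idea of the check.

```python
import re

def matches_zero_length(pattern, flags=0):
    """Rough analogue of fn:matches("", $pattern, $flags) using Python's re."""
    return re.search(pattern, "", flags) is not None

print(matches_zero_length(r"\w*"))  # True: such a pattern would be rejected
print(matches_zero_length(r"\w+"))  # False: such a pattern is acceptable
```

A pattern such as a| also matches the zero-length string because of its empty alternative, so it falls under the same error condition.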
Notes
It is recommended that a processor that implements schema awareness should return typed nodes. The concept of "schema awareness", however, is a matter for host languages to define and is outside the scope of the function library specification.
The declarations and definitions in the schema are not automatically available in the static context of the fn:analyze-string call (or of any other expression). The contents of the static context are host-language defined, and in some host languages are implementation-defined.
The schema defines the outermost element, analyze-string-result, in such a way that mixed content is permitted. In fact the element will only have element nodes (match and non-match) as its children, never text nodes. Although this might have originally been an oversight, defining the analyze-string-result element with mixed="true" allows it to be atomized, which is potentially useful (the atomized value will be the original input string), and the capability has therefore been retained for compatibility with the 3.0 version of this specification.
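The point about atomization can be illustrated by taking the string value of a result tree. The sketch below parses a hand-written result for the hypothetical input The cat. and pattern \w+ using Python's xml.etree.ElementTree; joining the text nodes is used here as an approximation of atomizing an untyped element.

```python
import xml.etree.ElementTree as ET

# Hand-written result, serialized without whitespace between the elements.
serialized = (
    '<analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">'
    '<match>The</match><non-match> </non-match><match>cat</match>'
    '<non-match>.</non-match>'
    '</analyze-string-result>'
)
root = ET.fromstring(serialized)

# The string value is the concatenation of all descendant text nodes.
string_value = "".join(root.itertext())
print(string_value)  # The cat.

# The outermost element has only element children, no text of its own.
child_names = [child.tag.split("}")[1] for child in root]
print(child_names)   # ['match', 'non-match', 'match', 'non-match']
```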
Examples
In the following examples, the result document is shown in serialized form, with whitespace between the element nodes. This whitespace is not actually present in the result.
The expression fn:analyze-string("The cat sat on the mat.", "\w+") returns (with whitespace added for legibility):
<analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">
<match>The</match>
<non-match> </non-match>
<match>cat</match>
<non-match> </non-match>
<match>sat</match>
<non-match> </non-match>
<match>on</match>
<non-match> </non-match>
<match>the</match>
<non-match> </non-match>
<match>mat</match>
<non-match>.</non-match>
</analyze-string-result>
The expression fn:analyze-string("2008-12-03", "^(\d+)\-(\d+)\-(\d+)$") returns (with whitespace added for legibility):
<analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">
<match><group nr="1">2008</group>-<group nr="2">12</group>-<group nr="3">03</group></match>
</analyze-string-result>
The expression fn:analyze-string("A1,C15,,D24, X50,", "([A-Z])([0-9]+)") returns (with whitespace added for legibility):
<analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">
<match><group nr="1">A</group><group nr="2">1</group></match>
<non-match>,</non-match>
<match><group nr="1">C</group><group nr="2">15</group></match>
<non-match>,,</non-match>
<match><group nr="1">D</group><group nr="2">24</group></match>
<non-match>, </non-match>
<match><group nr="1">X</group><group nr="2">50</group></match>
<non-match>,</non-match>
</analyze-string-result>