[xquery-talk] Count a specific word in a document
Michael Strasser
M.Strasser at gpo.com
Thu Jun 14 08:22:31 PDT 2007
Michael
Thanks for your thorough response and the warning about text(). I spared
everyone the source document because it is not very good (and obviously
it is long). I started with
http://www.simonandkevin.com/ElijahLibretto.htm and converted its source
from MS-Worded HTML to XHTML using a text editor. Its markup is visual,
not structural. (My next XQuery project might be to convert its markup
to a structural one.)
An excerpt is:
<td>
  <p>
    <i>Elijah</i>
    <br/>
    Draw near, all ye people, come to me . . .
  </p>
  <p>
    Lord God of Abraham, Isaac and Israel, this day let
    it be known that Thou art God, and that I am Thy
    servant! Lord God of Abraham! Oh show to all this
    people that I have done these things according to
    Thy word.
    Oh hear me, Lord, and answer me!
    Lord God of Abraham, Isaac and Israel, oh hear me
    and answer me, and show this people that Thou art
    Lord God. And let their hearts again be turned!
  </p>
</td>
So you see that $elijah//td/p[i = 'Elijah'] will not capture all that
Elijah sings: only the first paragraph in a cell carries the <i> label.
In fact, I ended up using this to capture all paragraphs of his sung
text:
let $td := doc("/db/mjs/ElijahLibretto.xhtml")/html//td[p/i = 'Elijah']
let $elijah-para := $td/p[i = 'Elijah' or i = 'Both' or count(i) = 0]
(<i>'Both'</i> marks his lines of duet with the Widow.)
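For readers who want to see the two bindings above used in a complete query, here is a minimal sketch of one way they might be finished off into a count. The path and predicates are taken from the message. The $words binding and the final count() are assumptions about how the pieces fit together, not necessarily the query that was actually run. It also assumes the elements are in no namespace, as the original path implies.

```xquery
(: Sketch only. $td and $elijah-para are as given in the message;
   $words and the return clause are illustrative assumptions. :)
let $td := doc("/db/mjs/ElijahLibretto.xhtml")/html//td[p/i = 'Elijah']
let $elijah-para := $td/p[i = 'Elijah' or i = 'Both' or count(i) = 0]
let $words :=
  for $p in $elijah-para
  (: string($p) takes the full string value of the paragraph,
     rather than selecting individual text() nodes :)
  return tokenize(string($p), "\W+")
return count($words[. = 'Lord'])
```

The comparison is case-sensitive, so 'LORD' or 'lord' would not be counted.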
Thanks also for fixing up my use of tokenize(). I don't like using
something I don't understand (especially in a public forum). The results
were different using "\W+": I got 37 occurrences of 'Lord' instead of 36
(Jonathan Robie's regexp didn't tokenise 'Lord?' correctly).
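A small sketch of why "\W+" handles a word followed by punctuation. The input string is made up for illustration and is not taken from the libretto source.

```xquery
(: Illustrative input only. :)
let $line := "Where is the Lord? Lord God of Abraham!"
(: "\W+" treats every run of non-word characters as a separator,
   so "Lord?" yields the token "Lord" :)
let $tokens := tokenize($line, "\W+")
return (
  count($tokens[. = 'Lord']),
  (: a separator at the end of the input produces a trailing
     zero-length token, which is filtered out here :)
  $tokens[. ne '']
)
```

Because the input ends with a separator ("!"), tokenize() returns a zero-length string as the last token. That does not affect a count of 'Lord', but it matters if the tokens are used to count all words.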
Is there a web repository of XQuery questions and answers like Dave
Pawson's very useful Q&A for XSLT?
Michael Strasser
(P.S. Why did I choose this strange exercise? Last year I sang the part
of Elijah and wondered how often he uttered the word 'Lord'. Merely
going through the score and counting is not geeky enough!)