[xquery-talk] [SEARCHING XML FOR DATA] Total newbie question...

Daniel E. Renfer Duck at Kronkltd.net
Tue May 23 18:11:51 PDT 2006


This probably won't work for your project/exam, but you might want to check
out Orbeon's XQuery The Web Demo:

http://www.orbeon.com/ops/goto-example/xquery-the-web

This does pretty much what you're looking for using the Orbeon Presentation
Server, but as it does *exactly* what you're looking for, I highly doubt
your instructor will accept it.  :)

Still, it might help you fine-tune your query.

Daniel E. Renfer (http://kronkltd.net/)

On 5/23/06, Petri Alessandro <alessandro.petri at telecomitalia.it> wrote:
>
>
> Thank you for the thorough explanation; I'll try to implement it tomorrow :)
>
> This is going to save me enough time to finish the project before the exam
> deadline. Thank you again.
> A.
>
>
> -----Original Message-----
> From: talk-bounces at xquery.com on behalf of Jeff Dexter
> Sent: Mon 5/22/2006 4:38 PM
> To: Petri Alessandro; talk at xquery.com
> Subject: RE: [xquery-talk] [SEARCHING XML FOR DATA] Total newbie
> question...
>
> Petri,
>
>
>
> Unfortunately my mailer garbled your original XML with its own
> HTML, so I had to just deal with the original HTML from your link, but
> hopefully this helps. Also note that I'm using some conventions from
> TigerLogic to deal with the HTML and how it's modeled in XQuery so you'll
> need to change these for your setup.
>
>
>
> The basic problem is that you're searching the space of table elements in
> the document, of which there are many, for one table without too many
> identifying characteristics. Some tables in the document are well
> identified using id attributes, but, alas, not the one for which you are
> searching, which means we need to identify it in some other manner. I've
> used the first table header as an identifying characteristic for the
> table, as follows.
>
>
>
> declare default element namespace 'http://www.w3.org/1999/xhtml';
>
> doc( 'http://it.finance.yahoo.com/q/cp?s=%5EMIB30', 'text/html'
> )//table[ (tr/td)[1] eq 'Codice' ]
>
>
>
> Note that I've restricted the comparison to the first cell in the table. If
> I hadn't, I would get an error (the eq operator accepts at most one item
> on either side), and the restriction also ensures I'm searching a minimal
> amount of the document for my table. Also note that I've surrounded tr/td
> in parentheses. (tr/td)[1] means the first td element out of the entire
> sequence selected by the tr/td path of this table. If I had simply written
> tr/td[1], it would mean the first td element in each tr element in the
> table, which is something quite different and would have led to an error,
> since there are multiple such items and the eq operator is designed to
> handle only one.
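>
> To see the difference between the two paths in isolation, here is a
> minimal sketch. It assumes a small inline table in no namespace standing
> in for the real page (so the default element namespace declaration is not
> needed), and the variable name $table is only for illustration.
>
> ```xquery
> (: Hypothetical stand-in for the real table. :)
> let $table :=
>   <table>
>     <tr><td>Codice</td><td>Nome</td></tr>
>     <tr><td>AL.MDD</td><td>ALLEANZA ASS</td></tr>
>   </table>
> return (
>   count($table/tr/td[1]),        (: first td of each tr: 2 items :)
>   count(($table/tr/td)[1]),      (: first td overall: 1 item :)
>   ($table/tr/td)[1] eq 'Codice'  (: single item on the left, so eq is safe :)
> )
> ```
>
> This should return 2, 1 and true. Replacing the last line with
> $table/tr/td[1] eq 'Codice' should raise a type error, because the left
> operand is then a sequence of two items.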
>
>
>
> Since you don't want the table but rather the contents of the table, you
> can iterate over these as follows.
>
>
>
> declare default element namespace
> 'http://www.w3.org/1999/xhtml';
>
> for $i in doc( 'http://it.finance.yahoo.com/q/cp?s=%5EMIB30', 'text/html'
> )//table[ (tr/td)[1] eq 'Codice' ]/tr
>
> return
>
> $i
>
>
>
> You can then construct the result you want by returning something other
> than $i, as in:
>
>
>
> declare default element namespace
> 'http://www.w3.org/1999/xhtml';
>
> for $i in doc( 'http://it.finance.yahoo.com/q/cp?s=%5EMIB30', 'text/html'
> )//table[ (tr/td)[1] eq 'Codice' ]/tr
>
> return
>
> { data($i/td[ 1 ]) } .
>
>
>
>
>
> Note here the use of the fn:data( ) function. The key is that each table
> cell can contain varying amounts of markup intended to format the text
> therein. Using fn:data( ) ensures that all of the text is extracted from
> the table cell but none of the markup. Each of the columns you want in
> your final result can be extracted using the same code as above, with a
> different index in the predicate to select the column at that index
> (e.g. $i/td[ 1 ], $i/td[ 2 ], etc.).
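>
> As an illustration of building one result element per row with
> fn:data( ), here is a minimal sketch. The inline table is a stand-in for
> the real page, and the element names stock, code, name and price are
> hypothetical, chosen only for this example; substitute whatever names
> your target format uses.
>
> ```xquery
> (: Hypothetical stand-in table; the first data cell carries nested
>    formatting markup, as cells on the real page do. :)
> let $table :=
>   <table>
>     <tr><td>Codice</td><td>Nome</td><td>Prezzo</td></tr>
>     <tr>
>       <td><b><a href="/q?s=AL.MDD">AL.MDD</a></b></td>
>       <td>ALLEANZA ASS</td>
>       <td><b>9,4900</b></td>
>     </tr>
>   </table>
> for $i in $table/tr
> return
>   (: data() yields the text of each cell without the b and a markup :)
>   <stock>
>     <code>{ data($i/td[1]) }</code>
>     <name>{ data($i/td[2]) }</name>
>     <price>{ data($i/td[3]) }</price>
>   </stock>
> ```
>
> Note that this sketch still emits a stock element for the header row.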
>
>
>
> Other things you may want to do to improve this query:
>
>
>
> - Use a positional variable on the ForExpr to eliminate the first row.
> Even though it's not marked up using th, it's just the table header, so
> you probably want to eliminate it from the results.
>
> - Clean up the strings returned from each table cell. In many cases
> they're fully formatted with currency indicators, percentages, etc. Your
> XML would be better served by marking these up as complex elements with
> decimal types and attributes defining the properties of this content.
> Being able to search on price could be important, but you can't do it if
> the Euro symbol is carried along with the price.
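>
> Both suggestions can be sketched together as follows. This is only one
> possible approach, under stated assumptions: the inline table stands in
> for the real page, the stock and price element names and the currency
> attribute are hypothetical, and the cleanup handles unsigned prices
> written with a decimal comma (a signed or percentage column would need
> its own rule).
>
> ```xquery
> (: Hypothetical stand-in table; &#x20AC; is the Euro sign. :)
> let $table :=
>   <table>
>     <tr><td>Nome</td><td>Prezzo</td></tr>
>     <tr><td>ALLEANZA ASS</td><td>9,4900 &#x20AC;</td></tr>
>     <tr><td>UNICREDITO ITALIANO</td><td>6,0650 &#x20AC;</td></tr>
>   </table>
> for $i at $pos in $table/tr
> let $raw := string($i/td[2])
> (: keep only digits and the comma, then turn the decimal comma
>    into a decimal point :)
> let $clean := translate(replace($raw, '[^0-9,]', ''), ',', '.')
> where $pos gt 1   (: the positional variable skips the header row :)
> return
>   <stock name="{ data($i/td[1]) }">
>     <price currency="EUR">{ xs:decimal($clean) }</price>
>   </stock>
> ```
>
> The let clauses come before the where clause because XQuery 1.0 does not
> allow a let after a where. The xs:decimal cast appears only in the return
> clause, so it is never applied to the header row.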
>
>
>
> Hope this helps.
>
>
>
> Jeff Dexter.
>
> Chief Architect, TigerLogic
>
> www.rainingdata.com
>
>
>
> _____
>
> From: talk-bounces at xquery.com [mailto:talk-bounces at xquery.com] On Behalf
> Of
> Petri Alessandro
> Sent: Saturday, May 20, 2006 7:23 AM
> To: talk at xquery.com
> Subject: [xquery-talk] [SEARCHING XML FOR DATA] Total newbie question...
>
>
>
>
>
> Hi everyone. I'm doing a project for a university exam and I need advice
> on the XQuery involved.
> I developed an application which parses the HTML taken from a web page and
> translates it into well-formed XML.
> I then query it through the XQEngine Java library. I basically want to
> extract from this URL: http://it.finance.yahoo.com/q/cp?s=%5EMIB30 the data
> from the central table. I'd like the returned XML to be formed more or less
> this way:
>
>
> AL.MDD
> ALLEANZA ASS
> 9,4900
> -0,94%
> 0
>
>
> for each table row. I really need some hints here, as I can perform easy
> queries on the document but can't get to the one I need to extract this
> data.
>
> Thanks in advance to anyone who answers :)
>
> PS: the XML document I got from the transformed HTML is the following
> (sorry if it's big):
>
>
>
> ...cut...
>
>
> width="100%"
> cellpadding="0"
> cellspacing="0"
> border="0"
> class="yfnc_tableout1">
>
>
> width="100%"
> cellpadding="2"
> cellspacing="1"
> border="0">
>
> class="yfnc_tablehead1"
> align="center">Codice
> class="yfnc_tablehead1"
> align="center">Nome
> class="yfnc_tablehead1"
> align="center">Prezzo
> class="yfnc_tablehead1"
> align="center">Variazione
> class="yfnc_tablehead1"
> align="center">Volumi
>
>
> class="yfnc_tabledata1">
>
> href="/q?s=AL.MDD">AL.MDD
>
>
> class="yfnc_tabledata1">
> ALLEANZA ASS
>
> class="yfnc_tabledata1"
> align="center">
> 9,4900 ?
>
> 18 mag
>
> class="yfnc_tabledata1"
> align="center">
> width="10"
> height="14"
> border="0"
> src="http://us.i1.yimg.com/us.yimg.com/i/us/fi/03rd/down_r.gif"
> alt="Down" />
> **
> *style="color:#cc0000;">0,0900
> (0,94%)
> *
> class="yfnc_tabledata1"
> align="right">0
>
>
>
> ...cut...
>
> class="yfnc_tabledata1">
> UNICREDITO ITALIANO
>
> class="yfnc_tabledata1"
> align="center">
> 6,0650 ?
>
> 18 mag
>
> class="yfnc_tabledata1"
> align="center">
> width="10"
> height="14"
> border="0"
> src="http://us.i1.yimg.com/us.yimg.com/i/us/fi/03rd/down_r.gif"
> alt="Down" />
> **
> *style="color:#cc0000;">0,1750
> (2,80%)
> *
> **
> class="yfnc_tabledata1"
> align="right">0
>
>
>
>
>
> ...cut...
>
>
>
>
>
> --------------------------------------------------------------------
> CONFIDENTIALITY NOTICE
> This message and its attachments are addressed solely to the persons
> above and may contain confidential information. If you have received
> the message in error, be informed that any use of the content hereof
> is prohibited. Please return it immediately to the sender and delete
> the message. Should you have any questions, please contact us by
> replying to webmaster at telecomitalia.it.
> Thank you
>
> www.telecomitalia.it
> --------------------------------------------------------------------
>
>
> _______________________________________________
> talk at xquery.com
> http://xquery.com/mailman/listinfo/talk
>
>