Thursday, August 27, 2009

XPath Style XML Selection

The XML API in Scala supports XPath-like queries (although not true XPath). Combined with pattern matching, this makes it very easy to process XML documents. This post covers only XPath-style selection. The code section is long, mainly because the results are often lengthy.
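As a quick orientation before the full session, here is a minimal sketch of the two selection operators, \ and \\. It assumes the scala-xml module is on the classpath (it has shipped separately from the standard library since Scala 2.11). The object name Selectors and the cut-down document are made up for illustration only.

```scala
import scala.xml.Elem

// Hypothetical wrapper object; the names are illustrative, not part of any API.
object Selectors {
  // A cut-down version of the address document used in this post.
  val doc: Elem =
    <address>
      <CI_Address>
        <city><CharacterString>Rome</CharacterString></city>
        <country><CharacterString>Italy</CharacterString></country>
      </CI_Address>
    </address>

  // \ only looks at direct children. city is a grandchild of address,
  // so this selection matches nothing.
  val direct = doc \ "city"

  // Chaining \ walks down one level at a time.
  val stepwise = doc \ "CI_Address" \ "city"

  // \\ searches the descendants at any depth.
  val deep = doc \\ "city"
}
```

The practical difference: \ is the safer choice when the same element name can appear at several depths, while \\ is convenient when the exact path does not matter.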
scala> val address = <address>
     | <CI_Address>
     | <deliveryPoint>
     | <CharacterString>Viale delle Terme di Caracalla
     | </CharacterString>
     | </deliveryPoint>
     | <city>
     | <CharacterString>Rome</CharacterString>
     | </city>
     | <administrativeArea>
     | <CharacterString />
     | </administrativeArea>
     | <postalCode>
     | <CharacterString>00153</CharacterString>
     | </postalCode>
     | <country>
     | <CharacterString>Italy</CharacterString>
     | </country>
     | <electronicMailAddress>
     | <CharacterString>jippe.hoogeveen@fao.org
     | </CharacterString>
     | </electronicMailAddress>
     | </CI_Address>
     | </address>
address: scala.xml.Elem =
<address>
       <CI_Address>
      ...

// create a pretty printer for writing out the document nicely
scala> val pp = new scala.xml.PrettyPrinter(80, 5)
pp: scala.xml.PrettyPrinter = scala.xml.PrettyPrinter@6d87c12a

// select the city by walking down one level at a time
scala> println( pp.formatNodes( address \ "CI_Address" \ "city" ) )
<city>
     <CharacterString>Rome</CharacterString>
</city>

// a second way to select city: search all descendants
scala> println( pp.formatNodes( address \\ "city" ) )
<city>
     <CharacterString>Rome</CharacterString>
</city>

// select all CharacterString elements and join them one per line
// (an element still spans several lines if its text contains a \n)
scala> (address \\ "CharacterString").mkString( "\n" )
res2: String =
<CharacterString>Viale delle Terme di Caracalla
       </CharacterString>
<CharacterString>Rome</CharacterString>
<CharacterString></CharacterString>
<CharacterString>00153</CharacterString>
<CharacterString>Italy</CharacterString>
<CharacterString>jippe.hoogeveen@fao.org
       </CharacterString>

// select the city element and every element nested inside it
scala> println( pp.formatNodes( address \\ "city" \\ "_"))
<city>
     <CharacterString>Rome</CharacterString>
</city><CharacterString>Rome</CharacterString>

// similar to the above, but select each CI_Address element and every element nested inside it
scala> println( pp.formatNodes( address \\ "CI_Address" \\ "_"))
<CI_Address>
     <deliveryPoint>
          <CharacterString>Viale delle Terme di Caracalla </CharacterString>
     </deliveryPoint>
     <city>
          <CharacterString>Rome</CharacterString>
     </city>
     <administrativeArea>
          <CharacterString></CharacterString>
     </administrativeArea>
     <postalCode>
          <CharacterString>00153</CharacterString>
     </postalCode>
     <country>
          <CharacterString>Italy</CharacterString>
     </country>
     <electronicMailAddress>
          <CharacterString>jippe.hoogeveen@fao.org </CharacterString>
     </electronicMailAddress>
</CI_Address><deliveryPoint>
     <CharacterString>Viale delle Terme di Caracalla </CharacterString>
</deliveryPoint><CharacterString>Viale delle Terme di Caracalla </CharacterString><city>
     <CharacterString>Rome</CharacterString>
</city><CharacterString>Rome</CharacterString><administrativeArea>
     <CharacterString></CharacterString>
</administrativeArea><CharacterString></CharacterString><postalCode>
     <CharacterString>00153</CharacterString>
</postalCode><CharacterString>00153</CharacterString><country>
     <CharacterString>Italy</CharacterString>
</country><CharacterString>Italy</CharacterString><electronicMailAddress>
     <CharacterString>jippe.hoogeveen@fao.org </CharacterString>
</electronicMailAddress><CharacterString>jippe.hoogeveen@fao.org </CharacterString>

// print all text in the document (whitespace between elements is included)
scala> address.text
res4: String =


       Viale delle Terme di Caracalla




       Rome





       00153


       Italy


       jippe.hoogeveen@fao.org


// print the text of all CharacterString elements, concatenated
scala> (address \\ "CharacterString").text
res3: String =
Viale delle Terme di Caracalla
       Rome00153Italyjippe.hoogeveen@fao.org


// print the text of each CharacterString element, one per line
scala> (address \\ "CharacterString").map( _.text ).mkString("\n")
res6: String =
Viale delle Terme di Caracalla

Rome
00153
Italy
jippe.hoogeveen@fao.org

// find the CharacterString with the longest text
scala> (address \\ "CharacterString").reduceRight(
     | (elem, longest) => {
     | if( elem.text.length > longest.text.length ) elem
     | else longest
     | })
res8: scala.xml.Node =
<CharacterString>Viale delle Terme di Caracalla
       </CharacterString>

// find the alphabetically last CharacterString (the comparison keeps the greater text)
scala> (address \\ "CharacterString").reduceRight( (elem, last) => {
     | if( elem.text > last.text ) elem
     | else last
     | })
res9: scala.xml.Node =
<CharacterString>jippe.hoogeveen@fao.org
       </CharacterString>
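
Several of the results above carry stray newlines and an empty value, because the text of an element includes whatever whitespace sits inside it. The following is a hedged sketch of one way to clean that up, and of expressing the last two queries with the standard collection methods maxBy, max and min instead of a hand-written reduceRight. The object name CharacterStrings and the shortened document are assumptions made for illustration; it also assumes scala-xml is on the classpath.

```scala
import scala.xml.Elem

// Hypothetical wrapper object; the names are illustrative only.
object CharacterStrings {
  // A shortened version of the address document from the session above.
  val address: Elem =
    <address>
      <deliveryPoint>
        <CharacterString>Viale delle Terme di Caracalla
        </CharacterString>
      </deliveryPoint>
      <city>
        <CharacterString>Rome</CharacterString>
      </city>
      <administrativeArea>
        <CharacterString />
      </administrativeArea>
      <electronicMailAddress>
        <CharacterString>jippe.hoogeveen@fao.org
        </CharacterString>
      </electronicMailAddress>
    </address>

  // Trim the surrounding whitespace of each value and drop the empty ones.
  val values: Seq[String] =
    (address \\ "CharacterString").map(_.text.trim).filter(_.nonEmpty)

  // Longest value by character count.
  val longest: String = values.maxBy(_.length)

  // Alphabetically last and first values, using the default String ordering
  // (which compares character codes, so upper case sorts before lower case).
  val last: String = values.max
  val first: String = values.min
}
```

Note that this version works on the trimmed strings rather than on the nodes themselves, so it returns text instead of scala.xml.Node values.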
