For some reason I can never remember the proper XPath for getting all the descendant nodes (both element and text nodes). I figure if I post it on my blog, I can just look it up whenever I forget (or maybe writing it down will force it permanently into my brain). Here's the XPath expression:
//*|//text()
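Pretty simple, huh? At first, I thought it was //*|text(), but that doesn't actually work. Neither does //text()|*. The catch is that the half without the leading // is relative to the context node: //*|text() returns every element but only the context node's own text children, while //text()|* returns every text node but only the context node's child elements. So those two XPath expressions aren't even equivalent to each other -- they actually give you different results.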
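To make the difference concrete, here's a minimal sketch that runs all three expressions against a tiny made-up document (the <a>/<b>/<c>/<d> tags and the describe() helper are just assumptions for illustration, not part of the example further down). It assumes Python 3 with lxml installed, and evaluates each expression with the root element as the context node:

```python
from lxml import etree

# Tiny hypothetical document: 'x' is a text child of <a>,
# 'y' lives inside <b>, and 'z' is nested two levels down inside <d>.
root = etree.fromstring("<a>x<b>y</b><c><d>z</d></c></a>")

def describe(nodes):
    """Render elements as '<tag>' and text nodes as plain strings."""
    return [n if isinstance(n, str) else "<%s>" % n.tag for n in nodes]

# Every element and every text node in the document.
all_nodes = describe(root.xpath("//*|//text()"))

# Every element, but only the text children of the context node (<a>).
elems_plus_own_text = describe(root.xpath("//*|text()"))

# Every text node, but only the element children of the context node (<a>).
text_plus_own_children = describe(root.xpath("//text()|*"))

print(all_nodes)               # ['<a>', 'x', '<b>', 'y', '<c>', '<d>', 'z']
print(elems_plus_own_text)     # ['<a>', 'x', '<b>', '<c>', '<d>']
print(text_plus_own_children)  # ['x', '<b>', 'y', '<c>', 'z']
```

Only the first expression picks up both 'y' and <d>; each of the other two silently drops whatever sits deeper than the context node's direct children on its relative side.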
Now for an example! Let's say that you have the following XML:
<html>
  <head>
    <title>Converting from Local Time to UTC</title>
    <link rel="stylesheet" href="../preview.css" type="text/css" />
  </head>
  <body>
    <div id="meta">
      <table>
        <tr>
          <td><b>Title:</b></td>
          <td>Converting from Local Time to UTC</td>
        </tr>
        <tr>
          <td><b>Entry Id:</b></td>
          <td>None</td>
        </tr>
        <tr>
          <td><b>Labels:</b></td>
          <td>python, utc, datetime</td>
        </tr>
      </table>
    </div>
  </body>
</html>
Using Python's lxml module, we can write a short script that prints out all the element tags and non-whitespace strings:
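from lxml import etree
# Parse the sample document, saved as temp.xml
tree = etree.parse('temp.xml')
for node in tree.xpath('//*|//text()'):
    if isinstance(node, str):
        # Text node: skip whitespace-only strings
        if node.strip():
            print(repr(node.strip()))
    else:
        # Element node: print its tag
        print('<%s>' % node.tag)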
Running the above code, we get the following output:
<html>
<head>
<title>
'Converting from Local Time to UTC'
<link>
<body>
<div>
<table>
<tr>
<td>
<b>
'Title:'
<td>
'Converting from Local Time to UTC'
<tr>
<td>
<b>
'Entry Id:'
<td>
'None'
<tr>
<td>
<b>
'Labels:'
<td>
'python, utc, datetime'
As you can see, the XPath expression gives you the element and text nodes in the exact order that they appear in the document.
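One thing to watch out for: because both halves of the expression start with //, it always scans the whole document, even if you call xpath() on a nested element. If you only want the descendants of one particular element, putting a dot in front of each half should do it. Here's a rough sketch, assuming Python 3 with lxml and a cut-down, made-up version of the document above (the describe() helper is again just for illustration):

```python
from lxml import etree

# Cut-down hypothetical document with a nested <div> to use as context.
root = etree.fromstring(
    "<html><head><title>T</title></head>"
    "<body><div id='meta'><b>Title:</b> UTC</div></body></html>"
)
div = root.find(".//div")

def describe(nodes):
    """Render elements as '<tag>' and text nodes as plain strings."""
    return [n if isinstance(n, str) else "<%s>" % n.tag for n in nodes]

# Absolute paths ignore the context element and scan the whole document.
whole = describe(div.xpath("//*|//text()"))

# A leading dot anchors each path at the context element instead.
subtree = describe(div.xpath(".//*|.//text()"))

print(whole)    # ['<html>', '<head>', '<title>', 'T', '<body>', '<div>', '<b>', 'Title:', ' UTC']
print(subtree)  # ['<b>', 'Title:', ' UTC']
```

Note that the dotted version returns only what is inside the <div>, not the <div> element itself.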