Saturday, February 16, 2008

XPath: Getting All the Descendant Nodes

For some reason I can never remember the proper XPath for getting all the descendant nodes (both element and text nodes). I figure if I post it on my blog, I can just look it up whenever I forget (or maybe writing it down will force it permanently into my brain). Here's the XPath expression:

//*|//text()

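Pretty simple, huh? At first, I thought it was //*|text(), but that doesn't actually work. Neither does //text()|*. In both cases, the half without the leading // is relative to the context node, so it only matches that node's immediate children instead of every node in the document. Those two XPath expressions aren't even equivalent -- they actually give you different results: the first one misses text nodes, and the second one misses elements.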
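To see the difference for yourself, here is a minimal sketch that runs all three expressions against a tiny document. The two-element document and the describe() helper are made up for illustration, and I'm assuming the root element <a> is the context node (which is what calling xpath() on the element itself gives you).

```python
from lxml import etree

# Hypothetical two-level document, small enough to check by hand.
root = etree.fromstring('<a>x<b>y</b></a>')

def describe(nodes):
    """Render elements as '<tag>' and text nodes as plain strings."""
    return [str(n) if isinstance(n, str) else '<%s>' % n.tag for n in nodes]

# Evaluate each expression with the root element <a> as the context node.
results = {}
for expr in ('//*|//text()', '//*|text()', '//text()|*'):
    results[expr] = describe(root.xpath(expr))
    print(expr, results[expr])
```

With <a> as the context node, only the first expression should return all four nodes. The second should lose the 'y' nested inside <b>, and the third should lose the <a> element itself.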

Now for an example! Let's say that you have the following XML:

<html>
    <head>
        <title>Converting from Local Time to UTC</title>
        <link rel="stylesheet" href="../preview.css" type="text/css" />
    </head>
    <body>
        <div id="meta">
            <table>
                <tr>
                    <td><b>Title:</b></td>
                    <td>Converting from Local Time to UTC</td>
                </tr>
                <tr>
                    <td><b>Entry Id:</b></td>
                    <td>None</td>
                </tr>
                <tr>
                    <td><b>Labels:</b></td>
                    <td>python, utc, datetime</td>
                </tr>
            </table>
        </div>
    </body>
</html>

Using Python's lxml module, we can write a short script that prints out all the element tags and non-whitespace strings:

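from lxml import etree

# Parse the sample document saved as temp.xml.
tree = etree.parse('temp.xml')

for node in tree.xpath('//*|//text()'):
    if isinstance(node, str):
        # Text node: skip the whitespace-only strings used for indentation.
        if node.strip():
            print(repr(node.strip()))
    else:
        # Element node: print its tag name.
        print('<%s>' % node.tag)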

Running the above code, we get the following output:

<html>
<head>
<title>
'Converting from Local Time to UTC'
<link>
<body>
<div>
<table>
<tr>
<td>
<b>
'Title:'
<td>
'Converting from Local Time to UTC'
<tr>
<td>
<b>
'Entry Id:'
<td>
'None'
<tr>
<td>
<b>
'Labels:'
<td>
'python, utc, datetime'

As you can see, the XPath expression gives you the element and text nodes in the exact order that they appear in the document.
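If you want to double-check the ordering rather than eyeball it, one option is to compare the XPath result against a hand-written traversal. This is only a sketch: the walk() helper and the small test document are my own inventions, and it assumes the document contains no comments or processing instructions.

```python
from lxml import etree

def walk(element):
    """Yield '<tag>' strings and text in document order, depth first."""
    yield '<%s>' % element.tag
    if element.text:
        yield element.text
    for child in element:
        yield from walk(child)
        # Text that follows a child's closing tag is stored as its tail.
        if child.tail:
            yield child.tail

# Hypothetical document with text before, inside, and after a child element.
root = etree.fromstring('<a>x<b>y</b>z<c/></a>')

from_xpath = [str(n) if isinstance(n, str) else '<%s>' % n.tag
              for n in root.xpath('//*|//text()')]
from_walk = list(walk(root))

print(from_xpath == from_walk)
```

If the two lists match, the union really did come back in document order, including the 'z' that lxml stores as the tail of <b>.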
