XPath: Getting All the Descendant Nodes
=======================================
.. labels:: xpath, xml, python, lxml

For some reason I can never remember the proper XPath for getting all the descendant nodes (both element and text nodes). I figure if I post it on my blog, I can just look it up whenever I forget (or maybe writing it down will force it permanently into my brain). Here's the XPath expression::
    
    //*|//text()

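Pretty simple, huh? At first, I thought it was ``//*|text()``, but that doesn't actually work: without the leading ``//``, ``text()`` is relative to the context node, so it only picks up that node's direct text children. ``//text()|*`` fails for the same reason, since the bare ``*`` only matches the context node's child elements. Those two XPath expressions aren't even equivalent -- each one drops a different set of nodes, so they give you different results.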
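
To see the difference concretely, here's a small sketch that runs all three expressions from the root element. The tiny ``<a>`` document and the ``describe`` helper are made up for illustration, and I'm assuming Python 3 with lxml installed:

.. code:: python

    from lxml import etree

    # Hypothetical three-level document, just big enough to show the difference.
    root = etree.fromstring('<a>one<b>two</b><c><d>three</d></c></a>')

    def describe(nodes):
        """Render text nodes as plain strings and elements as '<tag>'."""
        return [str(n) if isinstance(n, str) else '<%s>' % n.tag for n in nodes]

    # Every element and every text node in the document.
    all_nodes = describe(root.xpath('//*|//text()'))
    # Every element, but only the text children of the context node (<a>).
    elems_plus_child_text = describe(root.xpath('//*|text()'))
    # Every text node, but only the element children of the context node (<a>).
    text_plus_child_elems = describe(root.xpath('//text()|*'))

    print(all_nodes)              # ['<a>', 'one', '<b>', 'two', '<c>', '<d>', 'three']
    print(elems_plus_child_text)  # ['<a>', 'one', '<b>', '<c>', '<d>']
    print(text_plus_child_elems)  # ['one', '<b>', 'two', '<c>', 'three']

The second expression loses the nested strings, and the third loses ``<a>`` and ``<d>``, so only the first one really returns everything.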

Now for an example! Let's say that you have the following XML:

.. code:: xml

    <html>
        <head>
            <title>Converting from Local Time to UTC</title>
            <link rel="stylesheet" href="../preview.css" type="text/css" />
        </head>
        <body>
            <div id="meta">
                <table>
                    <tr>
                        <td><b>Title:</b></td>
                        <td>Converting from Local Time to UTC</td>
                    </tr>
                    <tr>
                        <td><b>Entry Id:</b></td>
                        <td>None</td>
                    </tr>
                    <tr>
                        <td><b>Labels:</b></td>
                        <td>python, utc, datetime</td>
                    </tr>
                </table>
            </div>
        </body>
    </html>

.. _lxml: http://codespeak.net/lxml/

Using Python's lxml_ module, we can write a short script that prints out all the element tags and non-whitespace strings:

.. code:: python

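    from lxml import etree

    # Pass the filename directly; lxml opens the file and detects its encoding.
    tree = etree.parse('temp.xml')

    # The union mixes element nodes and text nodes in a single result list.
    for node in tree.xpath('//*|//text()'):
        # Python 3: text nodes come back as str subclasses (basestring in Python 2).
        if isinstance(node, str):
            # Skip strings that are only whitespace (the indentation between tags).
            if node.strip():
                print(repr(node.strip()))
        else:
            # Element node: print just its tag name.
            print('<%s>' % node.tag)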
    
Running the above code, we get the following output::

    <html>
    <head>
    <title>
    'Converting from Local Time to UTC'
    <link>
    <body>
    <div>
    <table>
    <tr>
    <td>
    <b>
    'Title:'
    <td>
    'Converting from Local Time to UTC'
    <tr>
    <td>
    <b>
    'Entry Id:'
    <td>
    'None'
    <tr>
    <td>
    <b>
    'Labels:'
    <td>
    'python, utc, datetime'

As you can see, the XPath expression gives you the element and text nodes in the exact order that they appear in the document.
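
If you only want the descendants of one particular element rather than of the whole document, the same idea should work with relative paths, i.e. ``.//*|.//text()`` evaluated on that element. Here's a minimal sketch, assuming Python 3 and lxml; the trimmed-down table and the ``describe`` helper are made up for illustration:

.. code:: python

    from lxml import etree

    # Hypothetical cut-down version of the table from the example above.
    doc = etree.fromstring(
        '<table>'
        '<tr><td><b>Title:</b></td><td>Converting from Local Time to UTC</td></tr>'
        '<tr><td><b>Entry Id:</b></td><td>None</td></tr>'
        '</table>'
    )

    def describe(nodes):
        """Render text nodes as plain strings and elements as '<tag>'."""
        return [str(n) if isinstance(n, str) else '<%s>' % n.tag for n in nodes]

    second_row = doc.xpath('//tr')[1]

    # The leading dot anchors both halves of the union at second_row.
    scoped = describe(second_row.xpath('.//*|.//text()'))
    # Without the dot, the paths are absolute and cover the whole document,
    # no matter which element the query is run on.
    unscoped = describe(second_row.xpath('//*|//text()'))

    print(scoped)         # ['<td>', '<b>', 'Entry Id:', '<td>', 'None']
    print(len(unscoped))  # 13

Note that the scoped result doesn't include the ``<tr>`` itself, only what's underneath it.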

Feihong Hsu

Saturday, February 16, 2008
