38. PostgreSQL and XML

I'm been feeling mildly guilty about not publishing the work I did with John Gray last year on XPath support in PostgreSQL, but the interest in Jon Udell's 'SQL and XPath' post has goaded me into action: here's a brief outline of the four production-ready XPath functions which represent the most complete XML support of any open source RDBMS.

The four functions extend John's previous work (see contrib/xml directory in any recent Postgres installation) and rely on:


Postgres's support of user-defined functions written in C, Perl and PL/SQL.

The fast, light, C-based libxml library from xmlsoft.org.

As the example below demonstrates, Postgres UDFs can be used in both the SELECT and WHERE clauses of a SQL statement.

CREATE TABLE t_articles (

    article_id int4 NOT NULL,

    article_title varchar(255) NOT NULL,  

    article_xml text

    );

        

INSERT INTO t_articles (article_id, article_title, article_xml)

    VALUES (1, 'Some Beatles', 

    '<beatles id = "b1">

        <beatle instrument = "guitar" alive = "no">john lennon</beatle>

        <beatle instrument = "guitar" alive = "no">george harrison</beatle>

    </beatles>');


INSERT INTO t_articles (article_id, article_title, article_xml)

    VALUES (2, 'Other Beatles', 

    '<beatles id = "b2">

        <beatle instrument = "bass" alive = "yes">paul mccartney</beatle>

        <beatle instrument = "drums" alive = "yes">ringo starr</beatle>

    </beatles>');

 

SELECT 

     article_id, xpath_string(article_xml,'/beatles/@id') AS beatle_id

FROM 

     t_articles 

WHERE 

     xpath_bool(article_xml,'/beatles/beatle[@alive="yes"]');

 

 /*

 --  returns:

 -- 

 --  article_id  |  beatle_id

 -- ------------   -----------

 --  2           |  b2

 */

 

Usage of the two functions shown here - xpath_string and xpath_bool - should be self-explanatory. They each take two arguments: a column identifier and an XPath statement. xpath_string returns a string (the textual content of a node) and xpath_bool returns a boolean value. xpath_number returns a number, which is useful for numerical comparisons or running mathematical functions on the returned value:


WHERE xpath_number(article_xml,'//job_title/@year') > 2001 

The fourth function, xpath_nodeset, is a somewhat kludgey solution to the problem of returning values from repeated nodes in a single record. Here's an example of how it might be used:


SELECT 

    xpath_nodeset(article_xml,'//beatle/@instrument') AS beatle_instrument

FROM 

    t_articles

WHERE 

    article_id = 2;


 /*

 --  returns:

 -- 

 --  beatle_instrument  

 --  -------------------

 --  <instrument>bass</instrument><instrument>drums</instrument>

 */

This problem highlights the potentially awkward marriage between SQL sets and arbitrarily structured data, but I'm sure that better solutions are possible (how does Virtuoso handle repeated nodes?). PostgreSQL 7.3 allows UDFs to return multiple rows, which might help... suggestions are very welcome.


Contact me for the software in its current unlicensed, undocumented state, or watch this space for a properly packaged distribution.


Update: the functions have been updated and are now part of the standard PostgreSQL distribution.


© Tom Dyson 2006

Home | About | Contact | Search | Archives