The four functions extend John's previous work (see contrib/xml directory in any recent Postgres installation) and rely on:
- Postgres's support of user-defined functions written in C, Perl and PL/SQL.
- The fast, light, C-based libxml library from xmlsoft.org.
CREATE TABLE t_articles (
article_id int4 NOT NULL,
article_title varchar(255) NOT NULL,
article_xml text
);
INSERT INTO t_articles (article_id, article_title, article_xml)
VALUES (1, 'Some Beatles',
'<beatles id = "b1">
<beatle instrument = "guitar" alive = "no">john lennon</beatle>
<beatle instrument = "guitar" alive = "no">george harrison</beatle>
</beatles>');
INSERT INTO t_articles (article_id, article_title, article_xml)
VALUES (2, 'Other Beatles',
'<beatles id = "b2">
<beatle instrument = "bass" alive = "yes">paul mccartney</beatle>
<beatle instrument = "drums" alive = "yes">ringo starr</beatle>
</beatles>');
SELECT
article_id, xpath_string(article_xml,'/beatles/@id') AS beatle_id
FROM
t_articles
WHERE
xpath_bool(article_xml,'/beatles/beatle[@alive="yes"]');
/*
-- returns:
--
-- article_id | beatle_id
-- ------------ -----------
-- 2 | b2
*/
Usage of the two functions shown here - xpath_string and xpath_bool - should be self-explanatory. They each take two arguments: a column identifier and an XPath statement. xpath_string returns a string (the textual content of a node) and xpath_bool returns a boolean value. xpath_number returns a number, which is useful for numerical comparisons or running mathematical functions on the returned value:
WHERE xpath_number(article_xml,'//job_title/@year') > 2001
The fourth function, xpath_nodeset, is a somewhat kludgey solution to the problem of returning values from repeated nodes in a single record. Here's an example of how it might be used:
SELECT
xpath_nodeset(article_xml,'//beatle/@instrument') AS beatle_instrument
FROM
t_articles
WHERE
article_id = 2;
/*
-- returns:
--
-- beatle_instrument
-- -------------------
-- <instrument>bass</instrument><instrument>drums</instrument>
*/
This problem highlights the potentially awkward marriage between SQL sets and arbitrarily structured data, but I'm sure that better solutions are possible (how does Virtuoso handle repeated nodes?). PostgreSQL 7.3 allows UDFs to return multiple rows, which might help... suggestions are very welcome.
Contact me for the software in its current unlicensed, undocumented state, or watch this space for a properly packaged distribution.
Update: the functions have been updated and are now part of the standard PostgreSQL distribution.