s c h e m a t i c s : c o o k b o o k

/ Cookbook.ParsingWithLazySxml

This Web


WebHome 
WebChanges 
TOC (with recipes)
NewRecipe 
WebTopicList 
WebStatistics 

Other Webs


Chicken
Cookbook
Erlang
Know
Main
Plugins
Sandbox
Scm
TWiki  

Schematics


Schematics Home
Sourceforge Page
SchemeWiki.org
Original Cookbook
RSS

Scheme Links


Schemers.org
Scheme FAQ
R5RS
SRFIs
Scheme Cross Reference
PLT Scheme SISC
Scheme48 SCM
MIT Scheme scsh
JScheme Kawa
Chicken Guile
Bigloo Tiny
Gambit LispMe
GaucheChez

Lambda the Ultimate
TWiki.org

Parsing Large XML Files with SXML's lazy:xml->sxml and lazy:sxpath

Problem

Some XML files are too large to hold in memory all at once without significant overhead, and the traditional solution --- SAX --- is often a bit too low level for writing a simple solution.

Solution

The sxml PLaneT package provides a lazy pull-style parser that we can use to progressively parse through a large XML file.

(require (planet "sxml.ss" ("lizorkin" "sxml.plt" 1 4)))

(define lazy-doc
  (call-with-input-file "some-big-file.xml"
    (lambda (ip)
      (lazy:xml->sxml ip '()))))

At this point, lazy-doc is a lazy sxml tree, which we can then query using lazy:sxpath.

(require (planet "sxml.ss" ("lizorkin" "sxml.plt" 1 4)))

(let* ([ip   
        (open-input-string "<html><p>hello</p><p>world</p></html>")]
       [lazy-doc (lazy:xml->sxml ip '())])
  (display ((lazy:sxpath (list "html" "p")) lazy-doc))
  (newline))

When we run this, we see that we get an odd stream:

((p hello) #<struct:promise>)
and we can force the second element to access more results. This allows us to do parsing in pull-parsing style.

As a more realistic example, here is a test module to parse some essential details from the Gene Ontology, a standard hierarchy of terms used in bioinformatics:

(module gene-ontology mzscheme
  
  ;; This code shows how one might parse a large xml file progressively
  ;; by taking advantage of the lazy parsing in the sxml module.
  ;;
  ;; Input: an RDF file from the Gene Ontology (http://geneontology.org) that
  ;; conforms to the DTD found at: http://www.geneontology.org/dtd/go.dtd.
  ;;
  ;; Output: a sample parsing of all the terms in the RDF that
  ;; shows accession, name, and definition.
  ;;
  ;; I believe one can always download the latest copy of the Gene Ontology
  ;; by grabbing:
  ;;
  ;; http://archive.godatabase.org/latest-termdb/go_daily-termdb.rdf-xml.gz
  
  (require (planet "sxml.ss" ("lizorkin" "sxml.plt" 1 4))
           (lib "list.ss")
           (lib "file.ss")
           (lib "pretty.ss"))
  
  (provide (all-defined))
  
  
  ;; Our test code will just parse a term and print it out.  We'll call this
  ;; at the end of our module.
  (define (test)
    (call-with-input-file*
        (vector-ref (current-command-line-arguments) 0)
      (lambda (ip)
        (fold-term-elts (lambda (term-elt acc)
                          (pretty-print (term-elt->Term term-elt))
                          (newline))
                        (void)
                        ip))))
  
  
  ;; We'll say that a Term is a
  ;;
  ;;     (make-Term i n d)
  ;;
  ;; where i, n are strings.  d is either a string or void.  This is a
  ;; simplification, since the real Gene Ontology provides a lot of
  ;; other interesting attributes (including hierarchical data).
  (define-struct Term (id name definition) #f)
  
  
  ;; fold-lazy-sxpath-list: (sxml Y -> Y) Y sxpath-sxml-stream -> Y
  ;; Given the particular kind of lazy-list given by sxml's lazy:sxpath's query,
  ;; does a fold across its structure.
  (define (fold-lazy-sxpath-list f acc lazy-list)
    (cond [(empty? lazy-list)
           acc]
          [else 
           (fold-lazy-sxpath-list f 
                                  (f (first lazy-list) acc)
                                  (force (second lazy-list)))]))
  
  
  ;; make-namespaced-tag: string string -> symbol
  ;; Builds up a namespaced tag from the namespace ns and the suffix.
  (define (make-namespaced-tag ns suffix)
    (string->symbol (string-append ns ":" suffix)))
  
  
  ;; go-tag: string -> symbol
  (define (go-tag suffix) 
    (make-namespaced-tag "http://www.geneontology.org/dtds/go.dtd#" suffix))
  
  
  ;; rdf-tag: string -> symbol
  (define (rdf-tag suffix) 
    (make-namespaced-tag "http://www.w3.org/1999/02/22-rdf-syntax-ns#" suffix))
  
  
  ;; term-elt: sxml-fragment -> Term
  ;; Given an sxml fragment element elt, extracts a Term.
  (define (term-elt->Term elt)
    (define name-query (sxpath (list (go-tag "name") "text()")))
    (define defn-query (sxpath (list (go-tag "definition") "text()")))
    (define id-query (sxpath (list (go-tag "accession") "text()")))
    (define (first-or-void a-list)
      (cond
        [(empty? a-list) (void)]
        [else (first a-list)]))
    (make-Term (first (id-query elt))
               (first (name-query elt))
               (first-or-void (defn-query elt))))
  
  
  ;; fold-term-elts: (elt Y) Y input-port -> Y
  ;; Given an input port ip whose contents conform to the gene ontology
  ;; RDF, folds f across every term element we can find in ip, using acc
  ;; as the initial accumulator.
  (define (fold-term-elts f acc ip)
    (define doc (lazy:xml->sxml ip '()))
    (define query (lazy:sxpath (list (go-tag "go") 
                                     (rdf-tag "RDF")
                                     (go-tag "term"))))
    (fold-lazy-sxpath-list f acc (query doc)))
  
  
  
  ;; Finally, fire this test code up:
  (test))

Discussion


Comments about this recipe

Contributors

-- DannyYoo - 08 May 2006

CookbookForm
TopicType: Recipe
ParentTopic: XmlChapter
TopicOrder: 999

 
 
Copyright © 2004 by the contributing authors. All material on the Schematics Cookbook web site is the property of the contributing authors.
The copyright for certain compilations of material taken from this website is held by the SchematicsEditorsGroup - see ContributorAgreement & LGPL.
Other than such compilations, this material can be redistributed and/or modified under the terms of the GNU Lesser General Public License (LGPL), version 2.1, as published by the Free Software Foundation.
Ideas, requests, problems regarding Schematics Cookbook? Send feedback.
/ You are Main.guest