Schematics: Cookbook

Reading and Writing RSS Files

Problem

You want to create an RSS (Rich Site Summary) file, or read one produced by another application. Handling RSS can be difficult because several incompatible specifications all call themselves RSS, the format is generally loose, and content is hard to escape and encode properly. RSS is a case study in how difficult it is to produce valid XML, partly because RSS traditionally includes fragments of HTML, which is marked-up text but not necessarily valid XML.

Solution

Let's stipulate that, regardless of the RSS format, there are only a few things we're actually interested in: we want to come up with a list of items, where each item contains a date, a URL for the item, a description, and, optionally, a title. We'll use David van Horn's script for scraping the word-a-day RSS feed from wordsmith.org. David uses the xml library that ships with PLT, and it seems to work well enough, so we'll go with that.

(require (lib "xml.ss" "xml")
         (lib "match.ss")
         (lib "url.ss" "net")
         (lib "1.ss" "srfi"))

; url is a url as defined by url.ss
(define (get-rss url)
  (xml->xexpr
   ((eliminate-whitespace '(rss channel item) (lambda (x) x))
    (document-element (call/input-url url get-pure-port read-xml)))))

The get-rss function retrieves XML from a URL and converts it into an S-expression. The eliminate-whitespace function returns a function that removes strings containing only whitespace from the elements named in its first argument. This cleans up the S-expression so that the match expression we'll write is simpler: we don't need to account for whitespace, which isn't significant to us anyway.
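
To see what the whitespace cleanup buys us, here is a minimal sketch that runs the same pipeline on an in-memory string instead of a URL. The names sample-feed and string->rss-xexpr are made up for this illustration, and the result shown assumes xml->xexpr is left at its default setting, which keeps empty attribute lists.

```scheme
(require (lib "xml.ss" "xml"))

; sample-feed is a hypothetical, hand-written feed used only for illustration
(define sample-feed
  (string-append
   "<rss version=\"2.0\">\n"
   "  <channel>\n"
   "    <item>\n"
   "      <title>Hello</title>\n"
   "    </item>\n"
   "  </channel>\n"
   "</rss>"))

; Same steps as get-rss, but reading from a string port instead of a URL
(define (string->rss-xexpr str)
  (xml->xexpr
   ((eliminate-whitespace '(rss channel item) (lambda (x) x))
    (document-element (read-xml (open-input-string str))))))

(string->rss-xexpr sample-feed)
; => (rss ((version "2.0")) (channel () (item () (title () "Hello"))))
```

Without the eliminate-whitespace step, the newline-and-indent strings between elements would show up as extra list members, and every match pattern would have to allow for them.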

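; Returns one (link title description) list per <item>; the description
; is the list of the element's contents, since it is bound by a dotted pattern.
(define (rss->item rss)
  ; good-item returns p itself for a non-empty list, #f otherwise
  (let ((good-item (lambda (p) (and (pair? p) p))))
    (filter good-item
            (match rss
              ; RSS 0.91 and 2.0: the items are children of <channel>
              (('rss _ ('channel _ . items))
               (map
                (match-lambda
                  (('item _
                      ('title _ title)
                      ('link _ link)
                      ('description _ . desc) . _)
                   (list link title desc))
                  ; variant: <item> with no attribute list, ending in <body>
                  (('item
                      ('title _ title)
                      ('link _ link)
                      (_ . _)
                      ('body _ . body))
                   (list link title body))
                  ; anything else becomes '() and is dropped by the filter
                  (_ '()))
                items))
              ; RSS 1.0 (RDF): the items are siblings of <channel>
              (('rdf:RDF (_ ...) _ ('channel . _) . items)
               (map
                (match-lambda
                  (('item _
                      ('title _ title)
                      ('link _ link)
                      ('description _ . desc) . _)
                   (list link title desc))
                  (_ '()))
                items))))))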

This expanded version of David's match expression has been wrapped in a function. rss->item handles RSS 2.0 and 0.91 in the first case and RSS 1.0 (RDF) in the second. The matching is nested: the outer match finds the items in the <channel> element, and match-lambda is then applied to each child found by the outer match. Each matching item yields a list of its link, title, and description elements; anything else yields an empty list, so we run SRFI-1's filter over the whole result to keep only the non-empty lists. Our good-item function returns either #f or the match.
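
The map-then-filter idea can be tried in isolation on a hand-written list of channel children. This is only a sketch: children and item->triple are names invented for the example, and the data is made up.

```scheme
(require (lib "match.ss")
         (lib "1.ss" "srfi"))

; Hypothetical channel children: a stray <title>, one complete item,
; and one item that lacks <link> and <description>
(define children
  '((title () "Example feed")
    (item ()
          (title () "First")
          (link () "http://example.org/1")
          (description () "One"))
    (item () (title () "No link"))))

; Same item pattern as in rss->item; anything else falls through to '()
(define item->triple
  (match-lambda
    (('item _
        ('title _ title)
        ('link _ link)
        ('description _ . desc) . _)
     (list link title desc))
    (_ '())))

; pair? is false for '(), so only the successful matches survive
(filter pair? (map item->triple children))
; => (("http://example.org/1" "First" ("One")))
```

Note that the description comes back as a list, because the dotted pattern binds desc to everything after the attribute list.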

This match is a bit fragile, since RSS doesn't dictate the order in which child elements appear under <item>. In practice, however, it works well enough.
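
If element order does become a problem, one possible alternative (not part of David's script) is to look up each child by name rather than by position. The sketch below assumes every element carries an attribute list, even an empty one, as the patterns above do; child-content and item->fields are hypothetical helpers.

```scheme
(require (lib "1.ss" "srfi"))

; Return the contents of the first child element called name, or #f.
; (cddr item) skips the tag and the attribute list; filter pair? drops
; any bare strings so that assq only sees child elements.
(define (child-content name item)
  (let ((child (assq name (filter pair? (cddr item)))))
    (and child (cddr child))))

; Build the same (link title description) shape, whatever the child order
(define (item->fields item)
  (list (child-content 'link item)
        (child-content 'title item)
        (child-content 'description item)))

(item->fields '(item ()
                     (description () "d")
                     (link () "u")
                     (title () "t")))
; => (("u") ("t") ("d"))
```

Each field here is a list of the element's contents, and a missing child shows up as #f, so a caller would still need to decide which fields are mandatory.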

Resources

Feed Validation

RSS 2.0 Specification

RSS 1.0 Specification

Parsing RSS At All Costs

HtmlPrag

-- HectorEGomezMorales - 05 May 2004

-- GordonWeakliem - 06 Aug 2004

Copyright © 2004 by the contributing authors. All material on the Schematics Cookbook web site is the property of the contributing authors.
The copyright for certain compilations of material taken from this website is held by the SchematicsEditorsGroup - see ContributorAgreement & LGPL.
Other than such compilations, this material can be redistributed and/or modified under the terms of the GNU Lesser General Public License (LGPL), version 2.1, as published by the Free Software Foundation.