Given an url, fetch the page, parse it and extract all links.
We will use
string->url and
get-pure-port from
url.ss to fetch the page. For parsing,
html->sxml from the permissive HTML parser library
HtmlPrag is used. Then
sxml->links makes a recursive descent of the parse tree collecting all links it discovers.
(require (lib "htmlprag.ss" "htmlprag")
(lib "url.ss" "net")
(lib "match.ss"))
(define (url->sxml url)
(html->sxml (get-pure-port (string->url url))))
(define (sxml->links sxml)
(let loop ([sxml sxml]
[links '()])
(match sxml
[('a ('@ ((not 'href ) _) ... ('href url) . more) text)
(cons (list url text) links)]
[(item ...)
(append links
(apply append
(map sxml->links sxml)))]
[else
links])))
> (sxml->links (url->sxml "http://www.neilvandyke.org/htmlprag/"))
(("htmlprag.scm" "file htmlprag.scm")
("htmlprag.html" "file htmlprag.html")
("htmlprag.pdf" "file htmlprag.pdf")
("htmlprag-0-7.plt" "file htmlprag-0-7.plt"))
> (url->sxml "http://www.neilvandyke.org/htmlprag/")
(*TOP*
(*PI* xml "version=\"1.0\" encoding=\"UTF-8\"") "\n"
(*DECL* DOCTYPE ...
(html ...
(head ...
(title "Neil W. Van Dyke: HtmlPrag") "\n") "\n"
(body ...
(p (a (@ (href "http://www.neilvandyke.org/")
(title "Home page of Neil W. Van Dyke")
(accesskey "0"))
(b "neilvandyke")
".org"))
...
Get
HtmlPrag from
http://www.neilvandyke.org/htmlprag/.
Installation instructions.
This code handles the general case where
href can occur anywhere in the attribute list:
('a ; a link
('@ ; with a list of attributes
((not 'href ) _) ... ; starting with zero or more non-hrefs
('href url) ; then a href with the url
. more) ; then some more
text) ; and finally the text of displayed
--
JensAxelSoegaard - 19 Apr 2004
--
GordonWeakliem - 20 Apr 2004
As a matter of interest, I once did something similar with SCSH, using its s-exp based regular expressions (SRE). The actual regular expression used to match the links was:
(rx (w/nocase (: (or "href" "file") (* whitespace) "=" (* whitespace) "\""
(submatch (* (~ ("?\"")))) "\"")))
There's a bit more about this, with a more readable version of the above, in
this post.
You could do the same thing in PLT with a string-based regexp. Of course, HTMLPrag is likely to scale much better to more complex requirements.
--
AntonVanStraaten - 19 Apr 2004
One problem with this solution is that it works only for anchor tags that have href as the first attribute.
(define (sxml->links sxml)
(let loop ([sxml sxml]
[links '()])
(match sxml
[('a ('@ ((not 'href ) _) ... ('href url) . more) text) (cons (list url text) links)]
[(item ...) (append links
(apply append
(map sxml->links sxml)))]
[else links])))
FYI, I had to
(require (lib "match.ss")) to get this example to work.
--
GordonWeakliem - 20 Apr 2004
I have changed the solution to use your match expression
and added an explanation of what's going on.
--
JensAxelSoegaard - 24 Apr 2004