You want to select words from a string.
Determine the defining features of a word for your specific application, then write a regular expression that models this idea.
(define words-1
  (regexp "[^ ]+"))
(define words-2
  (regexp "[A-Za-z'-]+"))
> (regexp-match words-1 "'alpha-beta gamma")
("'alpha-beta")
> (regexp-match words-2 "'alpha-beta&or gamma")
("'alpha-beta")
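Note that regexp-match returns only the first match. To collect every word in a string, one option is regexp-match*. The following is a minimal sketch, assuming a current Racket where regexp-match* is available in racket/base; the helper name all-words is hypothetical and not part of the recipe.

```scheme
#lang racket/base
;; Assumption: a current Racket; regexp-match* returns all matches.
(define words-1 (regexp "[^ ]+"))
(define words-2 (regexp "[A-Za-z'-]+"))

;; all-words is a hypothetical helper introduced for illustration.
(define (all-words rx str)
  (regexp-match* rx str))

(all-words words-1 "'alpha-beta gamma")    ; => ("'alpha-beta" "gamma")
(all-words words-2 "'alpha-beta&or gamma") ; => ("'alpha-beta" "or" "gamma")
```

With words-2, the "&" belongs to no word, so "or" is returned as a word of its own.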
Scheme does not have a built-in definition of a word in a string. On the one hand, this is inconvenient, since you have to define your own meaning of "word". On the other hand, it is the correct behavior, since the concept of a word varies significantly between applications, locales, encodings, and input sources.
What counts as a word also depends on the language being processed. Natural languages pluralize nouns, attach possessive endings, allow hyphenated compounds, and so forth. The regular expression you use must cover the range of words you expect to encounter.
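As an illustration of how much the choice of pattern matters, the sketch below runs two candidate definitions over the same input. It assumes a current Racket (for regexp-match*); the names letters-only, with-marks, and sample are made up for this example.

```scheme
#lang racket/base
;; Two candidate definitions of "word" (names are illustrative only).
(define letters-only (regexp "[A-Za-z]+"))   ; letters only
(define with-marks   (regexp "[A-Za-z'-]+")) ; letters, apostrophes, hyphens

(define sample "The dog's well-known trick")

(regexp-match* letters-only sample)
;; => ("The" "dog" "s" "well" "known" "trick")
(regexp-match* with-marks sample)
;; => ("The" "dog's" "well-known" "trick")
```

The first pattern splits possessives and hyphenated compounds apart; the second keeps them whole, at the cost of also accepting stray apostrophes and hyphens.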
The Perl-compatible regular expression module, pregexp, supports Perl's regular expression syntax, with the one proviso that backslash escapes such as \b must be written with two backslashes, i.e., \\b, because the Scheme string reader consumes the first one. Using the pregexp module, we can search based on word boundaries:
> (pregexp-match
    (pregexp "\\b([A-Za-z]+)\\b") "The quick brown fox")
("The" "The")
> (pregexp-match
    (pregexp "\\b([A-Za-z]+)\\b") "ended. Then we walked")
("ended" "ended")
> (pregexp-match
    (pregexp "\\s([A-Za-z]+)\\s") "The quick brown fox")
(" quick " "quick")
> (pregexp-match
    (pregexp "\\s([A-Za-z]+)\\s") "ended. Then we walked")
(" Then " "Then")
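In each result above, the first element is the whole match and the second is the captured group. The difference between \\b and \\s shows up more clearly when collecting every match: \\b is zero-width, while \\s consumes the surrounding whitespace. The sketch below assumes a current Racket, where pregexp and regexp-match* are part of racket/base; the names boundary-word and spaced-word are hypothetical.

```scheme
#lang racket/base
;; Assumption: a current Racket; pregexp values work with regexp-match*.
(define boundary-word (pregexp "\\b[A-Za-z]+\\b"))
(define spaced-word   (pregexp "\\s([A-Za-z]+)\\s"))

(define text "ended. Then we walked")

;; \\b consumes no characters, so every word is found.
(regexp-match* boundary-word text)
;; => ("ended" "Then" "we" "walked")

;; \\s consumes the space on both sides, so neighbouring words are
;; skipped and words at the ends of the string never match.
(regexp-match* spaced-word text)
;; => (" Then ")
```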
The pregexp module provides the "word" character class \w, which matches any character that may appear in a Perl identifier, that is, a letter, digit, or underscore. This is generally not what you want for natural-language words: it accepts digits and underscores but rejects apostrophes and hyphens.
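A short sketch of what \w actually picks up, assuming a current Racket (pregexp and regexp-match* from racket/base); the name perl-word is made up for this example.

```scheme
#lang racket/base
;; \\w+ matches runs of letters, digits, and underscores.
(define perl-word (pregexp "\\w+"))

(regexp-match* perl-word "don't stop foo_bar x2")
;; => ("don" "t" "stop" "foo_bar" "x2")
```

The contraction is split at the apostrophe, while foo_bar and x2 are accepted as words.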
Note that \\b and \\B are still useful. For example, "\\Bis\\B" matches the string "is" only in the interior of a word, not at either edge. So "whistle" would match, while "this" would not.
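The \\B behavior can be checked directly. This is a minimal sketch, assuming a current Racket; the name inner-is is hypothetical.

```scheme
#lang racket/base
;; \\B matches only where there is NO word boundary.
(define inner-is (pregexp "\\Bis\\B"))

(regexp-match inner-is "whistle") ; => ("is")
(regexp-match inner-is "this")    ; => #f  ("is" ends the word)
(regexp-match inner-is "is")      ; => #f  (both edges are boundaries)
```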
--
BrentAFulgham - 18 May 2004