11.2 The syntax of the regular grammar
A regular grammar is built by means of the form regular-grammar:
- bigloo syntax: regular-grammar (binding …) rule …
The binding and rule are defined by the following grammar:

  <binding>  → (<variable> <re>)
             | <option>
  <option>   → <variable>
  <rule>     → <define>
             | (<cre> <s-expression> <s-expression> …)
             | (else <s-expression> <s-expression> …)
  <define>   → (define <s-expression>)
  <cre>      → <re>
             | (context <symbol> <re>)
             | (when <s-expr> <re>)
             | (bol <re>)
             | (eol <re>)
             | (bof <re>)
             | (eof <re>)
  <re>       → <variable>
             | <char>
             | <string>
             | (: <re> …)
             | (or <re> …)
             | (* <re>)
             | (+ <re>)
             | (? <re>)
             | (= <integer> <re>)
             | (>= <integer> <re>)
             | (** <integer> <integer> <re>)
             | (... <integer> <re>)
             | (uncase <re>)
             | (in <cset> …)
             | (out <cset> …)
             | (and <cset> <cset>)
             | (but <cset> <cset>)
             | (posix <string>)
  <variable> → <symbol>
  <cset>     → <string>
             | <char>
             | (<string>)
             | (<char> <char>)

Here is a description of each construction.
(context <symbol> <re>)
This allows us to protect an expression. A protected expression matches (or accepts) a word only if the grammar has been set to the corresponding context. See section The semantics actions, for more details.
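As a rough sketch of how a context-protected rule might be used, the grammar below assumes the rgc-context procedure documented in the semantics actions section to install a context. The context name foo, the variable name *ctx-grammar* and the returned symbol are hypothetical choices made for illustration only.

```scheme
;; Sketch only: `foo`, `*ctx-grammar*` and `in-context` are illustrative names.
;; `rgc-context` is assumed to behave as described in the semantics
;; actions section (it installs the given symbol as the current context).
(define *ctx-grammar*
   (regular-grammar ()
      ;; Eligible only once the grammar has been set to the context foo.
      ((context foo "foo")
       'in-context)
      ;; Unprotected rule: install the context foo, then keep reading.
      ("foo"
       (rgc-context 'foo)
       (ignore))
      (else
       (the-failure))))

(print (read/rp *ctx-grammar* (open-input-string "foofoo")))
```

The protected rule is listed first so that, once the context is installed, it takes precedence over the unprotected rule matching the same word.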
(when <s-expr> <re>)
This allows us to protect an expression. A protected expression matches (or accepts) a word only if the evaluation of <s-expr> is #t. For instance,

  (define *g*
     (let ((armed #f))
        (regular-grammar ()
           ((when (not armed) (: "#!" (+ (or #\/ alpha))))
            (set! armed #t)
            (print "start [" (the-string) "]")
            (ignore))
           ((+ (in #\Space #\Tab))
            (ignore))
           (else
            (the-failure)))))

  (define (main argv)
     (let ((port (open-input-string "#!/bin/sh #!/bin/zsh")))
        (print (read/rp *g* port))))
(bol <re>)
Matches <re> at the beginning of line.
(eol <re>)
Matches <re> at the end of line.
(bof <re>)
Matches <re> at the beginning of file.
(eof <re>)
Matches <re> at the end of file.
<variable>
This is the name of a variable bound by a <binding> construction. In addition to user defined variables, some already exist. These are:
  all    ≡ (out #\Newline)
  lower  ≡ (in ("az"))
  upper  ≡ (in ("AZ"))
  alpha  ≡ (or lower upper)
  digit  ≡ (in ("09"))
  xdigit ≡ (uncase (in ("af09")))
  alnum  ≡ (uncase (in ("az09")))
  punct  ≡ (in ".,;!?")
  blank  ≡ (in #" \t\n")
  space  ≡ #\Space
It is an error to reference a variable that is not bound by a <binding>. Defining a variable that already exists is acceptable and causes the former variable definition to be erased. Here is an example of a grammar that binds two variables, one called ‘ident’ and one called ‘number’. These two variables are used within the grammar to match identifiers and numbers.
  (regular-grammar ((ident (: alpha (* alnum)))
                    (number (+ digit)))
     (ident (cons 'ident (the-string)))
     (number (cons 'number (the-string)))
     (else (cons 'else (the-failure))))
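The grammar above can be driven by repeated calls to read/rp. The following is a minimal sketch under a few assumptions: the names *lexer* and tokens are hypothetical, a whitespace rule has been added so that tokens can be separated by blanks, and the else clause is changed to pass the end-of-file object through unchanged so the loop can detect the end of input.

```scheme
;; Sketch only: `*lexer*` and `tokens` are illustrative names.
(define *lexer*
   (regular-grammar ((ident (: alpha (* alnum)))
                     (number (+ digit)))
      ;; Added for this sketch: skip blanks between tokens.
      ((+ (in #\Space #\Tab #\Newline))
       (ignore))
      (ident
       (cons 'ident (the-string)))
      (number
       (cons 'number (the-string)))
      (else
       ;; (the-failure) is either the offending character or the
       ;; end-of-file object; return the latter as-is.
       (let ((c (the-failure)))
          (if (eof-object? c)
              c
              (cons 'else c))))))

;; Collect every token read from the string STR into a list.
(define (tokens str)
   (let ((port (open-input-string str)))
      (let loop ((acc '()))
         (let ((tok (read/rp *lexer* port)))
            (if (eof-object? tok)
                (reverse acc)
                (loop (cons tok acc)))))))

(write (tokens "foo 42 bar7"))
(newline)
```

Because the longest match wins, bar7 is read as a single identifier rather than as an identifier followed by a number.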
<char>
The regular language described by one unique character. Here is an example of a grammar that accepts either the character #\a or the character #\b:

  (regular-grammar ()
     (#\a (cons 'a (the-string)))
     (#\b (cons 'b (the-string)))
     (else (cons 'else (the-failure))))
<string>
This simple form of regular expression denotes the language represented by the string. For instance, the regular expression "Bigloo" matches only the string composed of #\B #\i #\g #\l #\o #\o. The regular expression ".*[" matches the string #\. #\* #\[.
(: <re> ...)
This form constructs a sequence of regular expressions. That is, a form (: <re1> <re2> ... <ren>) matches the language constructed by concatenating the languages described by <re1>, <re2>, ..., <ren>. Thus, (: "x" all "y") matches all words of three letters that start with the character #\x and end with the character #\y.
(or <re> ...)
This construction denotes alternatives. The language described by (or re1 re2) accepts the words accepted by either re1 or re2.
(* <re>)
This is the Kleene operator. The language described by (* <re>) is the language containing 0 or more occurrences of <re>. Thus, the language described by (* "abc") accepts the empty word and any word composed of repetitions of abc (abc, abcabc, abcabcabc, ...).
(+ <re>)
This expression describes non-empty repetitions. The form (+ re) is equivalent to (: re (* re)). Thus, (+ "abc") matches the words abc, abcabc, etc.
(? <re>)
This expression describes zero or one occurrence. Thus, (? "abc") matches the empty word or the word abc.
(= <integer> <re>)
This expression describes a fixed number of repetitions. The form (= num re) is equivalent to (: re re ... re). Thus, the expression (= 3 "abc") matches only the word abcabcabc. In order to avoid code size explosion when compiling, <integer> must be smaller than an arbitrary constant. In the current version that value is 81.
(>= <integer> <re>)
The language described by the expression (>= int re) accepts words made of at least int repetitions of re. For instance, (>= 10 #\a) accepts words composed of at least 10 occurrences of the character #\a. In order to avoid code size explosion when compiling, <integer> must be smaller than an arbitrary constant. In the current version that value is 81.
(** <integer> <integer> <re>)
The language described by the expression (** min max re) accepts words that are repetitions of re; the number of repetitions is in the range min, max. For instance, (** 10 20 #\a) accepts words made of between 10 and 20 occurrences of the character #\a. In order to avoid code size explosion when compiling, <integer> must be smaller than an arbitrary constant. In the current version that value is 81.
(... <integer> <re>)
The subexpression <re> has to be a sequence of characters. Sequences are built by the operator : or by string literals. The language described by (... int re) denotes the first letter of re, or the first two letters of re, or the first three letters of re, and so on up to the first int letters of re. Thus, (... 3 "begin") is equivalent to (or "b" "be" "beg").
(uncase <re>)
The subexpression <re> has to be a sequence construction. The language described by (uncase re) is the same as re where letters may be upper case or lower case. For instance, (uncase "begin") accepts the words "begin", "beGin", "BEGIN", "BegiN", etc.
(in <cset> ...)
Denotes a union of characters. Characters may be described individually, as in (in #\a #\b #\c #\d). They may be described by strings. The expression (in "abcd") is equivalent to (in #\a #\b #\c #\d). Characters may also be described using a range notation, that is, a list of two characters. The expression (in (#\a #\d)) is equivalent to (in #\a #\b #\c #\d). Ranges may also be expressed using lists of strings. The expression (in ("ad")) is equivalent to (in #\a #\b #\c #\d).
(out <cset> ...)
The language described by (out cset ...) is the opposite of the one described by (in cset ...). For instance, (out ("azAZ") (#\0 #\9)) accepts all words of one character that are neither letters nor digits. One should note that although the character numbered zero may be used inside a regular grammar, the out construction never matches it. Thus, to write a rule that, for instance, matches every character but #\Newline, including the character zero, one should write:

  (or (out #\Newline) #a000)
(and <cset> <cset>)
The language described by (and cset1 cset2) accepts words made of characters that are in both cset1 and cset2.
(but <cset> <cset>)
The language described by (but cset1 cset2) accepts words made of characters of cset1 that are not members of cset2.
(posix <string>)
The expression (posix string) allows one to use Posix string notation for regular expressions. So, for example, the following two expressions are equivalent:

  (posix "[az]+|x*|y{3,5}")
  (or (+ (in ("az"))) (* "x") (** 3 5 "y"))
- bigloo syntax: string-case string rule ...
This form dispatches on strings. It opens an input port on string and reads from it according to the regular grammar defined by the binding and rule. Example:

  (define (suffix string)
     (string-case string
        ((: (* all) ".")
         (ignore))
        ((+ (out #\.))
         (the-string))
        (else
         "")))
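As a further hedged illustration of string-case, the sketch below extracts a leading decimal number from a string. The function name leading-number is hypothetical, and the conversion uses the standard string->number rather than any grammar-specific helper.

```scheme
;; Sketch only: `leading-number` is an illustrative name.
;; Returns the number formed by the leading digits of STR,
;; or #f when STR does not start with a digit.
(define (leading-number str)
   (string-case str
      ((+ digit)
       (string->number (the-string)))
      (else
       #f)))

(write (leading-number "42abc"))
(newline)
(write (leading-number "abc"))
(newline)
```

Only the first match is consumed: the first rule returns a value instead of calling (ignore), so the rest of the string is never read.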
This document was generated on March 31, 2014 using texi2html 5.0.