Abstract:
Frown is an LALR(k) parser generator for Haskell 98 written in
Haskell 98.
Its salient features are:
-
The generated parsers are time and space efficient. On the downside,
the parsers are quite large.
- Frown generates four different types of parsers. As a common
characteristic, the parsers are genuinely functional (ie
`table-free'); the states of the underlying LR automaton are encoded
as mutually recursive functions. Three output formats use a typed
stack representation, one format due to Ross Paterson
(code=stackless) works even without a stack.
- Encoding states as functions means that each state can be
treated individually as opposed to a table driven-approach, which
necessitates a uniform treatment of states. For instance, look-ahead
is only used when necessary to resolve conflicts.
- Frown comes with debugging and tracing facilities; the standard
output format due to Doaitse Swierstra (code=standard) may be
useful for teaching LR parsing.
- Common grammatical patterns such as repetition of symbols can
be captured using rule schemata. There are several predefined
rule schemata.
- Terminal symbols are arbitrary variable-free Haskell patterns
or guards. Both terminal and nonterminal symbols may have an
arbitrary number of synthesized attributes.
- Frown comes with extensive documentation; several example
grammars are included.
Furthermore, Frown supports the use of monadic lexers, monadic
semantic actions, precedences and associativity, the generation of
backtracking parsers, multiple start symbols, error reporting and a
weak form of error correction.
Contents
Frown is an LALR(k) parser generator for Haskell 98 written in
Haskell 98.
The work on Frown started as an experiment in generating genuinely
functional LR parsers. The first version was written within three
days—yes, Haskell is a wonderful language for rapid prototyping.
Since then Frown has gone through several cycles of reorganization
and rewriting. It also grew considerably: dozens of features were
added, examples were conceived and tested, and this manual was
written. In the end, Frown has become a useable tool. I hope you
will find it useful, too.
1.1 Obtaining and installing Frown
Obtaining Frown
The parser generator is available from
http://www.informatik.uni-bonn.de/~ralf/frown.
The bundle includes the sources and the complete documentation (dvi,
ps, PDF, and HTML).
Requirements
You should be able to build Frown with every Haskell 98-compliant
compiler. You have to use a not too ancient compiler as there have
been some changes to the Haskell language in Sep. 2001 (GHC 5.02 and
later versions will do).
The Haskell interpreter Hugs 98 is needed for running the testsuite.
Various tools are required to generate the documentation from scratch:
lhs2TeX, LATEX, functional , HEVEA and HACHA. Note, however, that
the bundle already includes the complete documentation.
Installation
Unzip and untar the bundle. This creates a directory called Frown.
Enter this directory.
| ralf> tar xzf frown.tar.gz |
| ralf> cd Frown
|
The documentation resides in the directory Manual; example grammars
can be found in Examples and Manual/Examples (the files ending in
.g and .lg).
You can install Frown using either traditional makefiles or Cabal.
Using makefiles
Optionally, edit the Makefile to specify destinations for the binary
and the documentation (this information is only used by make install). Now, you can trigger
which compiles Frown generating an executable called frown (to use
Frown you only need this executable). Optionally, continue with
to install the executable and the documentation.
For reference, here is a list of possible targets:
-
make
Compiles Frown generating an executable called frown (to use Frown
you only need this executable).
- make install
Compiles Frown and installs the executable and the documentation.
- make test
Runs the testsuite.1
- make man
Generates the documentation in various formats (dvi, ps, PDF, and
HTML).
- make clean
Removes some temporary files.
- make distclean
Removes all files except the ones that are included in the distribution.
Using Cabal
Alternatively, you can build Frown using Cabal (version 1.1.3 or
later), Haskell's Common Architecture for Building Applications and
Libraries.
For a global install, type:
| ralf/Frown> runhaskell Setup.hs configure --ghc |
| ralf/Frown> runhaskell Setup.hs build |
| ralf/Frown> runhaskell Setup.hs install
|
If you want to install Frown locally, use (you may wish to replace
$HOME by a directory of your choice):
| ralf/Frown> runhaskell Setup.hs configure --ghc --prefix=$HOME |
| ralf/Frown> runhaskell Setup.hs build |
| ralf/Frown> runhaskell Setup.hs install --user
|
Usage
The call
displays the various options. For more information consult this
manual.
1.2 Reporting bugs
Bug reports should be send to Ralf Hinze (ralf@cs.uni-bonn.de).
The report should include all information necessary to reproduce the
bug: the compiler used to compile Frown, the grammar source file (and
possibly auxiliary Haskell source files), and the command-line
invocation of Frown.
Suggestions for improvements or request for features should also be
sent to the above address.
1.3 License
Frown is distributed under the GNU general public licence
(version 2).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %
% Frown --- An LALR(k) parser generator for Haskell 98 %
% Copyright (C) 2001-2005 Ralf Hinze %
% %
% This program is free software; you can redistribute it and/or modify %
% it under the terms of the GNU General Public License (version 2) as %
% published by the Free Software Foundation. %
% %
% This program is distributed in the hope that it will be useful, %
% but WITHOUT ANY WARRANTY; without even the implied warranty of %
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the %
% GNU General Public License for more details. %
% %
% You should have received a copy of the GNU General Public License %
% along with this program; see the file COPYING. If not, write to %
% the Free Software Foundation, Inc., 59 Temple Place - Suite 330, %
% Boston, MA 02111-1307, USA. %
% %
% Contact information %
% Email: Ralf Hinze <ralf@cs.uni-bonn.de> %
% Homepage: http://www.informatik.uni-bonn.de/~ralf/ %
% Paper mail: Dr. Ralf Hinze %
% Institut für Informatik III %
% Universität Bonn %
% Römerstraße 164 %
% 53117 Bonn, Germany %
% %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1.4 Credits
Frown wouldn't have seen the light of day without the work of Ross
Paterson and Doaitse Swierstra. Ross invoked my interest in LR parsing
and he devised the --code=stackless and --code=gvstack output
formats. Doaitse invented the --code=standard format, which was
historically the first format Frown supported.
A big thank you goes to Andres Löh and Ross Paterson for various bug
reports and suggestions for improvement.
- 1
- There are some known problems. The
format code=stackless behaves differently for Loop.g (the
generated parser is less strict than the standard one). Also,
Empty.g does not work yet. Finally, error reports may differ for
different formats and for optimized and unoptimized versions (as some
parsers perform additional reductions before an error is
reported).
First install Frown as described in Sec. 1.1. Then enter
the directory QuickStart.
| ralf/Frown> cd QuickStart
|
The file Tiger.lg, listed in Fig. 2.1, contains a
medium-sized grammar for an imperative language.
A grammar file consists of two parts: the specification of the
grammar, enclosed in special curly braces, and Haskell source code.
The source file typically starts with a Haskell module header.
| > module Tiger where |
| > import Lexer |
| > import Syntax |
| > import Prelude hiding (exp) |
| > |
| > %{ |
|
The grammar part begins here. A context-free grammar consists of sets
of terminal and nonterminal symbols, a set of start symbols, and set
of productions or grammar rules. The declaration below introduces the
terminal symbols. Each terminal is given by a Haskell pattern of type
Terminal
.
| > Terminal = DO | ELSE | END | FUNCTION | IF |
| > | IN | LET | THEN | VAR | WHILE |
| > | ASSIGN as ":=" | COLON as ":" | COMMA as "," | CPAREN as ")" |
| > | DIV as "/" | EQU as "=" | LST as "<=" | MINUS as "-" |
| > | NEG as "~" | OPAREN as "(" | PLUS as "+" | SEMI as ";" |
| > | TIMES as "*" |
| > | ID {String} | INT {String};
|
|
A terminal symbol may carry a semantic value or attribute. The Haskell
type of the semantic value is given in curly braces. As a rule,
Haskell source code is always enclosed in curly braces within the
grammar part. The as-clauses define shortcuts for terminals, which may
then be used in the productions.
The declaration below introduces a nonterminal symbol called exp
followed by sixteen productions for that symbol. The asterix marks
exp as a start symbol; exp has a single attribute of type Expr
.
| > *exp {Expr}; |
| > exp {Var v} : ID {v}; |
| > {Block es} | "(", sepBy exp ";" {es}, ")"; |
| > {Int i } | INT {i}; |
| > {Un Neg e} | "-", exp {e}, prec "~"; |
| > {Call f es} | ID {f}, "(", sepBy exp "," {es}, ")"; |
| > {Bin e1 Eq e2} | exp {e1}, "=", exp {e2}; |
| > {Bin e1 Leq e2} | exp {e1}, "<=", exp {e2}; |
| > {Bin e1 Add e2} | exp {e1}, "+", exp {e2}; |
| > {Bin e1 Sub e2} | exp {e1}, "-", exp {e2}; |
| > {Bin e1 Mul e2} | exp {e1}, "*", exp {e2}; |
| > {Bin e1 Div e2} | exp {e1}, "/", exp {e2}; |
| > {Assign v e} | ID {v}, ":=", exp {e}; |
| > {IfThen e e1} | IF, exp {e}, THEN, exp {e1}; |
| > {IfElse e e1 e2} | IF, exp {e}, THEN, exp {e1}, ELSE, exp {e2}; |
| > {While e e1} | WHILE, exp {e}, DO, exp {e1}; |
| > {Let ds es} | LET, many dec {ds}, IN, sepBy exp ";" {es}, END;
|
|
Left-hand and right-hand side of a production are separated by a
colon; symbols on the right are separated by commas and terminated by
a semicolon. Alternative right-hand sides are separated by a vertical
bar.
The pieces in curly braces constitute Haskell source code. Each rule
can be seen as a function from the right-hand to the left-hand
side. On the right-hand side, Haskell variables are used to name the
values of attributes. The values of the attributes on the left-hand
side are given by Haskell expressions, in which the variables of the
right-hand side occur free.
The last production makes use of two (predefined) rule schemes: many x
implements the repetition of the symbol x, and sepBy x sep
denotes a repetition of x symbols separated by sep symbols.
The above productions are ambiguous as, for instance, 1 + 2 * 3 has
two derivations. The ambiguity can be resolved by assigning
precedences to terminal symbols.
| > left 7 "~"; left 6 "*"; left 6 "/"; left 5 "+"; left 5 "-"; |
| > right 0 THEN; right 0 ELSE; |
| > nonassoc 4 "<="; nonassoc 4 "="; nonassoc 0 DO; nonassoc 0 ":=";
|
|
The following declarations define the nonterminal dec and three
further nonterminals.
| > dec {Decl}; |
| > dec {d} : vardec {d}; |
| > {d} | fundec {d}; |
| > |
| > vardec {Decl}; |
| > vardec {Variable v e} : VAR, ID {v}, ":=", exp {e}; |
| > |
| > fundec {Decl}; |
| > fundec {Function f xs e} : FUNCTION, ID {f}, "(", sepBy (ID {}) "," {xs}, ")", "=", exp {e}; |
| > |
| > formal {(Ident, TyIdent)}; |
| > formal {(v, t)} : ID {v}, ":", ID {t}; |
| > |
| > }%
|
|
The grammar part ends here. The source file typically includes a
couple of Haskell declarations. The user-defined function frown is
the error routine invoked by the parser in case of a syntax error; its
definition is mandatory.
| > frown _ = error "syntax error" |
| > |
| > tc f = do { putStrLn "*** reading ..."; s <- readFile f; print s; |
| > putStrLn "*** lexing ..."; let {ts = lexer s}; print ts; |
| > putStrLn "*** parsing ..."; e <- exp ts; print e } |
|
Figure 2.1: A sample Frown grammar file.
Now, type
| ralf/Frown/QuickStart> frown -v Tiger.lg |
| ralf/Frown/QuickStart> hugs Tiger.hs |
| ... |
| Tiger> exp [ID "a", PLUS, ID "b", TIMES, INT "2"] >>= print |
| Bin (Var "a") Add (Bin (Var "b") Mul (Int "2")) |
| Tiger> tc "fib.tig" |
| ... |
The call frown -v Tiger.lg generates a Haskell parser which can then
be loaded into hugs (or ghci). The parser has type exp :: (Monad m) => [Terminal] -> m Expr, that is, the parser is a computation that
takes a list of terminals as input and returns an expression.
More examples can be found in the directory Manual/Examples:
-
Paren1.lg
well-balanced parentheses: a pure grammar (see Sec. 3.2.1);
- Paren2.lg
an extension of Paren1.lg illustrating the definition of attributes
(see Sec. 3.2.2);
- Calc.lg
a simple evaluator for arithmetic expressions: a parser that
interfaces with a separate lexer (see Sec. 3.2.3);
- MCalc.lg
a variant of the desktop calculator (Calc.lg) that prints all
intermediate results: illustrates monadic actions (see
Sec. 3.2.4);
- Let1.lg
an ambiguous expression grammar: illustrates backtracking parsers (see
Sec. 3.2.5);
- Let2.lg
an expression grammar: illustrates the use of precedences and
associativity (see Sec. 3.2.6);
- Let3.lg
a variant of the expression grammar: shows how to simulate inherited
attributes using a reader monad (see Sec. 3.2.8);
- Let4.lg
an expression grammar: illustrates a parser
that interfaces with a monadic lexer (see Sec. 3.3.1);
- Let5.lg
a variant of Let4.lg with better error reporting (see
Sec. 3.3.2);
- Let6.lg
a variant of Let5.lg with even better error reporting: prints a
list of expected tokens upon error (see Sec. 3.3.3);
- Let7.lg
yet another variant of the expression grammar: illustrates a simple
form of error correction (see Sec. 3.3.4);
- Let8.lg
variant of Let7.lg that notifies the user of corrections (see
Sec. 3.3.4);
- VarCalc.lg
a variant of the desktop calculator (Calc.lg) that works without a
separate lexer: illustrates guards (see Sec. 3.4.2);
- Paren3.lg
illustrates the tracing facilities (see Sec. 3.4.4);
- VarParen.lg
illustrates irrefutable patterns on the right-hand side of productions
(see Sec. 4.1);
- RepMin.lg
a solution to the rep-min problem: illustrates how to simulate
inherited attributes using functional attributes (see
Sec. 4.2).
This chapter introduces Frown by means of example.
3.1 Preliminaries: monads
Some elementary knowledge of monads is helpful in order to
use Frown effectively. For the most basic applications, however, one
can possibly do without. This section summarizes the relevant facts.
In Haskell, the concept of a monad is captured by the following
class definition.
| > class Monad m where |
| > return :: a -> m a |
| > (>>=) :: m a -> (a -> m b) -> m b |
| > (>>) :: m a -> m b -> m b |
| > fail :: String -> m a |
| > |
| > m >> n = m >>= const n |
| > fail s = error s
|
|
The essential idea of monads is to distinguish between
computations and values. This distinction is
reflected on the type level: an element of m a represents a
computation that yields a value of type a. The trivial or pure
computation that immediately returns the value a is denoted return a. The operator (>>=), commonly called `bind', combines two
computations: m >>= k applies k to the result of the computation
m. The derived operation (>>) provides a handy shortcut if one is
not interested in the result of the first computation. Finally, the
operation fail is useful for signaling error conditions (a common
thing in parsing).
Framing the concept of a monad as a type class is sensible for at
least two interrelated reasons. First, we can use the same names
(return, `>>=', and fail) for wildly different computational
structures.1 Second, by overloading a function with
the monad class we effectively parameterize the function by
computational structures, that is, we can call the same function with
different instances of monads obtaining different computational
effects.
The following instance declaration (Result.lhs) defines a simple
monad that we will use intensively in the sequel (the monad can be
seen as a simplified term implementation of the basic monad
operations).
| > module Result where |
| > |
| > data Result a = Return a | Fail String |
| > deriving (Show) |
| > |
| > instance Monad Result where |
| > return = Return |
| > Fail s >>= k = Fail s |
| > Return a >>= k = k a |
| > fail = Fail
|
|
In monad speak, this is an exception monad: a computation
in Result either succeeds gracefully yielding a value a
(represented by the term Return a) or it fails with an error message
s (represented by Fail s). That's all we initially need for
Frown: parsing a given input either succeeds producing a semantic
value (sometimes called an attribution) or it fails (hopefully, with a
clear indication of the syntax error).
3.2 Basic features
3.2.1 Pure grammars
Let's start with a simple example. The following complete Frown
source file (Paren1.lg2) defines the language of
well-balanced parentheses. The specification of the grammar is
enclosed in special curly braces `%{ ldots }%'. The remainder
contains Haskell source code, that is, a module header and a function
declaration.
| > module Paren where |
| > import Result |
| > |
| > %{ |
| > |
| > Terminal = '(' | ')'; |
| > |
| > Nonterminal = paren; |
| > |
| > paren : ; |
| > paren : paren, '(', paren, ')'; |
| > |
| > }% |
| > |
| > frown _ = fail "syntax error" |
|
The part enclosed in special curly braces comprises the typical
ingredients of a context-free grammar: a declaration of
the terminal symbols, a declaration of the
nonterminal symbols, and finally the
productions or grammar rules.
In general, the terminal symbols are given by Haskell patterns of the
same type. Here, we have two character patterns of type Char.
Nonterminals are just identifiers starting with a lower-case letter.
By convention, the first nonterminal is also the start symbol of
the grammar (this default can be overwritten, see Sec. 3.2.7).
Productions have the general form n : v_1, ldots, v_k; where n is
a nonterminal and v_1, ..., v_k are symbols. Note that the
symbols are separated by commas and terminated by a semicolon. The
mandatory trailing semicolon helps to identify so-called
є-productions, productions with an empty right-hand
side, such as paren : ;.
As a shorthand, we allow to list several alternative right-hand sides
separated by a vertical bar. Thus, the above productions could have
been written more succinctly as
| > paren : ; |
| > | paren, '(', paren, ')';
|
|
The two styles can be arbitrarily mixed. In fact, it is not even
required that productions with the same left-hand side are grouped
together (though it is good style to do so).
Now, assuming that the above grammar resides in a file called
Paren.g we can generate a Haskell parser by issuing the command
frown Paren.g
This produces a Haskell source file named Paren.hs that contains
among other things the function
| > paren :: (Monad m) => [Char] -> m () ,
|
|
which recognizes the language generated by the start symbol of the
same name. Specifically, if inp is a list of characters, then
paren inp is a computation that either succeeds indicating that
inp is a well-formed parentheses or fails indicating that inp
isn't well-formed. Here is a short interactive session using the
Haskell interpreter Hugs (type hugs Paren.hs at the command line).
| > Paren>> paren "(())()" :: Result () |
| > Return () |
| > Paren>> paren "(())(" :: Result () |
| > Fail "syntax error" |
|
Note that we have to specify the result type of the expressions in
order to avoid an unresolved overloading error. Or to put it
differently, we have to specify the monad, in which the parsing
process takes place. Of course, we are free to assign paren a more
constrained type by placing an appropriate type signature in the
Haskell section of the grammar file:
| > paren :: [Char] -> Result () .
|
|
By the way, since the nonterminal paren carries no semantic value,
the type of the computation is simply Result () where the empty
tuple type `()' serves as a dummy type. In the next section we will
show how to add attributes or semantic values to nonterminals.
Every once in a while parsing fails. In this case, Frown calls a
user-supplied function named, well, frown (note that you must supply
this function). In our example, frown has type
| > frown :: (Monad m) => [Char] -> m a
|
|
The error function frown is passed the remaining input as an
argument, that you can give an indication of the location of the
syntax error (more on error reporting in
Sec. 3.3). Note that frown must be polymorphic
in the result type.
Remark 1
For those of you who are knowledgable and/or interested in
LR parsing, Fig. 3.1 displays the Haskell file that is
generated by frown Paren.g3. For each state i of the underlying
LR(
0)
automaton, displayed in Fig. 3.2,
there is one function called parse_i. All these functions take two
arguments: the remaining input and a stack that records the
transitions of the LR(
0)
machine. The reader is
invited to trace the parse of "(())()".
| > module Paren where |
| > import Result |
| > |
| > {- frown :-( -} |
| > |
| > data Stack = Empty |
| > | T_1_2 Stack |
| > | T_2_3 Stack |
| > | T_2_5 Stack |
| > | T_4_5 Stack |
| > | T_4_6 Stack |
| > | T_5_4 Stack |
| > |
| > data Nonterminal = Paren |
| > |
| > paren tr = parse_1 tr Empty >>= (\ Paren -> return ()) |
| > |
| > parse_1 ts st = parse_2 ts (T_1_2 st) |
| > |
| > parse_2 tr@[] st = parse_3 tr (T_2_3 st) |
| > parse_2 ('(' : tr) st = parse_5 tr (T_2_5 st) |
| > parse_2 ts st = frown ts |
| > |
| > parse_3 ts (T_2_3 (T_1_2 st)) = return Paren |
| > |
| > parse_4 ('(' : tr) st = parse_5 tr (T_4_5 st) |
| > parse_4 (')' : tr) st = parse_6 tr (T_4_6 st) |
| > parse_4 ts st = frown ts |
| > |
| > parse_5 ts st = parse_4 ts (T_5_4 st) |
| > |
| > parse_6 ts (T_4_6 (T_5_4 (T_2_5 (T_1_2 st)))) = |
| > = parse_2 ts (T_1_2 st) |
| > parse_6 ts (T_4_6 (T_5_4 (T_4_5 (T_5_4 st)))) |
| > = parse_4 ts (T_5_4 st) |
| > |
| > {- )-: frown -} |
| > |
| > frown _ = fail "syntax error" |
|
Figure 3.1: A Frown generated parser.
Figure 3.2: The LR(0) automaton underlying the parser of Fig. 3.1.
3.2.2 Attributes
Now, let's augment the grammar of Sec. 3.2.1 by semantic
values (Paren2.lg). Often, the parser converts a given input into
some kind of tree representation (the so-called abstract
syntax tree). To represent nested parentheses we simply use binary
trees (an alternative employing n-ary trees is given in
Sec. 4.1).
| > module Paren where |
| > import Result |
| > |
| > data Tree = Leaf | Fork Tree Tree |
| > deriving (Show) |
| > |
| > %{ |
| > |
| > Terminal = '(' | ')'; |
| > |
| > Nonterminal = paren {Tree}; |
| > |
| > paren {Leaf} : ; |
| > {Fork t u} | paren {t}, '(', paren {u}, ')'; |
| > |
| > |
| > }% |
| > |
| > frown _ = fail "syntax error" |
|
Attributes are always given in curly braces. When we declare a
nonterminal, we have to specify the types of its attributes as in
paren {Tree}. The rules of the grammar can be seen as functions from
the right-hand side to the left-hand side. On the right-hand side,
Haskell variables are used to name the values of attributes. The
values of the attributes on the left-hand side are then given by
Haskell expressions, in which the variables of the right-hand side may
occur free. The Haskell expressions can be arbitrary, except that they
must not be layout-sensitive.
In general, a nonterminal may have an arbitrary number of attributes
(see Sec. 4.4 for an example). Note that Frown only
supports so-called synthesized attributes
(inherited attributes can be simulated, however, with the
help of a reader monad, see Sec. 3.2.8, or with functional
attributes, see Sec. 4.2).
The parser generated by Frown now has type
| > paren :: (Monad m) => [Char] -> m Tree .
|
|
The following interactive session illustrates its use.
| > Paren>> paren "(())()" :: Result Tree |
| > Return (Fork (Fork Leaf (Fork Leaf Leaf)) Leaf) |
| > Paren>> paren "(())(" :: Result Tree |
| > Fail "syntax error" |
|
3.2.3 Interfacing with a lexer
The parsers of the two previous sections take a list of characters as
input. In practice, a parser usually does not work on character
streams directly. Rather, it is prefaced by a lexer that first
converts the characters into a list of so-called
tokens. The separation of the lexical analysis from the
syntax analysis usually leads to a clearer design and as a benevolent
side-effect it also improves efficiency (Sec. 3.4.2
shows how to combine lexing and parsing in Frown, though).
A simple token type is shown in Fig 3.3
(Terminal1.lhs). (Note that the type comprises more constructors
than initially needed.)
| > module Terminal where |
| > import Maybe |
| > |
| > data Op = Plus | Minus | Times | Divide |
| > deriving (Show) |
| > |
| > name :: Op -> String |
| > name Plus = "+" |
| > name Minus = "-" |
| > name Times = "*" |
| > name Divide = "/" |
| > |
| > app :: Op -> (Int -> Int -> Int) |
| > app Plus = (+) |
| > app Minus = (-) |
| > app Times = (*) |
| > app Divide = div |
| > |
| > data Terminal = Numeral Int |
| > | Ident String |
| > | Addop Op |
| > | Mulop Op |
| > | KWLet |
| > | KWIn |
| > | Equal |
| > | LParen |
| > | RParen |
| > | EOF |
| > deriving (Show) |
| > |
| > ident, numeral :: String -> Terminal |
| > ident s = fromMaybe (Ident s) (lookup s keywords) |
| > numeral s = Numeral (read s) |
| > |
| > keywords :: [(String, Terminal)] |
| > keywords = [ ("let", KWLet), ("in", KWIn) ]
|
|
Figure 3.3:
The type of terminals (
Terminal1.lhs).
Fig. 3.4 (Lexer.lhs) displays a simple lexer for
arithmetic expressions, which are built from numerals using the
arithmetic operators `+', `-', `*', and `/'.
| > module Lexer (module Terminal, module Lexer) where |
| > import Char |
| > import Terminal |
| > |
| > lexer :: String -> [Terminal] |
| > lexer [] = [] |
| > lexer ('+' : cs) = Addop Plus : lexer cs |
| > lexer ('-' : cs) = Addop Minus : lexer cs |
| > lexer ('*' : cs) = Mulop Times : lexer cs |
| > lexer ('/' : cs) = Mulop Divide : lexer cs |
| > lexer ('=' : cs) = Equal : lexer cs |
| > lexer ('(' : cs) = LParen : lexer cs |
| > lexer (')' : cs) = RParen : lexer cs |
| > lexer (c : cs) |
| > | isAlpha c = let (s, cs') = span isAlphaNum cs in ident (c : s) : lexer cs' |
| > | isDigit c = let (s, cs') = span isDigit cs in numeral (c : s) : lexer cs' |
| > | otherwise = lexer cs
|
|
Figure 3.4:
A simple lexer for arithmetic expressions (
Lexer.lhs).
The following grammar, which builds upon the lexer, implements a
simple evaluator for arithmetic expressions (Calc.lg).
| > module Calc where |
| > import Lexer |
| > import Result |
| > |
| > %{ |
| > |
| > Terminal = Numeral {Int} |
| > | Addop {Op} |
| > | Mulop {Op} |
| > | LParen as "(" |
| > | RParen as ")"; |
| > |
| > Nonterminal = expr {Int} |
| > | term {Int} |
| > | factor {Int}; |
| > |
| > expr {app op v1 v2} : expr {v1}, Addop {op}, term {v2}; |
| > {e} | term {e}; |
| > term {app op v1 v2} : term {v1}, Mulop {op}, factor {v2}; |
| > {e} | factor {e}; |
| > factor {n} : Numeral {n}; |
| > {e} | "(", expr {e}, ")"; |
| > |
| > }% |
| > |
| > frown _ = fail "syntax error" |
|
The terminal declaration now lists patterns of type Terminal. Note
that terminals may also carry semantic values. The single argument of
Numeral, for instance, records the numerical value of the numeral.
When declaring a terminal we can optionally define a shortcut using an
as-clause as, for example, in LParen as "(". The shortcut can be
used in the productions possibly improving their readability.
Here is an example session demonstrating the evaluator.
| > Calc>> lexer "4 * (7 + 1)" |
| > [Numeral 4,Mulop Times,LParen,Numeral 7,Addop Plus,Numeral 1,RParen] |
| > Calc>> expr (lexer "4711") :: Result Int |
| > Return 4711 |
| > Calc>> expr (lexer "4 * (7 + 1) - 1") :: Result Int |
| > Return 31 |
| > Calc>> expr (lexer "4 * (7 + 1 - 1") :: Result Int |
| > Fail "syntax error" |
|
3.2.4 Monadic actions
The expression that determines the value of an attribute is usually a
pure one. It is, however, also possible to provide a monadic action
that computes the value of the attribute. The computation lives
in the underlying parsing monad. Monadic actions are enclosed in `{% ldots }' braces and have type m t where m is the type of the
underlying monad and t is the type of attributes.
As an example, the following variant of the desktop calculator
(MCalc.lg) prints all intermediate results (note that we only list
the changes to the preceeding example).
| > trace :: Op -> (Int -> Int -> IO Int) |
| > trace op v1 v2 = putStrLn s >> return v |
| > where v = app op v1 v2 |
| > s = show v1 ++ name op ++ show v2 ++ "=" ++ show v
|
|
| > expr {% trace op v1 v2} : expr {v1}, Addop {op}, term {v2}; |
| > term {% trace op v1 v2} : term {v1}, Mulop {op}, factor {v2};
|
|
The following session illustrates its working.
| > MCalc>> expr (lexer "4711") |
| > 4711 |
| > MCalc>> expr (lexer "4 * (7 + 1) - 1") |
| > 7+1=8 |
| > 4*8=32 |
| > 32-1=31 |
| > 31 |
| > MCalc>> expr (lexer "4 * (7 + 1 - 1") |
| > 7+1=8 |
| > Program error: user error (syntax error)
|
|
In general, monadic actions are useful for performing `side-effects' (for example,
in order to parse %include directives)
and for interaction with a monadic lexer (see Sec. 3.3.1).
3.2.5 Backtracking parsers
In the previous examples we have encoded the precedences of the
operators (`*' binds more tightly than `+') into the productions
of the grammar. However, this technique soon becomes unwieldy for a
larger expression language. So let's start afresh. The grammar file
shown in Fig. 3.5 (Let1.lg) uses only a single
nonterminal for expressions (we have also extended expressions by
local definitions).
| > module Let where |
| > import Lexer |
| > import Monad |
| > |
| > data Expr = Const Int | Var String | Bin Expr Op Expr | Let Decl Expr |
| > deriving (Show) |
| > |
| > data Decl = String :=: Expr |
| > deriving (Show) |
| > |
| > %{ |
| > |
| > Terminal = Numeral {Int} |
| > | Ident {String} |
| > | Addop {Op} |
| > | Mulop {Op} |
| > | KWLet as "let" |
| > | KWIn as "in" |
| > | Equal as "=" |
| > | LParen as "(" |
| > | RParen as ")"; |
| > |
| > expr {Expr}; |
| > expr {Const n} : Numeral {n}; |
| > {Var s} | Ident {s}; |
| > {Bin e1 op e2} | expr {e1}, Addop {op}, expr {e2}; |
| > {Bin e1 op e2} | expr {e1}, Mulop {op}, expr {e2}; |
| > {Let d e} | "let", decl {d}, "in", expr {e}; |
| > {e} | "(", expr {e}, ")"; |
| > |
| > decl {Decl}; |
| > decl {s :=: e} : Ident {s}, "=", expr {e}; |
| > |
| > }% |
| > |
| > frown _ = fail "syntax error" |
|
Figure 3.5:
An ambiguous grammar (
Let1.lg).
Also note that the grammar has no Nonterminal declaration. Rather,
the terminal symbols are declared by supplying type signatures before
the respective rules. Generally, type signatures are preferable to a
Nonterminal declaration if the grammar is long.
Of course, the rewritten grammar is no longer LALR(k) simply because
it is ambiguous. For instance, `1+2*3' can be parsed as Bin (Const 1) Plus (Bin (Const 2) Times (Const 3)) or as Bin (Bin (Const 1) Plus (Const 2)) Times (Const 3). Frown is also unhappy with the
grammar: it reports six shift/reduce conflicts:
| * warning: 6 shift/reduce conflicts
|
This means that Frown wasn't able to produce a deterministic parser.
Or rather, it produced a deterministic parser by making some arbitrary
choices to avoid non-determinism (shifts are preferred to reductions,
see Sec. 3.2.6). However, we can also instruct Frown to
produce a non-deterministic parser, that is, one that generates all
possible parses of a given input. We do so by supplying the option
--backtrack:
frown --backtrack Let.g
The generated parser expr now has type
| > expr :: (MonadPlus m) => [Terminal] -> m Expr .
|
|
Note that the underlying monad must be an instance of MonadPlus
(defined in the standard library Monad). The list monad and the
Maybe monad are both instances of MonadPlus. The following session
shows them in action.
| > Let>> expr (lexer "1 + 2 - 3 * 4 / 5") :: [Expr] |
| > [Bin (Const 1) Plus (Bin (Const 2) Minus (Bin (Const 3) Times (Bin (Const 4) Divide (Const 5)))),Bin (Const 1) Plus (Bin (Const 2) Minus (Bin (Bin (Const 3) Times (Const 4)) Divide (Const 5))),Bin (Const 1) Plus (Bin (Bin (Const 2) Minus (Bin (Const 3) Times (Const 4))) Divide (Const 5)),Bin (Bin (Const 1) Plus (Bin (Const 2) Minus (Bin (Const 3) Times (Const 4)))) Divide (Const 5),Bin (Const 1) Plus (Bin (Bin (Const 2) Minus (Const 3)) Times (Bin (Const 4) Divide (Const 5))),Bin (Const 1) Plus (Bin (Bin (Bin (Const 2) Minus (Const 3)) Times (Const 4)) Divide (Const 5)),Bin (Bin (Const 1) Plus (Bin (Bin (Const 2) Minus (Const 3)) Times (Const 4))) Divide (Const 5),Bin (Bin (Const 1) Plus (Bin (Const 2) Minus (Const 3))) Times (Bin (Const 4) Divide (Const 5)),Bin (Bin (Bin (Const 1) Plus (Bin (Const 2) Minus (Const 3))) Times (Const 4)) Divide (Const 5),Bin (Bin (Const 1) Plus (Const 2)) Minus (Bin (Const 3) Times (Bin (Const 4) Divide (Const 5))),Bin (Bin (Const 1) Plus (Const 2)) Minus (Bin (Bin (Const 3) Times (Const 4)) Divide (Const 5)),Bin (Bin (Bin (Const 1) Plus (Const 2)) Minus (Bin (Const 3) Times (Const 4))) Divide (Const 5),Bin (Bin (Bin (Const 1) Plus (Const 2)) Minus (Const 3)) Times (Bin (Const 4) Divide (Const 5)),Bin (Bin (Bin (Bin (Const 1) Plus (Const 2)) Minus (Const 3)) Times (Const 4)) Divide (Const 5)] |
| > Let>> expr (lexer "1 + - 3 * 4 / 5") :: [Expr] |
| > [] |
| > Let>> expr (lexer "1 + 2 - 3 * 4 / 5") :: Maybe Expr |
| > Just (Bin (Const 1) Plus (Bin (Const 2) Minus (Bin (Const 3) Times (Bin (Const 4) Divide (Const 5)))))
|
|
The list monad supports `deep backtracking': all possible parses are
returned (beware: the number grows exponentionally). The Maybe monad
implements `shallow backtracking': it commits to the first solution
(yielding the same results as the parser generated without the
option --backtrack).
3.2.6 Precedences and associativity
Instead of resorting to a backtracking parser we may also help Frown
to generate the `right' deterministic parser by assigning
precedences to terminal symbols. The understand the
working of precedences it is necessary to provide some background of
the underlying parsing technique.
LR parsers work by repeatedly performing two operations:
shifts and reductions. A shift moves a
terminal from the input onto the stack, the auxiliary data structure
maintained by the parser. A reduction replaces a top segment of the
stack matching the right-hand side of a production by its left-hand side.
Parsing succeeds if the input is empty and the stack consists of a
start symbol only. As an example, consider parsing `N*N+N'.
|
|
| |
N*N+N |
|
shift |
| N |
*N+N |
|
reduce by e : N; |
| e |
*N+N |
|
shift |
| e* |
N+N |
|
shift |
| e*N |
+N |
|
reduce by e : N; |
| e*e |
+N |
|
|
|
At this point, there are two possibilities: we can either perform a
reduction (using the production e : e, *, e;) or shift the next
input symbol. Both choices are viable.
|
|
| e*e |
+N |
|
reduce by e : e, *, e; |
| e |
+N |
|
shift |
| e+ |
N |
|
shift |
| e+N |
|
|
reduce by e : N; |
| e+e |
|
|
reduce by e : e, +, e; |
| e |
|
|
|
|
|
| e*e |
+N |
|
shift |
| e*e+ |
N |
|
shift |
| e*e+N |
|
|
reduce by e : N; |
| e*e+e |
|
|
reduce by e : e, +, e; |
| e*e |
|
|
reduce by e : e, * , e; |
| e |
|
|
|
|
Alas, the two choices also result in different parse trees. By default,
Frown prefers shifts to reductions. As a consequence, N*N+N is parsed
as N*(N+N), that is, `+' binds more tightly than `*'.
Now, we can direct the resolution of conflicts by assigning
precedences and associativity to terminal
symbols. The following declarations will do in our example (Let2.g).
| > left 6 Addop {}; |
| > left 7 Mulop {}; |
| > nonassoc 0 "in";
|
|
Thus, `*' takes precedence over `+', which in turn binds more
tightly than `in'. For instance, let a = 4 in a + 2 is parsed as
let a = 4 in (a + 2). A conflict between two symbols of equal
precedence is resolved using associativity: the succession
1+2+3 of left-associative operators is grouped as (1+2)+3;
likewise for right-associative operators; sequences of non-associative
operators are not well-formed.
Given the fixity declarations above Frown now produces the `right'
deterministic parser, which can be seen in action below.
| > Let>> expr (lexer "4 * (7 + 1) - 1") :: Result Expr |
| > Return (Bin (Bin (Const 4) Times (Bin (Const 7) Plus (Const 1))) Minus (Const 1)) |
| > Let>> expr (lexer "4 * 7 + 1 - 1") :: Result Expr |
| > Return (Bin (Bin (Bin (Const 4) Times (Const 7)) Plus (Const 1)) Minus (Const 1)) |
| > Let>> expr (lexer "let\n a = 4 * (7 + 1) - 1\n in a * a") :: Result Expr |
| > Return (Let ("a" :=: Bin (Bin (Const 4) Times (Bin (Const 7) Plus (Const 1))) Minus (Const 1)) (Bin (Var "a") Times (Var "a"))) |
| > Let>> expr (lexer "let\n a = 4 * (7 + 1 - 1\n in a * a") :: Result Expr |
| > Fail "syntax error" |
|
In general, a conflict between the actions `reduce by rule r' and
`shift terminal t' is resolved as follows (the precedence of a rule
is given by the precedence of the rightmost terminal on the right-hand
side):
|
|
| condition |
|
action |
|
example |
|
|
| prec r < prec t |
|
shift |
|
reduce by e:e,+,e; versus shift * |
|
| |
left t |
reduce |
|
reduce by e:e,*,e; versus shift * |
| prec r = prec t |
right t |
shift |
|
reduce by e:e,++,e; versus shift ++ |
| |
nonassoc t |
fail |
|
reduce by e:e,==,e; versus shift == |
|
| prec r > prec t |
|
reduce |
|
reduce by e:e,*,e; versus shift + |
|
|
Just in case you may wonder: there are no shift/shift conflicts by
construction; reduce/reduce conflicts cannot be cured using
precedences and associativity.
3.2.7 Multiple start symbols
A grammar may have several start symbols. In this case, Frown
generates multiple parsers, one for each start symbol (actually, these
are merely different entry points into the LR(0)
automaton4). We mark a symbol as a start symbol simply by putting
an asterix before its declaration (either in a Nonterminal
declaration or in a separate type signature). Consider our previous
example: most likely we want parsers both for expressions and
declarations. Thus, we write
| > *expr {Expr}; |
| > *decl {Decl};
|
|
and