Elkhound: A GLR Parser Generator
and
Elsa: An Elkhound-based C++ Parser

Norwegian Elkhound

Elkhound is a parser generator, similar to Bison. The parsers it generates use the Generalized LR (GLR) parsing algorithm. GLR works with any context-free grammar, whereas LR parsers (such as Bison) require grammars to be LALR(1).

Parsing with arbitrary context-free grammars has two key advantages: (1) unbounded lookahead, and (2) support for ambiguous grammars. Unbounded lookahead is achieved by allowing multiple potential parses to coexist for as long as necessary. Similarly, ambiguity is handled by letting potential parses be coalesced, with special action taken to handle the ambiguous region of the input. In general, by using the GLR algorithm, Elkhound avoids the difficulty of trying to make a language grammar fit the LALR(1) restrictions.

What's more, the Elkhound implementation of GLR is competitive with good LR implementations (particularly Bison) for grammars and grammar fragments that are LALR(1). Its high performance is a result of a novel combination of GLR and LR parsing, wherein the parser decides on a token-by-token basis which algorithm to use. Programmers can start with a grammar that is convenient (but perhaps slower to parse with), and gradually modify it to conform to LALR(1) in places for improved performance. Grammars that are already LALR(1) will work in Elkhound unmodified (though the input syntax is different), and with parsing speeds within a few percent of Bison-generated parsers.

The reason I wrote Elkhound is to be able to write a C++ parser. The parser is called Elsa, and is included in the distribution below.

Elkhound is written in C++, and can generate parsers written in either C++ or Ocaml. Elsa is written entirely in C++, and parses C and C++ input code.

Documentation

Downloads

How much C++ can Elsa parse?

(Info as of 2005-08-22.)

Elsa can parse most C++ "in the wild". It has been tested with some notable large programs, including Mozilla, Qt, ACE, and itself. I have not tried parsing KDE recently, so that's the next major goal.

Elsa supports most of the GCC extensions, and several important GCC bugs. Currently, gcc-3.4.x is the main compatibility target; GCC bugs that are fixed in 3.4 are (in most cases) not supported. It also supports a couple of MSVC bugs, but no extensions (other than those that overlap with GCC extensions), and no bugs/extensions for other compilers. Supported bugs are documented in sources/elsa/cc_lang.h.

Elsa has recently gained the ability to parse all of the gcc-3.4.3 standard library headers (except for valarray, which uses template template parameters). Previously, the recommended practice was to preprocess your program using gcc-2.95.3, but it now should work to use the 3.x headers, especially 3.4.3 or later. Unfortunately, the headers that come with gcc-3.3.x and earlier contain bugs, and Elsa does not currently emulate the gcc bugs that tolerate them. (Note that the 3.x headers are much more complicated than the 2.x headers, so they take significantly longer to parse and do template instantiation.) I have not yet tried the 4.x headers.

The one major feature that is not implemented at all is template template parameters and arguments (passing entire, uninstantiated templates as arguments to other templates). The only place I have seen them used is in the valarray header; let me know if this feature is a priority to you.

Additionally, there are many known "minor" bugs. See elsa/regrtest for details. The known failures are mostly pretty esoteric; in a few cases, they are (valid) inputs that no front end I know of can handle.

If you have a program you want to parse, and Elsa fails to, then it is likely that a small "point fix" can be found quickly to solve the problem. Send me a preprocessed .i file if you want help. Even better, use Delta to minimize the input (such that, say, gcc-3.4 accepts it but Elsa does not). Sending me minimized test cases that cause Elsa to fail is the among the best ways to help me make Elsa better!

How much C can Elsa parse?

Elsa supports parsing C by basically parsing it as C++ and then applying looser typechecking rules (the details are more complicated, but that is the basic approach). This support is enabled with the "-tr c_lang" argument to ccparse.

In C mode, Elsa can parse most C programs, including the Linux kernel (our highest-priority C program). It handles most gcc extensions, including K&R function definitions and the "implicit int" rule. There is a good chance it will parse your C program.

However, since Elsa still applies C++ rules in some places in C mode, Elsa may reject valid but "funky" C programs. Support for the looser C rules is being gradually added, and is usually easy to add; let me know if you want help parsing some C program.

Relationship between Elsa, Elkhound, and CIL; and Ocaml vs C++

CIL, the C Intermediate Language, is a C front-end written in Ocaml. George Necula, Wes Weimer and I wrote CIL as infrastructure for the CCured ("see-cured") project, which is a source-to-source translator for C that adds run-time checks for memory safety. CCured is also written in Ocaml.

Elsa and Elkhound are, at the moment, completely separate from CIL. I wrote them because our experiments extending CIL to handle C++ were unsuccessful. I chose to write them in C++ instead of Ocaml because I prefer C++; the rest of the group prefers Ocaml, which is why CIL and CCured use that language instead.

Long term, I would love to find a good way to merge Elsa and CIL. At the very least it should be possible to use Elsa to parse C++ into an AST, and then feed that to CIL so those that want to can write their analysis in Ocaml and the rest of the CIL infrastructure. Even better, I would like to make it possible to move ASTs back and forth among several tools (possibly written in different languages), retaining not just the AST forms but the annotations computed by prior analyses. There is ongoing effort to export the Elsa AST as an XML document; we expect to be able to advertise this in the next public release.

Contact

Are you using Elkhound? Do you want to be notified of new releases? Drop me a line:

  smcpeak cs berkeley edu
         @  .        .

If you want to report a bug in Elsa, please see the bug reporting guidelines for Elsa.

2006-12-09: It has been quite some time since I have had a lot of time to work on Elsa. The current version of Elsa is being hosted at www.cubewano.org by Karl Chen, and he and Daniel Wilkerson are doing most of the maintenance. The most active page is the Oink page; Oink is a project that includes Elsa. I plan to move the Elsa pages to someplace else reasonably soon.

Authors

(See also Elsa Contributors.)

Some other projects using Elkhound:

Changes

Major changes in version 2005.08.22b (matrix):

Major changes in version 2005.08.22 (matrix):

Major changes in version 2005.05.28 (matrix):

Major changes in version 2005.05.03 (matrix):

Major changes in version 2005.03.03 (matrix):

Major changes in version 2005.02.12 (matrix):

Major changes in version 2004.11.19 (matrix):

Major changes in version 2004.11.13 (matrix):

Major changes in version 2004.08.21 (matrix):

Major changes in version 2004.02.13 (matrix):

Major changes in version 2004.02.07 (matrix):

Major changes in version 2004.02.06 (matrix):

Major changes in version 2003.08.16:

Major changes in version 2003.07.22:

Major changes in version 2003.01.10:

Major changes in version 2003.01.06:

Major changes in version 2002.10.14:

Major changes in version 2002.10.04:

Major changes in version 2002.09.16:

Original public release: 2002.08.12.

Related Links

I've collected here some links to related projects. I don't claim these lists are exhaustive, but I'm happy to add links you want to send me. For parser generators, I include both the underlying parsing algorithm, and the license under which it is available. (GPL=GNU General Public License, LGPL=GNU Lesser General Public License, BSD=BSD License)

Other C++ Parsers. Here are some alternatives to Elsa. The main advantages I would claim for Elsa are ease of adding language extensions, a common activity in a research setting, and reasonably complete support for the C++ language, especially templates. (Note that "instantiating" templates is a necessary part of parsing code that uses templates.)

GLR Parser generators. These are all of the available systems I know of that use the GLR algorithm.

Parser generators based on algorithms other than GLR but which support multiple tokens of lookahead (or some other way of handling nondeterministic grammars):

Parser generators that use deterministic algorithms (e.g. LALR(1) or LL(1)). Since there are many such generators available, I attempt here only to list those that are popular, actively maintained, or otherwise relevant to Elkhound or the compiler community at large. Also, I'm only listing those with C language interfaces (more or less arbitrary decision, to cut down on the size).

Collections of tools and/or links:

Acknowledgements

This work is based upon work supported in part by the National Science Foundation under Grants No. CCR-9875171, CCR-0085949, CCR-0081588, CCR-0085949, CCR-0326577 and CCR-0234689; by the National Aeronautics and Space Administration under Grant No. NNA04CI57A; and gifts from Microsoft Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other sponsors.

This support has been made available through, and supplemented by, the University of California, Berkeley, the Open Source Quality Project and my research advisor and collaborator, George Necula. Without this support, this work would not have been possible.

Valid HTML 4.01!