DISCLAIMER
I will warn you right now that I do not have an extensive knowledge of Bison. I’m still learning and the grammar.y source file will evolve over the course of this instructional as new syntax is added.
GNU Bison
More commonly referred to as simply “Bison” is a part of the GNU Project. It is a parser generator that reads a context free language, warns the developer about ambiguities within the language parsing, then generates a parser based on the supplied specifications.
components of a bison file
- C/C++ Declarations: This section contains C macro definitions, function prototypes, and variables to be used within the Grammar Rules section. The code here is copied into the beginning of the parser file to ensure they are placed before the yyparse() function. If unneeded the surrounding “%{” and “%}” syntax can be left out.
- Bison Declarations: This section contains declarations which define both terminal and nonterminal symbols as well as specifying details such as precedence and more. Bison declarations may not be needed in some simple grammars.
- Grammar Rules: The Grammar Rules section contains one or more Bison grammar rules and includes nothing else. “%%” indicates the start of the Grammar Rules section and must be present event if you omit the C declarations and Bison declarations, which would make it the first thing in the file.
- Additional C/C++ Code: The code in this section is copied verbatim to the end of the parser file the same way the C/C++ Declarations section is copied to the beginning of the parser. This is the ideal location for code which you want to include in the parser, but doesn’t need to be defined before yyparse(). Two examples are yylex() and yyerror(). If this section is empty the “%%” separating it from the Grammar Rules section is unnecessary.
Symbols
Symbols serve as grammarical categorizations of the language.
Terminal Symbols or Token Types signify a type of syntactically equivalent tokens. The symbols in the grammar rules mean that the class is allowed.
Nonterminal Symbols are named symbols representing syntactically equivalent grouping of symbols used in writing grammar rules.
Writing Terminal Symbols in the Grammar
- Named tokens: these are identified with an identifier which must be defined with a Bison declaration like %token.
- Character tokens: These are character literals as in C (e.g. ‘*’ or “%”). These do not need to be declared. Escape sequences for character literals used in C can be used in Bison. NOTE: Do not use the C null (i.e. ‘\0’) character as a character literal because it is an ASCII code that yylex returns for end of input.
- String tokens: These are string literals as in C (e.g. “void” or “return”. It is unneccessary to declare a string literal unless you need to specify its semantic value data type, associativity, precedence. By using the %token declaration, an alias can be associated with the string literal otherwise the lexical analyzer must retrieve the token number for the literal string token from the yytname table.
Syntax of Bison Grammar Rules
The general format of is as such.
result: components {action}
result is the nonterminal symbol that is described by this rule and components are various terminal and nonterminal symbols put together by the rule making a sentence.
EXAMPLE: PARAMS
Creating a function parameter list for something like a function’s definition a moderate effort for something so seeming trivial.
string name, integer age, integer weight, float bmi The tokens and symbols in order would like this
STRINGTYPE NAME COMMA INTEGERTYPE NAME COMMA INTEGERTYPE NAME COMMA FLOATTYPE FLOAT NAME
Creating that rule would work, but only for a list of four variables in the pattern of string, integer, integer, float.
Let’s start from the beginning.
name: NAME { /* The block is called an action and the code within the block will be executed each time the TOKEN, NAME, is recognized.*/ } typeint: TYPEINT { /*Process this token when the keyword "integer" is parsed*/ } typefloat: TYPEFLOAT { /*After the keyword "float" is parsed pass it to the grammar engine*/ } typevoid: TYPEVOID { /*To pass to the grammar engine a pointer to an object of the type specified in the %define macro in this case a subclass of GrammarTree*/ } typestring: TYPESTRING { /*(For PlayPen) Passing the parsed information looks like this: */ $$=new GrammarTreeTypeVoid();} }
The preceding rules are for type specifiers. These are activated when keywords like string, integer, or void are encountered. In the case of NAME, the token is recognized as the identifiers of symbols that act as the names of functions or variables. Once identified by bison the adjacent code block is executed. Currently we aren’t executing any code, but that will come soon.
It would become extremely tedious to have to type out a sentence for every type that could be used in a rule. In other words, when we are ready to make a rule for variable declaration (e.g. string address; or integer windspeed;)it will require that a sentence allow for every variable name be preceded by type specifier. We currently have three so that isn’t so be bad, but what about when add bool and char. Worse, what about when variable declarations are a component of other rules. The combinations add up quickly.
To avoid this problem, you can group symbols under a more general and unifying symbol. In the case of type specifiers, I create symbol entitled “vartype” to use in a sentence that can use any variable type in its declaration and defination.
vartype: typeint { $$=$1 } | typefloat { $$=$1 } | typevoid { $$=$1 } | typestring { $$=$1 }
Multiple rules for the same result can be created by writing multiple sentences separated by the pipe (‘|’) character.
Now all rules where any given type can be specified in the code, the component vartype can be used.
Also note: In the action section for each of the rules it simply reads “$$=$1” meaning the result ($$) is the first component ($1). Since the “TYPE” symbols have already been defined, they can now be referenced or returned by the nth-component convention.
nth-component: a one-indexed sequence of components where $n is the nth component in the sequence.
The symbol “name” remains unchanged so we have enough to declare a variable. Variable declarations will be need elsewhere and more than once in the params so let’s define it now.
vardec: vartype name { }
There is also a need for another definition that functions more as a helper component than as a independently useful component. We’ll call this one vardecs, it will represent a comma-separated list of vardec components.
vardecs: vardecs COMMA vardec { } | %empty { }
If you are new to bison, this may not seem intuitive to you. I’ve learned that it helps to read recursive rules right to left. The right most component is the ending component, and the components before it can repeat as long as one of their definitions is (or ends with) the same ending component. The %empty directive means the right most component is capable of matching an empty string(“”);
params: vardecs { } | typevoid { }
vardecs was needed because there was no way to write params without allowing for multiple voids to be added to the parameter list.
I referenced most of the material for this page from http://dinosaur.compilertools.net/bison/bison_6.html#SEC35 and the rest from https://www.gnu.org/software/bison/manual/html_node/index.html For a more complete and better understanding of Bison, I would reference one of those websites.
To distiguish one GrammarTree type from another I provide an enumerated value as the first argument of the constructor. Every other implementation I’ve seen uses a base class like GrammarTree and derives a subclass for each result. The decision of which techique to use depends on whether you prefer managing a potentially long list to switch on or managing a potentially large number of subclasses.
Practical Application: PlayPen
/** The MIT License (MIT) Copyright © 2021 John L. Mooney Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ %code requires { #include "../includes/GrammarTree.hpp" } %{ #include <math.h> #include <stdio.h> #include <stdlib.h> #include <memory> #include "../includes/GrammarTree.hpp" extern std::unique_ptr<playpen::grammartree> playpen::tree; int yylex (void); extern char* yytext; void yyerror (char const*); %} %define api.value.type {playpen::GrammarTree*} %token COLON SEMICOLON COMMA RIGHTARROW LEFTBRACE RIGHTBRACE EQUAL LEFTPAREN RIGHTPAREN STRING INT FLOAT NAME TYPEINT TYPEFLOAT TYPENAME TYPEVOID TYPESTRING TYPEFUNCTION VOID RETURN %start input %% input: function_definition function_definitions { } function_prototype: TYPEFUNCTION name COLON LEFTPAREN params RIGHTPAREN RIGHTARROW vartype { } function_definition: function_prototype block { } function_definitions: function_definitions function_definition { } | %empty { } statements: statements statement { } | %empty { } statement: vardec SEMICOLON { } block: LEFTBRACE statements RIGHTBRACE { } vartype: typeint { } | typefloat { } | typevoid { } | typename { } | typestring { } name: NAME { } typeint: TYPEINT { } typefloat: TYPEFLOAT { } typevoid: TYPEVOID { } typestring: TYPESTRING { } typename: TYPENAME { } vardec: vartype name { } vardecs: vardec { } | vardecs COMMA vardec { } params: vardecs { } | typevoid { } %% void yyerror(char const* x) { printf("%s\n", yytext); printf("Error: %s\n", x); //exit(0); }