PLAYPEN Chapter 2: Flex

Description

Flex is an free and open-source lexical analyzer generator that is frequently used in place of Lex. It produces scanners or lexers that is often used with Bison or Yacc parsers.

File Structure

The web pages I find describing the structure of a Lex/Flex file all say there are three sections in the file separated solely by the string: “%%”

options
%{
user code that is inserted at the begining of the generated C/C++ file
%}

definitions
%%
rules
%%
user code

options
%{
user code that is inserted at the beginning of the generated C/C++ file
%}

barebones lex file
[code liine=”1″ lang=”cpp” collapse=”false”] %option noyywrap
%{
#include “../includes/grammar.tab.hpp”

using namespace std;
%}

float /* regex for floating point integer */
int /* regex for integer */
whitespace /* regex for space, tab, carriage return, etc */
linefeed /* regex for new line */
name /* regex for the pattern of character sequences allowed for identifiers */
string /* regex for doublely or singularly quoted strings */

%%
{int} { }
{float} { }
{string} { }
“string” { }
“integer” { }
“float” { }
“void” { }
“fun” { }
“return” { }
“:” { }
“;” { }
“{” { }
“}” { }
“(” { }
“)” { }
“->” { }
“,” { }
“=” { }
{name} { }
{linefeed} { }
{whitespace} { }
%%

[/code]

Definition

In this section macros are defined, headers are included and C/C++ code can be written here. C/C++ code written in this section will be copied verbatim into the source code generated by Flex.

%{
	#include <iostream>
%}

Pattern Matching

Before the “rules” section there is what I consider a subsection which I have entitled: “Pattern Matching.” This subsection is optional and is used to associate identifiers with complex patterns expressed as regular expressions. These identifiers are to be used in the “Rules” section.

These are the regular expressions Flex uses to identify as it scans the input. This is important because based on the regular expression (regex) the lexer is able to discern the difference between a integer and a float or a identifier (I called it “name”) and a string.

float				(-?[0-9]+\.[0-9]+)
int					(-?[0-9]+)
whitespace			[\r\t\v\f]+
linefeed			\n
name				[a-zA-Z_]+[a-zA-Z0-9_]*
string				((["])(?:\\?+.)*(["]))|((['])(?:\\?+.)*([']))

Brief Explanation o f the regex

float: an optional hyphen to indicate negativity in front of an integer (One or more numeric digit) followed by a decimal point after which a another integer.

Acceptable:
100.0
1.234
-0.123

Unacceptable:
-abc.123
123
-123
123f

(Note: floats are required to have a decimal with a trailing number, and since there is no double no ‘f’ character is required at the end.)

int: an optional hyphen to indicate negativity in front of one or more single digit numbers (0 – 9). The plus (+) indicates there must be one or more of the preceding pattern.

Acceptable:
123
-123

Unacceptable:
123.0
343.343
-256abc

whitespace: an series of whitespace characters (carrige return, tab, etc.).

linefeed: the new line character.

name: a series of one or more single alpha characters or underscores (_) followed by a series of zero or more single alpha-numeric character. The astris (*) indicates zero or more of the preceding pattern.

Acceptable:
Addition
_Subtraction
Seven_of_9

Unacceptable:
7ofNine
“Addition”

string: I can’t begin to explain this one other then to say that it allows for any quoted (single or double) series of characters to be matched, but some characters such as single quotes or double quotes need to be escaped with a backslash.

Acceptable:
“Seven of Nine”
“7 of 9”
“1234”
“as34#$%^^&”
“\”\””
‘7 of 9’
‘Seven of Nine’
‘as34#$%^^&’
‘\”\”‘

Unacceptable:
“””
Addition
23233
-123

Rules

In this section regular expression patterns are associated with C/C++ statements. As the input into the lexical analyzer is parsed the lexer scans the text and once it sees a given pattern it executes the corresponding C/C++ statements known as “Actions.”

NOTE: In the case of input ambiguity flex chooses in the following manner.
  1. The longest match is preffered
  2. In the case of equal length matches, flex chooses the first rule specified

This is why the float pattern is defined before int. With the number, 123.456, the regex for int matches both the 123 and the 456. In this cases the lexer would object and exit, because period/decimal (.) is not defined anywhere, but if it were the pattern float would never be matched. For this reason float must be defined first.

PATTERNS
Pattern can be a simple string constant (e.g. “;”, “return”, etc.) or it can be the label of a substitution for a regex pattern. Take the label and wrap it in braces (e.g. {float}). Alternately, you can place the regex as the pattern and omit it from the “Pattern Matching” subsection (e.g. (-?[0-9]+\.[0-9]+)). I advise the substitution for the clarity when associating it with it’s action.
ACTIONS
Actions are the instruction to follow with the pattern is matched. In almost every example I have ever seen in the use of using flex for the purpose of making a compiler, the action is to just send a token to the parser to be handled. It’s worth noting that the token must be declared someone. For the most part, my tokens are all declared in the %token section of the bison file (grammar.y). Any C/C++ statements can be issued in the action section. I find it useful to print out a message to file or console to ensure that the pattern is matched and it’s value for debug purposes.

Flex has a few variables and functions which are useful (and in some cases necessary) to properly analyze the input and pass to the parser (this is not an exhaustive list).

  • yytext — when the scanner matches a pattern, the text of the token is stored in the null terminated string yytext.
  • yyleng — the length of the string, yytext.
  • yymore() — instructs the scanner on the next match of a rule the corresponding token (i.e. yytext) should be appended onto the current value of yytext instead of replacing it.
  • yyless(n)’ ignores the first n characters of the current token and returns the remaining characters to the input stream to be scanned again on the next match with the appropriate modification to yytext and yyleng. (i.e., yyleng will now equal n ).
  • yyterminate() — a method to be used inplace of a return statement in an action. It terminates the scanner and returns a 0 to the scanner’s caller, indicating “finished”. This function is called by default upon end of file (EOF). It is a macro that may be redefined.
  • yylex() — the scanner generated by the program flex has the entry point yylex(). It is called to start or resume the scanning of input. If an action in flex escapes the scanner to pass a value to the calling program, the next call to yylex() will resume where it left off.
  • unput(c) — places the character, c, back onto the beginning of the input stream. It will be the next character to be scanned.
  • input/yyinput (in C++) — reads the next character in the stream.

DIRECTIVES

  • ECHO — copies yytext to the lexer’s output
  • BEGIN — followed by the name of a start condition puts the lexer at the respective start condition.
  • REJECT — instructs the lexer to move onto the next best matching rule that matches the input

START CONDITIONS

Flex allows for the conditional activation of rules. Start Conditions are outside the scope of this tutorial. To learn more visit: https://www.cs.fsu.edu/~engelen/doc/reflex/html/index.html#reflex-states.

USER CODE

This is a section of C/C++ code that will be copied verbatim to the generated “lex.yy.c” file. This section is unused in PlayPen, but here is an example of its use.

void yyerror(char *s)
{
   fprintf(stderr, " line %d: %s\n", linenum, s);
}

int yywrap(void)
{
   if ( --num_input_files > 0 )
   {
      set_input_file( *++input_files );
      return 0;
   }
   else
      return 1;
}

/* set_input_file - open the given file (if NULL, stdin) for scanning */
void set_input_file( char *file )
{
   if ( file && strcmp( file, "-" ) )
   {
      infilename = xstrdup(file);
      yyin = fopen( infilename, "r" );

      if ( yyin == NULL )
      lerr( _( "can't open %s" ), file );
   }
   else
   {
      yyin = stdin;
      infilename = xstrdup("<stdin>");
   }

   linenum = 1;
}

PRACTICAL APPLICATION: PLAYPEN

	/**
	The MIT License (MIT)
	Copyright © 2021 John L. Mooney

	Permission is hereby granted, free of charge, to any person obtaining a copy of this software
	and associated documentation files (the “Software”), to deal in the Software without restriction,
	including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
	and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
	subject to the following conditions:

	The above copyright notice and this permission notice shall be included in all copies or substantial
	portions of the Software.

	THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
	TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
	THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
	CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
	DEALINGS IN THE SOFTWARE.
	*/

%option noyywrap
%{
	#include "../includes/grammar.tab.hpp"

	using namespace std;
%}

float					(-?[0-9]+\.[0-9]+)
int						(-?[0-9])[0-9]*
whitespace				[\r\t\v\f]+
linefeed				\n
name					[a-zA-Z_]+[a-zA-Z0-9_]*
string					((["])(?:\\?+.)*(["]))|((['])(?:\\?+.)*([']))

%%
{int}					{ return INT; }
{float}					{ return FLOAT; }
{string}				{ return STRING; }
"string" 				{ return TYPESTRING; }
"integer"				{ return TYPEINT; }
"float"					{ return TYPEFLOAT; }
"void"					{ return TYPEVOID; }
"fun"					{ return TYPEFUNCTION; }
"return"					{ return RETURN; }
":"						{ return COLON; }
";"						{ return SEMICOLON; }
"{"						{ return LEFTBRACE; }
"}"						{ return RIGHTBRACE; }
"("						{ return LEFTPAREN; }
")"						{ return RIGHTPAREN; }
"->"						{ return RIGHTARROW; }
","						{ return COMMA; }
"="						{ return EQUAL; }
{name}					{ return NAME; }
{linefeed}				++yylineno;
{whitespace}
%%

This chapter goes into greater detail than what is used in the PlayPen compiler, and I have only scratched the surface. Flex is a very powerful tool and if you wish to learn more I suggest visiting the following sites:

  • https://www.cs.fsu.edu/~engelen/doc/reflex/html/index.html
  • http://dinosaur.compilertools.net/flex/