CSCE 531 Lexical Analysis Programming Assignment – 1

$30.00

Category:

Description

5/5 - (3 votes)

Programming assignments will direct you to design and build a compiler for Extensions to the language
Core, which appears in the Programming Language Landscapes text, by Ledgard and Marcotty. Each
assignment will cover one component of the compiler: lexical analysis, parsing, semantic analysis, and
code generation. Each assignment will ultimately result in a working compiler phase which can interface
with other phases.
A grammar for Core2(a slightly modified version of Core) is given below:
program ::= PROGRAM
declaration…
BEGINT
statement…
ENDT ;
declaration ::= type : IDENTIFIER [, IDENTIFIER]… ;
type ::= INT
statement ::= assignment-statement
| output-statement
| input-statement
assignment-statement ::= IDENTIFIER := expression ;
output-statement ::= INPUT IDENTIFIER [, IDENTIFIER]… ;
input-statement ::= OUTPUT IDENTIFIER [, IDENTIFIER]… ;
expression ::= …
1
You will be implementing the projects in C using Flex and Bison. If you would like to try you may try using
C++ with these tools.
For this assignment, you are to write a lexical analyzer, using the lexical analyzer generator Flex. You will
describe the set of tokens for Core with regular expressions. Flex the lexical analyzer generator will generate
the actual code for recognizing tokens in Core programs. This is essentially an implementation of the
Thompson construction for converting regular expressions into an NFA followed by the subset construction
to convert the NFA to a DFA. There is also a little minimimazation of the DFA also.
Online documentation for all the tools needed for the project will be made available on the course website.
This includes manuals for Flex (used in this assignment), the documentation for bison (used in the next
assignment), as well as the manual for the spim simulator. You may work either individually or in pairs for
this assignment.
2 Introduction to Lex and Flex
Lex allows you to implement a lexical analyzer by writing rules that match on user-defined regular expressions and performing a specified actions when each pattern is matched. Flex is just a “Fast version of Lex”
and you should be using this new version. Flex compiles your regular expression and action code to a lexical
analyzer that is named yylex(). lexer. Flex) to C source code implementing a Finite automaton recognizing
the regular expressions that you specify in your rule File. Fortunately, it is not necessary to understand or
even look at the automatically generated (and often very messy) File implementing your rules. Rule Files
in Flex are structured as follows: The appropriate input file format is :
%{
Declarations
%}
Definitions
%%
Rules
%%
User subroutines
The Declarations and User subroutines sections are optional and allow you to write declarations and helper
functions in C. The Definitions section is also optional, but often very useful as definitions allow you to give
names to regular expressions. For example, the definition
DIGIT [0-9]
allows you to define a digit. Here, DIGIT is the name given to the regular expression matching any single
character between 0 and 9.
The most important part of your lexical analyzer is the Rules section. A rule in Flex specifies an action to
perform if the input matches the regular expression or definition at the beginning of the rule. The action to
perform is specified by writing regular C source code. For example, assuming that a digit represents a token
in our language. Note that this is not the case in Core2), the rule:
2
DIGIT core2 yylval.symbol = inttable.add string(yytext);
return DIGIT TOKEN;
records the value of the digit in the global variable core2 yylval and returns the appropriate token code.
(See Section 5 for a more detailed discussion of the global variable core2 yylval and see Section 4.2 for a
discussion of the inttable used in the above code fragment.) An important point to remember is that if the
current input (i.e., the result of the function call to yylex()) matches multiple rules, Flex picks the rule that
matches the largest number of characters. For instance, if you define the following two rules
}
[0-9]+ {// action 1} \\
[0-9a-z]+ {// action 2} \\
and if the character sequence 2a appears next in the file being scanned, then action 2 will be performed
since the second rule matches more characters than the first rule. If multiple rules match the same number
of characters, then the rule appearing first in the file is chosen. When writing rules in Flex, it may be
necessary to perform different actions depending on previously encountered tokens. For example, when
processing a closing comment token, you might be interested in knowing whether an opening comment was
previously encountered. One obvious way to track state is to declare global variables in your declaration
section, which are set to true when certain tokens of interest are encountered. Flex also provides syntactic
sugar for achieving similar functionality by using state declarations such as:
%Start COMMENT
which can be set to true by writing BEGIN(COMMENT). To perform an action only if an opening comment
was previously encountered, you can predicate your rule on COMMENT using the syntax:
// the rest of your rule …
There is also a special default state called INITIAL which is active unless you explicitly indicate the beginning of a new state. You might find this syntax useful for various aspects of this assignment, such as error
reporting.
3 Files and Directories
To get started, log in to one of the lab machines. Then there are examples in the directory /class/csce531.
.
— SimpleYacc
|– expr.y
3
|– Makefile
|– Makefile4
|– make.rules
|– SaveMakefile
|– simple0.l
|– simple0.y
|– simple1.y
|– simple3.l
|– simple3.y
|– simple4.l
|– simple4.y
|– simple.y
|– Table
|– kr.symtab.c
|– kr.symtab.h
|– SampleInput
|– strsave.c
|– strtok.c
|– strtokTable.c
2 directories, 19 files
3.1 Table
This directory contains routines for a symbol table (kr.symtab.c) from the C Book by Kernighan and Ricthie.
There is also a couple of programs illustrating the library function “strtok.” The files strtokTable.c uses the
symbol table routines in kr.symtab.c. To build it use the command:
gcc -c kr.symtab.c // this produces kr.symbtab.o gcc strtokTable.c kr.symtab.o -o symtab
Then one can run it on the sampleInput with
./symtab ¡ SampleInput
3.2 SimpleYacc
This directory contains sample expression lexical analyzer – parser pairs. We have discussed these in class.
Simple3 is the version you should base your efforts on. The Makefile is set up to work with simple3 and
there is:
1. simple3.l – a flex specification file
2. simple3.y – a grammar specification file
4
3. Makefile
4. simple3 – the target executable
3.3 Error Handling
All errors should be passed along to the parser. You lexer should not print anything. Errors are communicated to the parser by returning a special error token called ERROR. (Note, you should ignore the token
called error [in lowercase] for this assignment; it is used by the parser in PA3.) There are several requirements for reporting and recovering from lexical errors:
When an invalid character (one that can’t begin any token) is encountered, a string containing just that character should be returned as the error string. Resume lexing at the following character.
If a string contains an unescaped newline, report that error as ”Unterminated string constant” and resume
lexing at the beginning of the next line we assume the programmer simply forgot the close-quote.
When a string is too long, report the error as ”String constant too long” in the error string in the ERROR
token. If the string contains invalid characters (i.e., the null character), report this as ”String contains null
character”. In either case, lexing should resume after the end of the string. The end of the string is defined
as either:
1. the beginning of the next line if an unescaped newline occurs after these errors happen; or
2. after the closing ” otherwise.
If a comment remains open when EOF is encountered, report this error with the message ”EOF in comment”.
Do not tokenize the comment’s contents simply because the terminator is missing. Similarly for strings, if
an EOF is encountered before the close-quote, report this error as ”EOF in string constant”.
If you see outside a comment, report this error as ”Unmatched *)”, rather than tokenizing it as * and ).
Recall from lecture that this phase of the compiler only catches a very limited class of errors. Do not check
for errors that are not lexing errors in this assignment. For example, you should not check if variables are
declared before use. Be sure you understand fully what errors the lexing phase of a compiler does and does
not check for before you start.
3.4 Symbol Table
Programs tend to have many occurrences of the same lexeme. For example, an identifier is generally referred
to more than once in a program. To save space and time, a common compiler practice is to store lexemes in
a symbol table. We provide a string table implementation; see the following sections for the details. There
is an issue in deciding how to handle the special identifiers for the basic classes (Object, Int, Bool, String),
SELF TYPE, and self. However, this issue doesn’t actually come up until later phases of the compiler the
scanner should treat the special identifiers exactly like any other identifier.
5
3.5 Strings
Your scanner should convert escape characters in string constants to their correct values. For example, if the
programmer types the characters: ”a b
n c d” (spaces are just separators here) your scanner would return the token STR CONST whose semantic
value is these 5 characters: a b
n c d where
n represents the literal ASCII character for newline. However, the sequence of two characters 0 is allowed
but should be converted to the one character ’0’.
3.6 Other Notes
Your scanner should maintain the variable curr lineno that indicates which line in the source text is currently
being scanned. This feature will aid the parser in printing useful error messages. Finally, note that if the
lexical specification is incomplete (some input has no regular expression that matches), then the scanner
generated by Flex does undesirable things. Make sure your specification is complete.
4 Implementation Notes
Each call on the scanner returns the next token and lexeme from the input. The value returned by the
function core2 yylex is an integer code representing the syntactic category (e.g., integer literal, semicolon,
if keyword, etc.). The codes for all tokens are defined in the file core2-parse.h. The second component,
the semantic value or lexeme, is placed in the global union core2 yylval, which is of type YYSTYPE. The
type YYSTYPE is also defined in core2-parse.h. The tokens for single character symbols (e.g., …) are
represented just by the integer (ASCII) value of the character itself.
For class identifiers, object identifiers, integers, and strings, the semantic value should be a Symbol stored
in the field core2 yylval.symbol.
We provide you with a symbol table implementation, from the C-book by Kernighan and Ricthie.
When a lexical error is encountered, the routine core2 yylex should return the token ERROR. The semantic
value is the string representing the error message, which is stored in the field core2 yylval.error msg (note
that this field is an ordinary string, not a symbol). See the previous section for information on what to put
in error messages.
6