Tuesday, February 15, 2011

PRINCIPLES OF COMPILER DESIGN - UNIT 1

UNIT - I LEXICAL ANALYSIS

Introduction to Compiling: Compilers - Analysis of the source program - The phases - Cousins - The grouping of phases - Compiler construction tools. The role of the lexical analyzer - Input buffering - Specification of tokens - Recognition of tokens - A language for specifying lexical analyzers.

Introduction

Compiler

  • A compiler is a program that can read a program in one language (the source language) and translate it into an equivalent program in another language (the target language).
  • An important role of the compiler is to report any errors in the source program that it detects during the translation process.
Language Processing System
  • In addition to a compiler, several other programs may be required to create an executable target program.

    • A source program may be divided into modules stored in separate files. The task of collecting the source program is sometimes entrusted to a separate program, called a preprocessor.
    • The preprocessor may also expand shorthands, called macros, into source language statements.
    • The modified source program is then fed to a compiler. The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and is easier to debug.
    • The assembly language is then processed by a program called an assembler that produces relocatable machine code as its output.
    • Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together with other relocatable object files and library files into the code that actually runs on the machine.
    • The linker resolves external memory addresses, where the code in one file may refer to a location in another file.
    • The loader then puts together all of the executable object files into memory for execution.

    Cousins of the Compiler

    The preprocessor, assembler, linker, and loader are referred to as the cousins of the compiler.

    Analysis – Synthesis Model of Compilation

    • The process of compilation has two parts, namely analysis and synthesis.
    • Analysis: the analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program.
    • Synthesis: the synthesis part constructs the desired target program from the intermediate representation.
    • The analysis part is often called the front end of the compiler; the synthesis part is the back end.

    Phases of the Compiler

    There are six phases of the compiler, namely:

    1. Lexical Analysis
    2. Syntax Analysis
    3. Semantic Analysis
    4. Intermediate Code Generation
    5. Code Optimization
    6. Code Generation

    Lexical Analysis

    • The first phase of a compiler is called lexical analysis, linear analysis, or scanning.
    • The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.
    • For each lexeme, the lexical analyzer produces as output a token of the form

    (token-name, attribute-value)

    that it passes on to the subsequent phase, syntax analysis.

    • For example, suppose a source program contains the assignment statement

    position := initial + rate * 60

    • The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax analyzer:

    1. The identifier position

    2. The assignment symbol :=

    3. The identifier initial

    4. The plus sign

    5. The identifier rate

    6. The multiplication sign

    7. The number 60

    • The blanks separating the characters are eliminated during lexical analysis.
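
    Writing (id, k) for an identifier token whose attribute points to the k-th symbol-table entry (the numbering 1, 2, 3 here is purely illustrative), the statement is passed to the parser as the token stream:

    (id,1) (:=) (id,2) (+) (id,3) (*) (num,60)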

    Syntax Analysis

    • The second phase of the compiler is syntax analysis, also called hierarchical analysis or parsing.
    • The tokens from the lexical analyzer are grouped hierarchically into nested collections with collective meaning.
    • This is represented using a parse tree. For example, for the assignment statement

    position := initial + rate * 60

    the parse tree is as follows:
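
                  :=
                /    \
        position      +
                    /   \
             initial     *
                       /   \
                   rate     60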

    Semantic Analysis

    • The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.
    • It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.
    • An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands.
    • For example, a binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point numbers. If the operator is applied to a floating-point number and an integer, the compiler may convert the integer into a floating-point number.

    • For the above syntax tree, taking all the identifiers to be real values, type checking inserts a conversion for the integer 60, and we get:
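
                  :=
                /    \
        position      +
                    /   \
             initial     *
                       /    \
                   rate    inttoreal
                               |
                               60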

    Intermediate Code Generation

    • The intermediate representation should have two important properties:
        • it should be easy to produce and
        • it should be easy to translate into the target machine.
    • We consider an intermediate form called three-address code, which consists of a sequence of assembly-like instructions with three operands per instruction.
    • Properties of three-address instructions:

    1. Each three-address assignment instruction has at most one operator on the right side.

    2. The compiler must generate a temporary name to hold the value computed by a three-address instruction.

    3. Some three-address instructions may have fewer than three operands.
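
    For the running example, the intermediate-code generator might emit the following three-address sequence, where t1, t2, and t3 are compiler-generated temporary names:

    t1 := inttoreal(60)
    t2 := rate * t1
    t3 := initial + t2
    position := t3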

    Code Optimization

    • The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result.
    • There is a great variation in the amount of code optimization different compilers perform. Those that do the most are called "optimizing compilers," and a significant amount of compilation time may be spent in this phase.
    • There are simple optimizations that significantly improve the running time of the target program without slowing down compilation too much.
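
    For the running example, the optimizer can deduce that the conversion of 60 from integer to real can be done once and for all at compile time, and that t3 is used only once, so the three-address code shrinks to:

    t1 := rate * 60.0
    position := initial + t1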

    Code Generation

    • The code generator takes as input an intermediate representation of the source program and maps it into the target language.
    • If the target language is machine code, registers or memory locations are selected for each of the variables used by the program.
    • Then, the intermediate instructions are translated into sequences of machine instructions that perform the same task.
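
    For the running example, the final target code might look as follows (an illustrative two-register machine; the F in each instruction indicates floating-point operands, and # marks an immediate constant):

    MOVF rate, R2
    MULF #60.0, R2
    MOVF initial, R1
    ADDF R2, R1
    MOVF R1, position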

    Symbol-Table Management

    • An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name.
    • These attributes may provide information about the storage allocated for a name, its type, its scope (where in the program its value may be used), and in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument (for example, by value or by reference), and the type returned.
    • The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name.
    • The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly.
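
    As a minimal sketch of such a data structure, assuming a chained hash table keyed by the lexeme (the record fields and names below are illustrative, not a prescribed layout):

    #include <string.h>
    #include <stdlib.h>

    #define TABLE_SIZE 211                /* a small prime for the hash table */

    struct symbol {
        char          *name;              /* the lexeme itself */
        int            type;              /* e.g. a code for int, real, ... */
        int            scope;             /* nesting depth of the declaration */
        struct symbol *next;              /* chaining for hash collisions */
    };

    static struct symbol *table[TABLE_SIZE];

    static unsigned hash(const char *s)   /* simple multiplicative string hash */
    {
        unsigned h = 0;
        while (*s)
            h = h * 31 + (unsigned char)*s++;
        return h % TABLE_SIZE;
    }

    /* Return the record for name, inserting a fresh one if it is absent,
       so the compiler can find, store, and retrieve attributes quickly. */
    struct symbol *lookup_or_insert(const char *name)
    {
        unsigned i = hash(name);
        for (struct symbol *q = table[i]; q != NULL; q = q->next)
            if (strcmp(q->name, name) == 0)
                return q;
        struct symbol *p = malloc(sizeof *p);
        p->name  = strdup(name);
        p->type  = 0;
        p->scope = 0;
        p->next  = table[i];
        table[i] = p;
        return p;
    }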

    Error Detection and Reporting

    • Each phase can encounter errors.
    • After detecting an error , a phase must be able to recover from the error so that compilation can proceed and allow further errors to be detected.
    • A compiler which stops after detecting the first error is not useful.

    Grouping Of Phases

    Front and Back Ends:

      • The phases are collected into a front end and a back end.

    Front End:

      • consists of those phases or parts of phases that depend primarily on the source language and are largely independent of the target machine.
      • Lexical and syntactic analysis, the symbol table, semantic analysis, and the generation of intermediate code are included.
      • Certain amount of code optimization can be done by the front end.
      • It also includes error handling that goes along with each of these phases.

    Back End:

      • Includes those portions of the compiler that depend on the target machine; generally, these portions do not depend on the source language.
      • Here we find aspects of the code-optimization phase, code generation, and the necessary error handling and symbol-table operations.

    Passes:

      • Several phases of compilation are usually implemented in a single pass consisting of reading an input file and writing an output file.
      • It is common for several phases to be grouped into one pass, and for the activity of these phases to be interleaved during the pass.
      • Eg: Lexical analysis, syntax analysis, semantic analysis and intermediate code generation might be grouped into one pass. If so, the token stream after lexical analysis may be translated directly into intermediate code.

    Reducing the number of passes:

      • It is desirable to have relatively few passes, since it takes time to read and write intermediate files.
      • If we group several phases into one pass, we may be forced to keep the entire program in memory, because one phase may need information in a different order than a previous phase produces it.
      • The internal form of the program may be considerably larger than either the source program or the target program, so this space may not be a trivial matter.

    Compiler-Construction Tools

    • Some commonly used compiler-construction tools include

    1. Parser generator

    2. Scanner generator

    3. Syntax-directed translation engine

    4. Automatic code generator

    5. Data flow engine

    1. Parser generators

    - produce syntax analyzers from input that is based on a context-free grammar.

    - Earlier, syntax analysis consumed a large fraction of the running time of a compiler and a large fraction of the intellectual effort of writing a compiler.

    - This phase is now considered one of the easiest to implement.

    - Many parser generators utilize powerful parsing algorithms that are too complex to be carried out by hand.

    2. Scanner generators

    - automatically generate lexical analyzers from a specification based on regular expressions.

    - The basic organization of the resulting lexical analyzer is a finite automaton.

    3. Syntax-directed translation engines

    - produce collections of routines that walk a parse tree, generating intermediate code.

    - The basic idea is that one or more “translations” are associated with each node of the parse tree.

    - Each translation is defined in terms of translations at its neighbor nodes in the tree.

    4. Automatic code generators

    - take a collection of rules that define the translation of each operation of the intermediate language into the machine language for a target machine.

    - The rules must include sufficient detail to handle the different possible access methods for data.

    5. Data-flow engines

    - facilitate the gathering of information about how values are transmitted from one part of a program to each other part.

    - Data-flow analysis is a key part of code optimization.

    The Role of the Lexical Analyzer

    • The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program.
    • The stream of tokens is sent to the parser for syntax analysis.
    • When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.


    Interactions between the lexical analyzer and the parser

    • Other functions of the lexical analyzer:

    1. Stripping out comments and whitespace (blank, newline, and tab characters that are used to separate tokens in the input).

    2. Another task is correlating error messages generated by the compiler with the source program. For instance, the lexical analyzer may keep track of the number of newline characters seen, so it can associate a line number with each error message.

    • Issues in Lexical Analysis

    1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks.

    2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.

    3. Compiler portability is enhanced.

    Tokens, Patterns, and Lexemes

    Token

    A token is a pair consisting of a token name and an optional attribute value.

    The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes.

    Pattern

    A pattern is a description of the form that the lexemes of a token may take.

    In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

    Lexeme

    A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
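
    For example, in a typical programming language:

    Token       Informal description of the pattern       Sample lexemes
    if          the characters i, f                       if
    id          letter followed by letters and digits     pi, score, D2
    number      any numeric constant                      3.14159, 0, 6.02e23
    comparison  < or > or <= or >= or == or !=            <=, !=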



    Lexical Errors

    • It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For instance, in the following C statement

    fi ( a == f(x) ) ...

    a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle the error due to transposition of the letters.

    • However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is "panic mode" recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left.
    • Other possible error-recovery actions are:

    1. Delete one character from the remaining input.

    2. Insert a missing character into the remaining input.

    3. Replace a character by another character.

    4. Transpose two adjacent characters.

    Input Buffering

    • We often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme.
    • Hence we introduce a two-buffer scheme that handles large lookaheads safely.
    • We then consider an improvement involving "sentinels" that saves time checking for the ends of buffers.

    Buffer Pairs

    • Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
    • Using one system read command we can read N characters into a buffer.
    • If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file.
    • Two pointers to the input are maintained:

    1. Pointer lexeme_beginning marks the beginning of the current lexeme, whose extent we are attempting to determine.

    2. Pointer forward scans ahead until a pattern match is found.

    • Once the next lexeme is determined, forward is set to the character at its right end.
    • Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexeme_beginning is set to the character immediately after the lexeme just found.
    • Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer.

    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else forward := forward + 1;

    Code to advance the forward pointer

    Sentinels

    • For each character read, we make two tests: one for the end of the buffer, and one to determine what character is read.
    • We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
    • The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
    • Note that eof retains its use as a marker for the end of the entire input.
    • Any eof that appears other than at the end of a buffer means that the input is at an end.

    forward := forward + 1;
    if forward↑ = eof then begin
        if forward at end of first half then begin
            reload second half;
            forward := forward + 1
        end
        else if forward at end of second half then begin
            reload first half;
            move forward to beginning of first half
        end
        else /* eof within a buffer signifying end of input */
            terminate lexical analysis
    end

    Lookahead code with sentinels
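
    The same idea in C, as a minimal sketch: two adjacent N-byte halves, each followed by a sentinel byte (here '\0', which assumes the source contains no NUL characters; fill_buffer and the buffer layout are illustrative):

    #include <stdio.h>

    #define N 4096                    /* buffer size, e.g. one disk block */
    #define SENTINEL '\0'             /* stands in for the eof marker */

    static char  buf[2 * N + 2];      /* two halves, each ending in a sentinel */
    static char *forward = buf;

    /* Read up to N bytes into the half starting at p, then plant the
       sentinel immediately after the last valid character. */
    static void fill_buffer(char *p, FILE *in)
    {
        size_t n = fread(p, 1, N, in);
        p[n] = SENTINEL;
    }

    /* Return the next input character, reloading a half whenever its
       trailing sentinel is reached; a sentinel anywhere else means the
       input really is exhausted. */
    int advance(FILE *in)
    {
        char c = *forward++;
        if (c != SENTINEL)
            return (unsigned char)c;          /* common path: one test only */
        if (forward == buf + N + 1) {         /* sentinel ending first half */
            fill_buffer(buf + N + 1, in);     /* reload second half */
            return advance(in);
        }
        if (forward == buf + 2 * N + 2) {     /* sentinel ending second half */
            fill_buffer(buf, in);             /* reload first half */
            forward = buf;
            return advance(in);
        }
        return EOF;                           /* eof within a buffer: input done */
    }

    Before the first call, fill_buffer(buf, in) must have loaded the first half. As in the pseudocode above, only one end-of-buffer test is paid per character on the common path.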

    Specification of Tokens

    • Regular expressions are an important notation for specifying lexeme patterns.

    Strings and Languages

    • An alphabet is a finite set of symbols.
    • A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
    • A language is any countable set of strings over some fixed alphabet.
    • In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a string s, usually written |s|, is the number of occurrences of symbols in s.
    • For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.

    Terms for Parts of Strings

    The following string-related terms are commonly used:

    1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s.

    For example, ban, banana, and ε are prefixes of banana.

    2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s.

    For example, nana, banana, and ε are suffixes of banana.

    3. A substring of s is obtained by deleting any prefix and any suffix from s.

    For example, banana, nan, and ε are substrings of banana.

    4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are neither ε nor equal to s itself.

    5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s.

    For example, baan is a subsequence of banana.

    Regular Expressions

    • Each regular expression r denotes a language L(r).
    • Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote.
    • ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the empty string.
    • If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.
    • Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.

    1. (r)|(s) is a regular expression denoting the language L(r) U L(s).

    2. (r)(s) is a regular expression denoting the language L(r)L(s).

    3. (r)* is a regular expression denoting (L(r))*.

    4. (r) is a regular expression denoting L(r).

    • The unary operator * has highest precedence and is left associative.
    • Concatenation has second highest precedence and is left associative.
    • | has lowest precedence and is left associative.
    • A language that can be defined by a regular expression is called a regular set.
    • If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s.

    For instance, (a|b) = (b|a).

    • There are a number of algebraic laws for regular expressions. With r, s, and t denoting arbitrary regular expressions:

    r | s = s | r                              | is commutative
    r | (s | t) = (r | s) | t                  | is associative
    (rs)t = r(st)                              concatenation is associative
    r(s | t) = rs | rt ; (s | t)r = sr | tr    concatenation distributes over |
    εr = r ; rε = r                            ε is the identity for concatenation
    r* = (r | ε)*                              ε is guaranteed in a closure
    r** = r*                                   * is idempotent

    Algebraic laws for regular expressions

    Regular Definitions

    Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:

    d1 → r1
    d2 → r2
    ………
    dn → rn

    where:

    1. Each di is a distinct name.

    2. Each ri is a regular expression over the alphabet Σ U {d1, d2, ..., di-1}.

    E.g: Identifiers are the set of strings of letters and digits beginning with a letter. A regular definition for this set is:

    letter → A | B | …. | Z | a | b | …. | z
    digit → 0 | 1 | …. | 9
    id → letter ( letter | digit )*
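
    For example, under this definition x, count1, and A9b are all lexemes for the token id, while 9pins is not, since it begins with a digit.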

    Notational Shorthands

    Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.

    1. One or more instances (+)

    - The unary postfix operator + means “one or more instances of”.

    - If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+.

    - Thus the regular expression a+ denotes the set of all strings of one or more a’s.

    - The operator + has the same precedence and associativity as the operator *.

    2. Zero or one instance (?)

    - The unary postfix operator ? means “zero or one instance of”.

    - The notation r? is a shorthand for r | ε.

    - If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) U { ε }.

    3. Character Classes

    - The notation [abc] where a, b and c are alphabet symbols denotes the regular expression a | b | c.

    - A character class such as [a-z] denotes the regular expression a | b | c | … | z.

    - Identifiers can be described as strings generated by the regular expression

    [A-Za-z][A-Za-z0-9]*

    4. Regular Set

    - A language denoted by a regular expression is said to be a regular set.

    5. Non-regular Set

    - A language which cannot be described by any regular expression.

    E.g: The set of all strings of balanced parentheses, and the set of repeating strings, cannot be described by a regular expression. Such sets can be specified by a context-free grammar.

    Transition Diagrams

    • As an intermediate step in the construction of a lexical analyzer, we first convert patterns into stylized flowcharts, called "transition diagrams."
    • Transition diagrams have a collection of nodes or circles, called states.
    • Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
    • Edges are directed from one state of the transition diagram to another.
    • Each edge is labeled by a symbol or set of symbols.

    Some important conventions about transition diagrams are:

    1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found. We always indicate an accepting state by a double circle, and if there is an action to be taken — typically returning a token and an attribute value to the parser — we shall attach that action to the accepting state.

    2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state.

    3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start," entering from nowhere.

    4. The transition diagram always begins in the start state before any input symbols have been read.
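
    As a sketch of how such a diagram is turned into code, the following C routine simulates the transition diagram for identifiers (the token codes, the pointer-passing style, and the state numbering are illustrative):

    #include <ctype.h>

    enum { ID, ERROR };               /* illustrative token codes */

    /* Simulate the identifier diagram:
       state 0: the start state; a letter moves to state 1.
       state 1: loop on letters and digits; any other character leads to
                the accepting state, with the * convention telling us to
                retract forward one position. */
    int get_id_token(char **forward)
    {
        int state = 0;
        for (;;) {
            char c = *(*forward)++;
            switch (state) {
            case 0:
                if (isalpha((unsigned char)c))
                    state = 1;        /* first character is a letter */
                else
                    return ERROR;     /* this diagram fails to match */
                break;
            case 1:
                if (!isalnum((unsigned char)c)) {
                    (*forward)--;     /* retract: c is not part of the lexeme */
                    return ID;        /* the double-circled accepting state */
                }
                break;
            }
        }
    }

    The caller hands in the forward pointer positioned at the first unread character; on success the lexeme lies between the saved lexeme_beginning and the returned position of forward.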

