

CS253 Assignment 3

Your task in this assignment is to build a compiler for the Simple language that translates Simple programs into SML instructions. For this assignment you only have to worry about a subset of the Simple language, but you do have to handle syntax errors. Details are after the language description. The next part of this document is an in-depth description of the Simple language and how to compile it. I know it is long, but read and understand everything before you start writing code!

1 Simpletron Machine Language

The Simpletron runs programs written in Simpletron Machine Language (SML). All information in the Simpletron is handled in terms of words. A word is a signed four-digit decimal number, such as, +3364, -1293, +0007, -0001, etc. The Simpletron is equipped with a 100-word memory, and these words are referenced by their location numbers 00, 01, ..., 99.

Before running an SML program, we must load or place the program into memory. The first instruction (or statement) of every SML program is always placed in location 00. The simulator will start executing at this location.

Each instruction written in SML occupies one word of Simpletron's memory. (Thus, instructions are signed four-digit decimal numbers.) The sign of an SML instruction is always plus, but the sign of a data word may be plus or minus. Each location in the Simpletron's memory may contain an instruction, a data value used by a program, or an unused area of memory. The first two digits of each SML instruction are the operation code that specifies the operation to be performed. SML operation codes are given below:

Operation Code                      Meaning

Input/output operations:

READ = 10 Read a word from the keyboard into a specific location in memory
WRITE = 11 Write a word from a specific location in memory to the screen

Load/store operations:

LOAD = 20 Load a word from a specific location in memory into the accumulator
STORE = 21 Store a word from the accumulator into a specific location in memory

Arithmetic operations:

ADD = 30 Add a word from a specific location in memory to the word in
         accumulator (leave the result in accumulator).
SUB = 31 Subtract a word from a specific location in memory from the word in
         accumulator (leave the result in accumulator).
DIV = 32 Divide the word in accumulator by a word from a specific location
         in memory (leave the result in accumulator).
MUL = 33 Multiply a word from a specific location in memory by the word in
         accumulator (leave the result in accumulator).

Transfer of control operations:

BRANCH = 40 Branch to a specific location in memory.
BRNNEG = 41 Branch to a specific location in memory if the accumulator 
         is negative. 
BRNZERO = 42 Branch to a specific location in memory if the accumulator
         is zero.
HALT = 43 Halt -- the program has completed its task.

The last two digits of an SML instruction are the operand - the address of the memory location containing the word to which the operation applies.
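As a quick illustration of this split, here is a minimal Python sketch. The function name decode is our own and is not part of the assignment; it simply assumes the word is a non-negative instruction and uses integer division and remainder by 100.

```python
def decode(word):
    """Split a non-negative SML instruction word into (opcode, operand)."""
    opcode = word // 100   # first two digits
    operand = word % 100   # last two digits
    return opcode, operand

print(decode(1007))  # (10, 7): READ into location 07
```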

Given below is a sample SML program that reads two numbers from a keyboard and computes and prints their sum.

Location       Number        Instruction
00             +1007          (Read A)
01             +1008          (Read B)
02             +2007          (Load A)
03             +3008          (Add B)
04             +2109          (Store C)
05             +1109          (Write C)
06             +4300          (Halt)
07             +0000          (Variable A)
08             +0000          (Variable B)
09             +0000          (Result C)
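To see how a program like this executes, the following Python sketch shows one way a simulator loop could look. It is only an illustration under our own assumptions: the function name run is hypothetical, keyboard input is replaced by a list of values, screen output is collected in a list, word overflow is not checked, and DIV and MUL are left out for brevity.

```python
READ, WRITE, LOAD, STORE = 10, 11, 20, 21
ADD, SUB = 30, 31
BRANCH, BRNNEG, BRNZERO, HALT = 40, 41, 42, 43

def run(program, inputs):
    """Execute an SML program; return the list of values it writes."""
    memory = list(program) + [0] * (100 - len(program))
    pending = list(inputs)      # stands in for the keyboard
    outputs = []                # stands in for the screen
    accumulator = 0
    counter = 0                 # execution always starts at location 00
    while True:
        opcode, operand = divmod(memory[counter], 100)
        counter += 1
        if opcode == READ:
            memory[operand] = pending.pop(0)
        elif opcode == WRITE:
            outputs.append(memory[operand])
        elif opcode == LOAD:
            accumulator = memory[operand]
        elif opcode == STORE:
            memory[operand] = accumulator
        elif opcode == ADD:
            accumulator += memory[operand]
        elif opcode == SUB:
            accumulator -= memory[operand]
        elif opcode == BRANCH:
            counter = operand
        elif opcode == BRNNEG:
            if accumulator < 0:
                counter = operand
        elif opcode == BRNZERO:
            if accumulator == 0:
                counter = operand
        elif opcode == HALT:
            return outputs
        else:
            raise ValueError(f"unsupported opcode {opcode} at location {counter - 1:02d}")

# The sample program above, with 3 and 4 typed at the keyboard.
sample = [1007, 1008, 2007, 3008, 2109, 1109, 4300, 0, 0, 0]
print(run(sample, [3, 4]))  # [7]
```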

2 Simple Compiler

The Simple compiler performs two passes of the Simple program to convert it to SML. The first pass constructs a symbol table (object) in which every line number (object), variable name (object) and constant (object) of the Simple program is stored with its type and corresponding location in the final SML code. The first pass also produces the corresponding SML instruction objects for each of the Simple statements. If the Simple program contains statements that transfer control to a line later in the program, the first pass results in an SML program containing some unfinished instructions. The second pass of the compiler locates and completes the unfinished instructions, and outputs the SML program to a file.

First Pass

The compiler begins by reading one statement of the Simple program into memory. Every statement begins with a line number followed by a command. As the compiler breaks a statement into tokens, if the token is a line number, a variable or a constant, it is placed in the symbol table. A line number is placed in the symbol table only if it is the first token in a statement. The symbolTable object is an array of tableEntry objects representing each symbol in the program. There is no restriction on the number of symbols that can appear in a program. Your program should be able to increase or decrease the size of the symbol table.

Each tableEntry object contains three members. Member symbol is an integer containing the ASCII representation of a variable, a line number or a constant. Member type is one of the following characters indicating the symbol's type: 'C' for constant, 'L' for line number and 'V' for variable. Member location contains the Simpletron memory location (00 to 99) to which the symbol refers. Simpletron memory is an array of 100 integers in which SML instructions and data are stored. For a line number, the location is the element in the Simpletron memory array at which the SML instructions for the Simple statement begin. For a variable or a constant, the location is the element in the Simpletron memory array in which the variable or constant is stored. Variables and constants are allocated from the end of Simpletron's memory backwards. The first variable or constant is stored in location 99, the next in location 98, etc.
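The sketch below shows one possible shape for these objects in Python. Only the three members (symbol, type, location) come from the description above; the class and method names (TableEntry, SymbolTable, find, add_line, add_data) and the data_counter attribute are our own assumptions. Note that the lookup compares both symbol and type, since a line number, a constant and the ASCII code of a variable can all be the same integer.

```python
from dataclasses import dataclass

@dataclass
class TableEntry:
    symbol: int      # line number, constant value, or ord() of a variable name
    type: str        # 'C', 'L' or 'V'
    location: int    # Simpletron memory location, 00 to 99

class SymbolTable:
    def __init__(self):
        self.entries = []        # a Python list grows as needed
        self.data_counter = 99   # next free data location, counting down

    def find(self, symbol, type):
        """Return the matching entry, or None if the symbol is not present."""
        for entry in self.entries:
            if entry.symbol == symbol and entry.type == type:
                return entry
        return None

    def add_line(self, line_number, instruction_counter):
        """Record where the SML code for a Simple line begins."""
        entry = TableEntry(line_number, 'L', instruction_counter)
        self.entries.append(entry)
        return entry

    def add_data(self, symbol, type):
        """Return the entry for a variable or constant, allocating it if new."""
        entry = self.find(symbol, type)
        if entry is None:
            entry = TableEntry(symbol, type, self.data_counter)
            self.data_counter -= 1
            self.entries.append(entry)
        return entry

table = SymbolTable()
table.add_line(10, 0)
print(table.add_data(ord('x'), 'V').location)  # 99
print(table.add_data(ord('y'), 'V').location)  # 98
print(table.add_data(ord('x'), 'V').location)  # 99 again, x already exists
```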

An SML instruction is a four-digit integer composed of two parts - the operation code and the operand. The operation code is determined by commands in Simple. For example, the Simple command input corresponds to SML operation code 10 (read), and the Simple command print corresponds to SML operation code 11 (write). The operand is a memory location containing the data on which the operation code performs its task (e.g. operation code 10 reads a value from the keyboard and stores it in the memory location specified by the operand). The compiler searches symbolTable to determine the Simpletron memory location for each symbol so the corresponding location can be used to complete the SML instructions.

The compilation of each Simple statement is based on its command. For example, after the line number in a rem statement is inserted in the symbol table, the remainder of the statement is ignored by the compiler because a remark is for documentation purposes only. The input, print, goto and end statements correspond to the SML read, write, branch and halt instructions. Statements containing these Simple commands are converted directly to SML (note that a goto statement may contain an unresolved reference if the specified line number refers to a statement further into the Simple program file; this is sometimes called a forward reference).

When a goto statement is compiled with an unresolved reference, the SML instruction must be flagged to indicate that the second pass of the compiler must complete the instruction. The flags are stored in a 100-element array flags of type int, in which each element is initialized to -1. If the memory location to which a line number in the Simple program refers is not yet known (that is, it is not in the symbol table), the line number is stored in array flags in the element with the same subscript as the incomplete instruction. The operand of the incomplete instruction is set to 00 temporarily. For example, an unconditional branch instruction (making a forward reference) is left as +4000 until the second pass of the compiler.
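A minimal Python sketch of this flagging step follows. The function name emit_goto is hypothetical, and a plain dictionary from line number to location stands in for the 'L' entries of the symbol table.

```python
BRANCH = 40

def emit_goto(target_line, line_locations, sml, flags, instruction_counter):
    """Emit a branch at instruction_counter; flag it if the target is unknown."""
    if target_line in line_locations:
        sml[instruction_counter] = BRANCH * 100 + line_locations[target_line]
    else:
        sml[instruction_counter] = BRANCH * 100       # operand 00 for now
        flags[instruction_counter] = target_line      # remember the line number
    return instruction_counter + 1                    # next free instruction slot

sml = [0] * 100
flags = [-1] * 100
line_locations = {20: 1}           # assumed: line 20 starts at location 01
nxt = emit_goto(60, line_locations, sml, flags, 3)
print(sml[3], flags[3], nxt)       # 4000 60 4
```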

Compilation of if/goto and let statements is more complicated than other statements - they are the only statements that produce more than one SML instruction. For an if/goto, the compiler produces code to test the condition and to branch to another line if necessary. The result of the branch could be an unresolved reference. Each of the relational and equality operators can be simulated using SML's branch zero and branch negative instructions (or a combination of both). For a let statement, the compiler produces code to evaluate an arbitrarily complex arithmetic expression consisting of integer variables and constants. Expressions should separate each operand and operator with spaces. When the compiler encounters an expression, it converts the expression from infix notation to postfix notation and then evaluates the postfix expression.
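As a rough illustration of how the operators can map onto branch zero and branch negative, consider the Python sketch below. This is one possible mapping, not a required one: the function name compile_condition is our own, the arguments are memory locations that have already been looked up, and != is deliberately not handled because it needs an extra branch around the jump.

```python
LOAD, SUB, BRNNEG, BRNZERO = 20, 31, 41, 42

def compile_condition(left, op, right, target):
    """Return SML words that branch to target when `left op right` holds.

    left, right and target are memory locations (0 to 99).
    """
    if op in ('>', '>='):
        # a > b is the same test as b < a, so swap the operands
        left, right = right, left
        op = '<' if op == '>' else '<='
    code = [LOAD * 100 + left, SUB * 100 + right]   # accumulator = left - right
    if op in ('<', '<='):
        code.append(BRNNEG * 100 + target)          # taken when left < right
    if op in ('==', '<='):
        code.append(BRNZERO * 100 + target)         # taken when left == right
    if len(code) == 2:
        raise ValueError(f"operator {op!r} is not handled in this sketch")
    return code

# y is at 98, x is at 99, and the target is still unresolved (operand 00).
print(compile_condition(98, '==', 99, 0))  # [2098, 3199, 4200]
```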

The postfix evaluation algorithm that you did for assignment 2 must be modified to search the symbol table for each symbol it encounters (and possibly insert it), determine the symbol's corresponding memory location and push the memory location onto the stack (instead of the symbol). When an operator is encountered in the postfix expression, the two memory locations at the top of the stack are popped and machine language for effecting the operation is produced using the memory locations as operands. The result of each subexpression is stored in a temporary location in memory and pushed back onto the stack so the evaluation of the postfix expression can continue. When postfix evaluation is complete, the memory location containing the result is the only location left on the stack. This is popped and SML instructions are generated to assign the result to the variable at the left of the let statement.
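The following Python sketch shows how the modified evaluator might emit code. It is a simplified illustration: the function name compile_let is hypothetical, a dictionary of already-assigned locations stands in for the symbol table, temporaries are taken from a data counter passed in by the caller, and division is left out because its operand order should be checked against your simulator first.

```python
LOAD, STORE = 20, 21
OPCODES = {'+': 30, '-': 31, '*': 33}

def compile_let(target, postfix, locations, data_counter):
    """Return (SML words, updated data counter) for `let target = postfix`.

    locations maps each variable or constant token to its memory location.
    Temporaries are taken from data_counter, counting downwards.
    """
    code = []
    stack = []
    for token in postfix:
        if token in OPCODES:
            right = stack.pop()          # top of stack is the right operand
            left = stack.pop()
            temp = data_counter          # temporary for this subexpression
            data_counter -= 1
            code += [LOAD * 100 + left,
                     OPCODES[token] * 100 + right,
                     STORE * 100 + temp]
            stack.append(temp)
        else:
            stack.append(locations[token])
    result = stack.pop()                 # only one location should remain
    code += [LOAD * 100 + result, STORE * 100 + locations[target]]
    return code, data_counter

locations = {'y': 98, '1': 97}
print(compile_let('y', ['y', '1', '+'], locations, 96))
# ([2098, 3097, 2196, 2096, 2198], 95)
```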

Second Pass

The second pass of the compiler performs two tasks: resolve any unresolved references and output the SML code to a file. Resolution of references occurs as follows:

  1. Search the flags array for an unresolved reference (that is, an element with a value other than -1).
  2. Locate the object in array symbolTable, containing the symbol stored in the flags array (be sure that the type of the symbol is 'L' for line number).
  3. Insert the memory location from member location into the instruction with the unresolved reference (remember that an instruction containing an unresolved reference has operand 00).
  4. Repeat steps 1, 2 and 3 until the end of the flags array is reached.
After the resolution process is complete, the entire array containing the SML code is output to a disk file with one SML instruction per line.
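The resolution steps can be sketched in Python as follows. The function name second_pass is our own, and a dictionary from line number to location again stands in for the 'L' entries of the symbol table. The sketch returns the output lines rather than writing them, so the file handling is left to you.

```python
def second_pass(sml, flags, line_locations):
    """Fill in flagged operands in place, then return the output lines."""
    for index, line_number in enumerate(flags):
        if line_number != -1:
            # The unresolved operand is 00, so adding the location completes it.
            sml[index] += line_locations[line_number]
    # One signed, zero-padded four-digit word per line.
    return [f"{word:+05d}" for word in sml]

sml = [0] * 100
flags = [-1] * 100
sml[3], flags[3] = 4200, 60          # branch zero to line 60, not yet known
lines = second_pass(sml, flags, {60: 15})
print(lines[3])  # +4215
```

A goto to a line number that never appears in the program would raise a KeyError here; a real compiler should report that as an error instead.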

A Complete Example

The following example illustrates a complete conversion of a Simple program to SML as it will be performed by the Simple compiler. Consider a Simple program that inputs an integer and sums the values from 1 to that integer.

The program and the SML instructions produced by the first pass of the Simple compiler are given below.

Simple Program            SML Location       Description
                          and Instruction

5 rem sum 1 to x          none               rem ignored

10 input x                00 +1099           read x into location 99

15 rem check y == x       none               rem ignored

20 if y == x goto 60      01 +2098           load y (98) into accumulator
                          02 +3199           sub x (99) from accumulator
                          03 +4200           branch zero to unresolved location

25 rem increment y        none               rem ignored

30 let y = y + 1          04 +2098           load y into accumulator
                          05 +3097           add 1 (97) to accumulator
                          06 +2196           store in temporary location 96
                          07 +2096           load from temporary location 96
                          08 +2198           store accumulator in y

35 rem add y to total     none               rem ignored

40 let t = t + y          09 +2095           load t (95) into accumulator
                          10 +3098           add y to accumulator
                          11 +2194           store in temporary location 94
                          12 +2094           load from temporary location 94
                          13 +2195           store accumulator in t

45 rem loop y             none               rem ignored

50 goto 20                14 +4001           branch to location 01

55 rem output result      none               rem ignored

60 print t                15 +1195           output t to screen

99 end                    16 +4300           terminate execution

The symbol table constructed by the first pass is shown below:

Symbol    Type    Location
5          L        00
10         L        00
'x'        V        99
15         L        01
20         L        01
'y'        V        98
25         L        04
30         L        04
1          C        97
35         L        09
40         L        09
't'        V        95
45         L        14
50         L        14
55         L        15
60         L        15
99         L        16

Most Simple statements compile directly to SML instructions. The exceptions in this program are remarks, the if/goto statement in line 20 and the let statements. Remarks do not translate into machine language. However, the line number for a remark is placed in the symbol table in case the line number is referenced in a goto statement or an if/goto statement. Line 20 of the program specifies that if the condition y == x is true, program control is transferred to line 60. Because line 60 appears later in the program, the first pass of the compiler has not yet placed 60 in the symbol table (statement line numbers are placed in the symbol table only when they appear as first tokens in a statement). Therefore, it is not possible at this time to determine the operand of the SML branch zero instruction at location 03 in the array of SML instructions. The compiler places 60 in location 03 of the flags array to indicate that the second pass completes this instruction.

It is important to keep track of the next instruction location in the SML array because there is not a one-to-one correspondence between Simple statements and SML instructions. For example, the if/goto statement of line 20 compiles into three SML instructions. Each time an instruction is produced, the instruction counter must be incremented to the next location in the SML array. Note that the size of Simpletron's memory could present a problem for Simple programs with many statements, variables, and constants. It is conceivable that the compiler will run out of memory. To test for this case, your program should contain a data counter to keep track of the location at which the next variable or constant will be stored in the SML array. If the value of the instruction counter is larger than the value of the data counter, the SML array is full. In this case, the compilation process should terminate and the compiler should print an error message indicating that it ran out of memory during compilation.
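The out-of-memory test itself is a single comparison. A small Python sketch, with a hypothetical exception class OutOfMemory and function name check_memory, might look like this; you would call it each time either counter changes.

```python
class OutOfMemory(Exception):
    """Raised when instructions and data collide in the SML array."""

def check_memory(instruction_counter, data_counter):
    # Instructions grow upward from 00 and data grows downward from 99.
    # Once the counters cross, there is no free location left.
    if instruction_counter > data_counter:
        raise OutOfMemory("compiler ran out of Simpletron memory")

check_memory(17, 93)   # counters have not crossed, so nothing happens
```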

A Step-by-Step View of the Compilation Process

Let us now walk through the compilation process for the Simple program given above. The compiler reads the first line of the program, 5 rem sum 1 to x, into memory. The first token in the statement (the line number) is determined. Since the symbol is not found, it is entered into the symbol table. So, 5 is entered in the symbol table as type L and assigned the first location in the SML array (00). Although this line is a remark, a space in the symbol table is still allocated for the line number (in case it is referenced by a goto or an if/goto). No SML instruction is generated for a rem statement, so the instruction counter is not incremented.

The statement 10 input x is tokenized next. The line number 10 is placed in the symbol table as type L and assigned the first location in the SML array (00 because a remark began the program so the instruction counter is currently 00). The command input indicates that the next token is a variable (only a variable can appear in an input statement). Because input corresponds directly to an SML operation code, the compiler has to determine the location of x in the SML array. Symbol x is not found in the symbol table. So, it is inserted into the symbol table as the ASCII representation of x, given type V, and assigned location 99 in the SML array (data storage begins at 99 and is allocated backwards). SML code can now be generated for this statement. Operation code 10 (the SML read operation code) is multiplied by 100, and the location of x (as determined in the symbol table) is added to complete the instruction. The instruction is then stored in the SML array at location 00. The instruction counter is incremented by 1 because a single SML instruction was produced.

The statement 15 rem check y == x is tokenized next. The symbol table is searched for line number 15 (which is not found). The line number is inserted as type L and assigned the next location in the array, 01 (remember that rem statements do not produce code, so the instruction counter is not incremented).

The statement 20 if y == x goto 60 is tokenized next. Line number 20 is inserted in the symbol table and given type L with the next location in the SML array, 01. The command if indicates that a condition is to be evaluated. The variable y is not found in the symbol table, so it is inserted and given the type V and the SML location 98. Next, SML instructions are generated to evaluate the condition. Since there is no direct equivalent in SML for the if/goto, it must be simulated by performing a calculation using x and y and branching based on the result. If y is equal to x, the result of subtracting x from y is zero, so the branch zero instruction can be used with the result of the calculation to simulate the if/goto statement. The first step requires y to be loaded into the accumulator. This produces the instruction 01 +2098. Next, x is subtracted from the accumulator. This produces the instruction 02 +3199. The value in the accumulator may be zero, positive or negative. Since the operator is ==, we want branch zero. First, the symbol table is searched for the branch location (60 in this case), which is not found. So, 60 is placed in the flags array at location 03, and the instruction 03 +4200 is generated (we cannot add the branch location because we have not yet assigned a location to line 60 in the SML array). The instruction counter is incremented to 04.

The compiler proceeds to the statement 25 rem increment y. The line number 25 is inserted in the symbol table as type L and assigned SML location 04. The instruction counter is not incremented.

When the statement 30 let y = y + 1 is tokenized, the line number 30 is inserted in the symbol table as type L and assigned SML location 04. Command let indicates that the line is an assignment statement. First, all the symbols on the line are inserted in the symbol table (if they are not already there). The integer 1 is added to the symbol table as type C and assigned SML location 97. Next, the right side of the assignment is converted from infix to postfix notation. Then the postfix expression (y 1 +) is evaluated. Symbol y is located in the symbol table and its corresponding memory location is pushed onto the stack. Symbol 1 is also located in the symbol table and its corresponding memory location is pushed onto the stack. When the operator + is encountered, the postfix evaluator pops the stack into the right operand of the operator, pops the stack again into the left operand of the operator and then produces the SML instructions

04 +2098 (load y)

05 +3097 (add 1)

The result of the expression is stored in a temporary location in memory (96) with instruction

06 +2196 (store temporary)

and the temporary location is pushed on the stack. Now that the expression has been evaluated, the result must be stored in y (that is, the variable on the left side of =). So, the temporary location is loaded into the accumulator and the accumulator is stored in y with the instructions

07 +2096 (load temporary)

08 +2198 (store y)

When the statement 35 rem add y to total is tokenized, line number 35 is inserted in the symbol table as type L and assigned location 09.

The statement 40 let t = t + y is similar to line 30. The variable t is inserted in the symbol table as type V and assigned SML location 95. The instructions follow the same logic and format as line 30, and the instructions 09 +2095, 10 +3098, 11 +2194, 12 +2094, and 13 +2195 are generated. Note that the result of t + y is assigned to temporary location 94 before being assigned to t (95).

The statement 45 rem loop y is a remark, so line 45 is added to the symbol table as type L and assigned SML location 14.

The statement 50 goto 20 transfers control to line 20. Line number 50 is inserted in the symbol table as type L and assigned SML location 14. The equivalent of goto in SML is an unconditional branch (40) instruction that transfers control to a specific SML location. The compiler searches the symbol table for line 20 and finds that it corresponds to SML location 01. The operation code (40) is multiplied by 100 and the location 01 is added to it to produce the instruction 14 +4001.

The statement 55 rem output result is a remark, so line 55 is inserted in the symbol table as type L and assigned SML location 15.

The statement 60 print t is an output statement. Line number 60 is inserted in the symbol table as type L and assigned SML location 15. The equivalent of print in SML is operation code 11 (write). The location of t is determined from the symbol table and added to the result of the operation code multiplied by 100, producing the instruction 15 +1195.

The statement 99 end is the final line of the program. Line number 99 is stored in the symbol table as type L and assigned SML location 16. The end command produces the SML instruction +4300 (43 is halt in SML), which is written as the final instruction in the SML memory array.

This completes the first pass of the compiler. We now consider the second pass. The flags array is searched for values other than -1. Location 03 contains 60, so the compiler knows that instruction 03 is incomplete. The compiler completes the instruction by searching the symbol table for 60, determining its location and adding the location to the incomplete instruction. In this case, the search determines that line 60 corresponds to SML location 15, so the completed instruction 03 +4215 is produced, replacing 03 +4200. The Simple program has now been compiled successfully.

3 Assignment 3 - Due before noon on March 23, 2011

This assignment is due on March 23, 2011. For this part, your compiler should be able to process Simple programs that do NOT contain goto, if/goto and let statements and generate the appropriate SML instructions. Your program will take two command line arguments: the first is the name of the input file to be compiled, and the second is the name of the SML executable that your program generates.

Requirements

You are required to provide the following:

Output Submission Grading


The translation was initiated by keegan patmore on 2011-03-20

