Essentials

Due:
Key: Use the key ASM for checkin

Acknowledgement

This assignment is patterned after this assignment by Milo Martin when he was at the University of Pennsylvania. Used with permission.

Goals of Assignment

In this assignment, you will complete an assembler for the LC3 assembly language by completing the file assembler.c. You will reuse code that you wrote in previous assignments, add new code and integrate with code provided to you. Some of that code is C source code. Other parts are just a library. You are given header files so that you know what functionality is provided and can call functions in the library even though you do not have the source code. This assignment serves several purposes:

Learn how to translate an assembly language to binary code
Learn how to do simple file I/O in C
Learn to decompose functionality into smaller pieces
Practice working with structures and pointers
Practice working on a larger project
Practice integrating your work into existing code

Overview of an assembler

An assembler is a program which translates assembly language statement to code. It must translate ADD R1,R2,R3 to the hex code 1283. It is like a compiler except that the language it deals with is much simpler than a high level language like Java or C. Several things make assembly language easier to deal with:

Every statement is on a single line of source code
There is at most one statement per source line
The syntax is very regular and is quite simple.

For the LC3 assembly language the syntax is:


    [optional label] opcode operand(s) [; optional end of line comment]

The assembler reads the source code a line at a time, analyzes each line and produces the output file(s) required to run the program.

an object file containing the code (.obj, .hex)
a symbol table file (.sym)

Because the assembly code may contain references to labels that have not yet been encountered (e.g. a branch to a location later in the code), the assembler normally makes two passes over the "code".

The first pass of the assembler must, at a minimum, do two things:

Verify that each line hast the correct syntax. This involves determining the LC3 opcode (if any) and that the operands are correct in number and type for that opcode. If a line is empty or contains only a comment, simply continue with the following line. For an actual code lines, you must determine how much space this instruction will take. Most instructions only take one word and in those cases, the address is simply incremented. Other pseudo-ops may update the address differently.
Whenever a line contains a label, insert it and its address into the symbol table. This is required so that the PCoffset for the LD/ST/LDI/STI/BR/JSR/LEA opcodes can be computed in the second pass.

Additionally, the first pass may choose to store results from the syntax analysis for use by the second pass. This will make the second pass easier. It requires building a list of information about each instruction.

Alternatively, you can skip storing this information and only create the symbol table. Then, in the second pass, the source file is re-read and syntatic nalysis performed again. At this point, there are no syntaic errors, because they would have been found in the first pass. This approach requires reading the source file twice.

The second pass of the assembler is responsible for generating the object code for the .asm file. The actual work depends on how the first pass was structured. It may:

scan a data structure created with the first pass, or
re-read the source file and reprocess each line.

In either case, it must generate the LC3 word(s) that are required for each instruction. This involves creating the correct 16-bit bit pattern(s) that defines the instruction. When an LD/ST/LDI/STI/BR/JSR/LEA instruction is encountered, the code needs to compute the PCoffset, determine if it is in range and insert it into the bit pattern. Offsets out of range and references to undefined labels are reported and are the only errors generated by during the second pass.

Getting Started

This assignment depends on the files symbol.c, util.c you wrote in previous assignments. You may need to fix any remaining bugs in those files to complete this assignment.

Create a directory for this assignment and cd there
Download the file appropriate for your OS
- lc3asm.linux.tar
- lc3asm.osx.tar
Unpack the lc3asm.XXX.tar file. WARNING: this tar ball will spew files into the current directory. It does not unpack into a subdirectory.
```
    tar -xvpf lc3asm.XXX.tar
```
Replace the files symbol.c, util.c by your versions.
Review the Makefile and make sure the variable GCC is appropriate for your C compiler.
Build the excutable by typing make. There should not be any errors or warnings.

You are now ready to begin implementation. However, before writing ANY code, you should make yourself a map of all the components in this project, what functionality each provides, and how they relate to one another. In doing this you are creating a model of the project. Modeling is something you will formally study when you take your Software Engineering course cs314.

Implementing `asm_init(), asm_term()`

Figure out what global variables (if any) and modules need to be initialized. Add code to do so. As you work on your assembler, you may need to add more code to this function. Similarly, figure out what must be cleaned up, and add code to asm_term().

Implementing `asm_pass_one()`, phase 1

Study the documentation for this routine. You may translate this outline directly into code. When you encounter a label, add it to the symbol table with an address of 0. For initial testing, you might skip steps 5.2 and 5.4. Simply save the source line in the field reference using the function strdup(). Each time you read a line that contains code, create a lin_info_t, link it into the list and print it using asm_print_line_info() if the variable printPass1 is non-zero. It is set to a zon-zero value by running the assembler with the -pass1 flag. To test your code, make the assembler and run it with a small assembly file(s). The name of your assembler is mylc3as. What you should get is two things:

a symbol table file (.sym) with a header, and symbols. The addresses will all be 0.
a list of the lines of the source file (with line numbers). The only lines that you should see are those that contain a valid opcode. This list will be printed if you run mylc3as -pass1

To understand how to convert a line into tokens, make testTokens, and run this program. Study the source of testTokens.c to see how to do it. Understand what happens with blank lines and lines that contain only comments.

You can see from step 5.3 that this assembler is building a data structure that will be re-used in the second pass. This will let you practice your C dynamic memory management skills.

Implementing `asm_pass_one()`, phase 2

You are now ready to add a bit more. Specifically, you will be verifying the syntax of each line. Make the executable seeLC3 and run it. Try different LC3 opcodes and see what the program prints. Then study the code in seeLC3.c to understand how it works. Use of the code and ideas presented in this file is optional. You may find it easier to write your own code for verifying the syntax.

You now have a model of how to determine the type(s) of operands expected by any LC3 instruction. Note that many LC3 instructions have common operands. For example, many instructions require one of more registers. Therefore, it may be appropriate to write a helper method that converts a token to a register, and reports an error if the token is not a register. Similarly, if may be useful to have a helper method that gets immediate values. Immediates are used in a variety of instructions. Those instructions differ in the number of bits used to store the value, and some values are signed, while others are unsigned. A helper method can take care of all these cases.

Add code to collect and store each operand into the fields of line_info_t. Create very short assembly language files to test your code. These files will often have .ORIG, .END and a single additional LC3 instruction. Although you may be tempted to start with ADD/AND instructions, they are actually a little tricky. This is because you do not know from the name whether or not the third operand will be an immediate or a register. Only when you encounter the third operand will you know which form the writer used. Compare this to JMP, RET which use two different "names" for the two forms.

Implementing `asm_pass_one()`, phase 3

Next you will add code to keep track of the address of each instruction. Most instructions take only a single word of memory. The instructions .BLKW/.STRINGZ are the exceptions. Also, .ORIG is handled a little differently. At any rate, your code should now put the address of each label in the symbol table. The symbol table file that you create can be compared to that produced by the regular LC3 assembler.

Implementing `asm_pass_one()`, phase 4

Now add code for error checking. Depending on how you write your helper routines, you may already have completed a substantial portion of the error checking. Create multiple simple error case files and test them with your assembler.

Implementing `asm_pass_two()`

You should first writte the basic loop to traverse that data structure returned by asm_pass_one(). Simply print the source line. Once you have verified that your basic loop is correct, your code will now need to generate the machine code for each instruction. If you have completed the first pass correctly, then each line_info_t contains the information you need to generate the machine code.

The first step is to copy the prototype from the format field of the information on this line into a variable that will contain the final machine code. The next step is to insert the operand(s) into the correct locations of the machine code. For example, if the line was an ADD with an immediate value, then the fields DR, SR1, immediate are inserted into the appropriate bit locations of the macine code. Different instructions will have different number of operands. When the machine code is complete, write it to the object file using lc3_write_LC3_word(). When you encounter an instruction that uses a PCoffset, you will need to look up the address of the reference in the symbol table and compute the offset.

Note that the .BLKW, .STRINGZ opcodes may generate multiple words in the object file. If you are ever unsure what should be put in the output file, run ~cs270/lc3tools/lc3as -hex file.asm and look at the hex code that it generates (file file.hex).

Extending the assembler (optional)

An assembler is a big step up from machine code. When using an assembler, the programmer can write ADD R1,R2,R3 instead of 1283. As you know, LC-3 assembly language is very limited. In order to make LC-3 programming slightly simpler, we will introduce several pseudo instructions. Pseudo instructions are instructions that are recognized by the assembler but don't actually exist in the machine. For example, LC-3 does not actually support a HALT instruction, but we've been putting them in our assembly code. The assembler recognizes that HALT is not a real instruction and generates a TRAP x25. Note that pseudo instructions do not give programmers any additional power, because anything that can be done with a pseudo instruction can be done with 1 or more regular instructions. They just make programming easier.

Two existing psuedo instructions that generate multiple LC3 instructions are .BLKW and .STRINGZ. You will add a couple of others that will make programming a little simpler.

There are several things that you must do to extend the assembler.

Add a new entry to the opcode_t defined in lc3.h. This adds an "opcode" for the pseudo instructrion.
Add a new entry to lc3_instructions[] defined in lc3.c. This defines the number and types of operands for the pseudo instruction.
Add a new entry in lc3_instruction_map[] defined in util.c. This associates the "name" of the pseudo instruction with its "opcode".
Add code to asm_pass_one() to analyze the operands and determine how many LC3 instructions this pseudo instruction will produce. This may not even require any additional code.
Add code to asm_pass_two() to generate the LC3 instructions correponding to the pseudo instruction and add them to the object file.

The `.SETR` pseudo instruction

The .SETR pseudo instruction allows the programmer to initialize a register with an immediate value. Thus, one can write .SETR R1,#100. The end result is that register R1 contains the value 100 and the condition code is set according to the value. The following table shows the two cases.

small value (.SETR DR,#5) large value (.SETR DR,#1000)


AND DR,DR,#0  ; clear the register
ADD DR,DR,val ; put val in register


          ; * represents current address
BR *+2    ; skip over value (code 0x0E01)
xxxx      ; encode large val here
LD DR,*-1 ; load from memory (prototype 0X21FE)
          ; encode DR in bits 11..9 of prototype

When the value is "small", you can take advantage of the immediate field in a ADD instruction and use only two LC3 instructions. If the immediate exceeds to size allowed in ADD, three instructions will be required.

The `.SUB` pseudo instruction

.SUB is just like ADD except that it performs subtraction. Like ADD, the third operand can be a register or an immediate. For reasons that will become clear soon, if the third operand is an immediate, it can only have a value from decimal -15 to 16 (versus -16 to 15 for ADD). If the third operand is an immediate, just use an ADD with the value negated.

When the third operand is a register, one can generate multiple instructions. Consider this example.

using `.SUB`	without using `.SUB`
`.SUB DR,SR1,SR2`	`NOT SR2,SR2 ; invert bits (part of two's comp) ADD DR,SR1,SR2 ; DR = SR1 + ~SR2 NOT SR2,SR2 ; restore original SR2 val ADD DR,DR,#1 ; two's comp - DR = SR1 + ~SR2 + 1`

Pretty neat how we performed the subtraction without an additional register, eh? But... There's a problem: if the first and third registers are the same register (.SUB R1,R2,R1), because the second NOT instruction will corrupt the result of the subtraction. In this case, don't emit the second NOT. And there's another problem: if the second and third registers are the same (.SUB R1,R2,R2); when you negate the third register you will corrupt the value of the second register (which is the same as the third). In this case, rather than the 4-instruction sequence above, we'll simply emit an instruction that puts 0 in the DR register (AND DR,DR,#0).
Milo Martin

Checking in Your Code

You will submit the single file XXX.tar using the checkin program. Use the key ASM. Or use the checkin tab of the course web page.

Essentials

Acknowledgement

Goals of Assignment

Overview of an assembler

Getting Started

Implementing asm_init(), asm_term()

Implementing asm_pass_one(), phase 1

Implementing asm_pass_one(), phase 2

Implementing asm_pass_one(), phase 3

Implementing asm_pass_one(), phase 4

Implementing asm_pass_two()