CS 270

float
or
double
types. You will learn how to:
When you have completed this assignment, you will understand how floating
point values are stored in the computer, and how to perform several operations
in the case where the underlying hardware/software does not provide floating
point support. For example, the LC3
computer you will use later
in this course has no floating point support.
First, read the Getting Started section below and then study the
documentation for iFloat.h
in the Files tab to understand
the details of the assignment.
Save Target As..
.FLOAT.C.LINUX.tar
(if you're compiling on Linux)  includes corrected testFloat.c
FLOAT.C.OSX.tar
(corrected Mac version)  includes corrected testFloat.c
Corrected testFloat.c
(both Linux and Mac)You can also download this file from a terminal using the
wget
command (make sure you are currently in the P2
folder):
wget https://www.cs.colostate.edu/~cs270/CurrentSemester/assignments/P2/FLOAT.C.LINUX.tar
Unpack the tar using the following command:
tar xvf FLOAT.C.LINUX.tar
This will unpack the files into a subdirectory src
.
src
folder from the previous step. make
You should see the following output:
gcc g std=c11 Wall c DDEBUG DHALF Debug.c
gcc g std=c11 Wall c DDEBUG DHALF iFloat.c
gcc g std=c11 Wall c DDEBUG DHALF testFloat.c
gcc o testFloat g std=c11 Wall Debug.o iFloat.o testFloat.o convert.a
./testFloat
and read how to run the
the program. There are 9 functions which you must implement. The testFloat
program will call
one of your functions depending on the arguments that you pass. For example, if you type
./testFloat sign 1.25
, your floatGetSign(...)
function will be called with 1.25
(represented as a 16bit integer on which you can perform bitwise operations). This allows you to test one
function at a time.
./testFloat bin 3.625
and you should see the output:
dec: 15552 hex: 0xFFFFC340 bin: 1100001101000000
What you are seeing is the IEEE bit pattern of the floating point value
3.625
expressed as a decimal integer, as hex, and as binary. The decimal integer and hex representations have been
expanded to 32 bits, so you should only pay attention to the lower 16 bits since we're working in half precision
(0xC340 instead of 0xFFFFC340).
You now have a functioning program. All the commands (abs, all, add, etc.) do something. However,
only bin
will produce correct results at this point.
iFloat.c
,
study the documentation found in the Files tab. You will be especially
interested in the documentation for iFloat.h
. Plan what you need
to do before writing code.
The best way to complete the code is to follow a write/compile/test sequence. Do not attempt to write everything at once. Rather choose one function and do the following steps.
iFloat.c
using your
favorite editor.iFloat.c
using
make
. You will find it convenient to work with both a
terminal and editor window at the same time.You should work on your functions in the following suggested order. A sample solution prepared by the author contained the shown approximate line counts (including empty lines and debugging statements):
floatGetSign()
 1 line of codefloatGetExp()
 1 line of codefloatGetVal()
 1 line of codefloatGetAll()
 3 lines of codefloatAbs()
 1 line of codefloatNegate()
 1 lines of codefloatLeftMost1()
 10 lines of codefloatAdd()
 76 lines of codefloatSub()
 1 line of codeThe binary and operation (&) will be used to extract the value of a bit and to set a bit to 0. This relies on the fact that any bit anded with 1 results in the original bit. Also, any bit anded with 0 results in 0. The binary or operation () is used to set a bit to 1. This relies on the the fact that any bit ored with 1 results in a 1.
You will create masks. A mask is a bit pattern that contains 0's and 1's in appropriate places so that when the binary and/or operation is performed, the result has extracted/modified the bits of interest. For example, suppose we want to extract bits 2, 3, and 4 from the 32bit binary number 1001 0001 1010 1100 0011 1001 1110 1101 and clear all other bits. Remember that bit 0 is the rightmost bit. We can achieve this by anding this number with the mask below:
value: 1001 0001 1010 1100 0011 1001 1111 0101 & (AND)
mask: 0000 0000 0000 0000 0000 0000 0001 1100

result: 0000 0000 0000 0000 0000 0000 0001 0100 = 0x14
Notice I designed this mask by placing ones in the bits I am interested in extracting while placing zeroes elsewhere. In C, this could be written as
int value = 0x91AC39F5;
int mask = 0x1C;
int result = value & mask;
printf("0x%X\n", result); // This will print 0x14
Now, suppose we want to clear bits 5, 6, and 7 from the same number while leaving the rest of the bits untouched. We can achieve this by anding this number with the mask below:
value: 1001 0001 1010 1100 0011 1001 1111 0101 & (AND)
mask: 1111 1111 1111 1111 1111 1111 0001 1111

result: 1001 0001 1010 1100 0011 1001 0001 0101 = 0x91AC2915
Notice I designed the mask by placing zeroes in the bits I am interested in clearing while placing ones elsewhere. In C, this could be written as
int value = 0x91AC39F5;
int mask = 0xFFFFFF1F;
int result = value & mask;
printf("0x%X\n", result); // This will print 0x91AC3915
Now, suppose we want to set bits 28, 29, and 30 in the same number while leaving the rest of the bits untouched. We can achieve this by oring this number with the mask below:
value: 1001 0001 1010 1100 0011 1001 1111 0101  (OR)
mask: 0111 0000 0000 0000 0000 0000 0000 0000

result: 1111 0001 1010 1100 0011 1001 1111 0101 = 0xF1AC39F5
Notice I designed the mask by placing ones in the bits I am interested in setting while placing zeroes elsewhere. In C, this could be written as
int value = 0x91AC39F5;
int mask = 0x70000000;
int result = value  mask;
printf("0x%X\n", result); // This will print 0xF1AC39F5
Creating masks is an art that is mastered through practice. Sometimes, masks can be hardcoded (as above). Other times, masks depend on other variables. For example, you may not know in advance which bits you need to modify/extract. In these cases, you will need to create masks by using other operations such as negation, subtraction, and shifts (this is where creativity and hard work comes in).
In this program, you will need to create masks to extract the sign/exponent/mantissa fields out of a IEEE754formatted number and use shift operations to convert them to values you can use. When you have computed the answer, you will use shift and bitwise operations to reassemble the parts into the correct format.
In 16bit precision, a floating point number has the following format:
For example, suppose we want to represent 3.625 as a 16bit IEEE floating point number. First, we obtain the fixed point representation:
3.625_{10} = 11.101_{2}
Then, we normalize the number:
11.101_{2} = 1.1101_{2} * 2^{1}
From this, we can deduce the following:
Putting all three fields together, we get:
1 10000 1101000000 = 1100 0011 0100 0000
You may have noticed that all of the functions that you must complete accept
arguments of type iFloat_t
. In this program, the
iFloat_t
type is simply an alias of the builtin short
type which can represent a 16bit integer number. For example, suppose you run
your program using:
./testFloat sign 125.25
Think of the iFloat_t
argument in the floatGetSign
function as containing the 16bit representation of 125.25. In other words, the
translation from decimal to IEEE has already been done for you. The fact that
125.25 is stored as a 16bit integer number means that you can perform bitwise
operations on it in order to extract the sign information.
floatAdd()
is the only complex function in
this assignment. Many of the things you need to do can be done by calling the
support methods you have already written and thoroughly tested.
The general algorithm for floating point addition is as follows:
At this point, these steps may seem a little abstract, so let's go through a very detailed example. Suppose you run your program using:
./testFloat add 125.75 63.125
Remember that the conversion to IEEE format is done for you. So, at entry
into the floatAdd
function, we know that:
x = 125.75 = 1101 0111 1101 1100
y = 63.125 = 0101 0011 1110 0100
Step 1: Extract the exponent and value
First let's extract the exponent of x:
EXP_{x} = 0000 0000 0001 0101
Now, the value of x. First, we get the mantissa with the implicit 1 made explicit:
0000 0111 1101 1100
However, we notice that x is a negative number because the sign bit in the IEEE representation is 1. Hence, we have to adjust the previous number by taking its 2's complement:
VAL_{x} = 1111 1000 0010 0100
We do the same with y. However, since y is positive, we don't need to get the 2's complement:
EXP_{y} = 0000 0000 0001 0100
VAL_{y} = 0000 0111 1110 0100
Step 2: Equalize the exponents
First, let's motivate this step. Recall addition in scientific notation. For example, suppose you wanted to add the following two numbers:
1.230000 * 10^{12} +
5.457000 * 10^{15}
We cannot add 1.230000 and 5.457000 directly because the exponents are
different. We need to make the exponents equal without changing the numbers. To
do this, we will take the number with the smaller exponent
(1.230000 * 10^{12}
) and rewrite it differently:
0.001230 * 10^{15} +
5.457000 * 10^{15}
I incremented the exponent by 3 to make it 15. I also shifted the value (1.230000) to the right that many times (3) in order to compensate for the increase in the exponent (remember that we want our numbers to stay the same). Shifting the value to the right is actually equivalent to moving the point to the left. You may need to work through a few scientific notation examples to convince yourself of this modification. Now we're in a position to add the numbers. The result will have the common exponent:
0.001230 * 10^{15} +
5.457000 * 10^{15}

5.458230 * 10^{15}
Addition in floating point works the same because the IEEE representation is just scientific notation for binary numbers. Recall from the previous step:
EXP_{x} = 0000 0000 0001 0101 (this is 21 in decimal)
EXP_{y} = 0000 0000 0001 0100 (this is 20 in decimal)
In this case, y has the smaller exponent. Hence, we need to shift the value of y to the right by 1 (because the difference in exponents is 21  20 = 1). The new value of y will be:
VAL_{y} = 0000 0011 1111 0010
And the common exponent (which I will call the exponent of the result) is 21:
EXP_{r} = 0000 0000 0001 0101
Note: in this example, we had to shift the positive value to the right. However, in a different scenario, we may have to shift a negative value to the right. In this case, we must ensure that the sign of the number is preserved. Thankfully, when right shifting a signed number, GCC performs an arithmetic right shift which means that the leading bit is replicated as the number is shifted (as opposed to simply padding with 0's). Hence, we don't have to do anything extra as long as we are using a signed type (like iFloat_t) to store the value.
Step 3: Integer addition
Since the exponents have been equalized, we can add the values just like we did with decimal scientific notation in order to obtain the value of the result. We will add the values using integer addition (this is how you would do it in decimal scientific notation as well: you would leave the point in the same position that it appears in the operands and use normal integer arithmetic):
VAL_{x} = 1111 1000 0010 0100 (from step 1) +
VAL_{y} = 0000 0011 1111 0010 (from step 2)

VAL_{r} = 1111 1100 0001 0110
Step 4: Convert the two's complement back to signmagnitude
After performing integer addition in the previous step, we notice that the
result is negative (because the leading bit is 1). This makes sense because
adding 125.75 and 63.125 should produce a negative number. We will remember the
fact that our final answer will be negative and we will take the 2's complement
of VAL_{r}
in the previous step to make it positive (we'll
call this the magnitude of the result):
SIGN_{r} = 1
MAG_{r} = 0000 0011 1110 1010
Note that taking the 2's complement is only necessary if the result of the addition is negative. If it's positive, then we know that the sign of the final answer is 0 and the magnitude of the result is just the result of the addition.
Step 5: Normalize the result
Let's first look at the information that we have available regarding our result:
SIGN_{r} = 1 (from step 4)
EXP_{r} = 0000 0000 0001 0101 (from step 2)
MAG_{r} = 0000 0011 1110 1010 (from step 4)
The only problem is that the magnitude is not normalized. Imagine that there is a binary point between bits 9 and 10:
MAG_{r} = 0000 00.11 1110 1010
For the magnitude to be normalized, the leftmost 1 must be to the left of the binary point, i.e., on bit position 10 (at this moment, the leftmost 1 is on bit 9). This is analogous to the following situation in decimal scientific notation:
0.456000 * 10^{3}
In order to normalize this number, we need to shift the value to the left (which is equivalent to moving the point to the right) and decrement the exponent to compensate. That way, the leftmost nonzero digit is to the left of the decimal point:
4.560000 * 10^{2}
You may have to do a few examples to convince yourself that this modification works. We will do the same in our binary case. In order to get the leftmost 1 to the left of the binary point, we just have to shift the magnitude by 1 to the left and decrement the exponent by that same amount. Hence, the new exponent and magnitude are:
EXP_{r} = 0000 0000 0001 0100
MAG_{r} = 0000 0111 1101 0100
Note: in this example, we had to shift to the left. In a different example you may have to shift to the right or not shift at all. You also need to calculate how many times you must shift.
Step 6: Reassemble all the components of the result into a 16bit value
We're finally ready to assemble the final result. We have the following components:
SIGN_{r} = 1 (from step 4)
EXP_{r} = 0000 0000 0001 0100 (from step 5)
MAG_{r} = 0000 0111 1101 0100 (from step 5)
We already know the layout of a 16bit IEEE floating point number:
MAG_{r}
because this is the implicit 1. Then, we can combine
the components to get our final answer:
ANSWER = 1 10100 1111010100 = 1101 0011 1101 0100
You can verify that this corresponds to 62.625.
If you want to waste many hours trying to figure out what is wrong with your code, feel free to write the entire addition function all at once. If you want to make better use of your time, follow this advice: develop incrementally. This works quite well for the addition function because it's made up of many small steps. You should only move on to the next step when you're confident that what you have so far works. This is not done by calling your TA and asking them if what you have is right (this would be doing a disservice to yourself). This is done by testing your code. For example, suppose that you have written some code for step 1. Before you start on step 2, print the results of step 1 and run a few examples to see if the output is what you expect. If you don't know what to expect, you shouldn't be coding yet. You need to make sure that you can understand the problem that we're trying to solve in this program.
In order to make debugging easier, Fritz Sieker has provided a function
called debug()
. You can use this function like a printf. The
difference is that the debug()
function will print stuff only if
you run your program with the debug
option. This way, if you don't
want to see the debugging output, you don't have to remove the calls to
debug()
. You can simply remove the debug
option from
the command line. Here's an example of how to use the provided
debug()
and getBinary()
functions to print the binary
representation of an argument in the addition function:
debug("x = %s", getBinary(x));
Here's a run of our solution with the operands shown in the example above and with debugging enabled:
./testFloat debug add 125.75 63.125
DEBUG iFloat.c[61] floatAdd() 1101011111011100: bits of X (IEEE 754)
DEBUG iFloat.c[62] floatAdd() 0101001111100100: bits of Y (IEEE 754)
DEBUG iFloat.c[72] floatAdd() STEP 1
DEBUG iFloat.c[73] floatAdd() X sign: 1 exp: 21 val: 1111100000100100
DEBUG iFloat.c[74] floatAdd() Y sign: 0 exp: 20 val: 0000011111100100
DEBUG iFloat.c[88] floatAdd() Value of Y shifted by 1
DEBUG iFloat.c[93] floatAdd() STEP 2
DEBUG iFloat.c[94] floatAdd() 1111100000100100: Value of X after equalizing exponents
DEBUG iFloat.c[95] floatAdd() 0000001111110010: Value of Y after equalizing exponents
DEBUG iFloat.c[101] floatAdd() STEP 3
DEBUG iFloat.c[102] floatAdd() 1111110000010110: Value of result after addition
DEBUG iFloat.c[112] floatAdd() STEP 4
DEBUG iFloat.c[113] floatAdd() 0000001111101010: Magnitude of result after converting from 2's complement to signmagnitude
DEBUG iFloat.c[126] floatAdd() STEP 5
DEBUG iFloat.c[127] floatAdd() 0000000000010100: Exponent of result after normalizing
DEBUG iFloat.c[128] floatAdd() 0000011111010100: Magnitude of result after normalizing
DEBUG iFloat.c[134] floatAdd() STEP 6
DEBUG iFloat.c[135] floatAdd() 1101001111010100: Answer (IEEE 754)
125.75 + 63.125 > 62.625000
Note that the autograder will run your program without the
debug
option, so it doesn't matter if you follow the format above
for debugging. However, if you choose to use printf statements instead, then you
have to make sure you remove them before you submit your final solution.
Otherwise, your grade will be affected. Here's the output of our solution with
debugging disabled:
./testFloat add 125.75 63.125
125.75 + 63.125 > 62.625000
We will not consider negative zero (0.0) to be a valid value. This would
correspond to bit 15 equal to 1 and the rest of the bits equal to 0. Therefore,
we will not pass this value to your functions.
We also expect your
floatAbs
, floatNegate
, floatAdd
, and
floatSub
functions not to return a negative zero.
The only IEEE754 special case we expect you to handle is 0.0 (when the bit pattern is all zeroes). For example, your code should be able to add/subtract any number and 0.0. It should also return 0.0 correctly when the result of the addition/subtraction is 0.0. You should also negate 0.0 correctly (the negation of 0.0 is not 0.0!).
None of your functions should print
anything when your program is run without the debug
option. You
should only see the output from testFloat.c
. If you print extra
stuff, the autograder won't be able to match your output against the expected
output.
Finally, you may not make use of the functions declared in the convert.h
file.
iFloat.c
. You can either submit it through the website or
use the checkin
program:
~cs270/bin/checkin P2 iFloat.c
Relax, you are done with your assignment!