CS 270
CS270 Programming Assignment P2 - Floating Point Math

About The Assignment

This assignment is designed to teach you how to do several floating point operations in C without using either the float or double types. You will learn how to:

When you have completed this assignment, you will understand how floating point values are stored in the computer, and how to perform several operations in the case where the underlying hardware/software does not provide floating point support. For example, the LC-3 computer you will use later in this course has no floating point support.

First, read the Getting Started section below and then study the documentation for iFloat.h in the Files tab to understand the details of the assignment.

Getting Started

Perform the following steps
  1. Create a directory for this assignment. The name of the directory is arbitrary, but you may find it useful to name it for the assignment (e.g. P2).
  2. Copy the appropriate file into this directory. It is easiest to right click on the link, and do a Save Target As...

    You can also download this file from a terminal using the wget command (make sure you are currently in the P2 folder):

        wget https://www.cs.colostate.edu/~cs270/CurrentSemester/assignments/P2/FLOAT.C.LINUX.tar

    Unpack the tar using the following command:

        tar -xvf FLOAT.C.LINUX.tar

    This will unpack the files into a subdirectory src.

  3. Open a terminal and go to the src folder from the previous step.
  4. In the terminal type the following command to build the executable.
    You should see the following output:
        gcc -g -std=c11 -Wall -c -DDEBUG -DHALF Debug.c
        gcc -g -std=c11 -Wall -c -DDEBUG -DHALF iFloat.c
        gcc -g -std=c11 -Wall -c -DDEBUG -DHALF testFloat.c
        gcc -o testFloat -g -std=c11 -Wall Debug.o iFloat.o testFloat.o convert.a
  5. In the terminal type ./testFloat and read how to run the the program. There are 9 functions which you must implement. The testFloat program will call one of your functions depending on the arguments that you pass. For example, if you type ./testFloat sign -1.25, your floatGetSign(...) function will be called with -1.25 (represented as a 16-bit integer on which you can perform bitwise operations). This allows you to test one function at a time.
  6. In the terminal type ./testFloat bin -3.625 and you should see the output:
        dec: -15552  hex: 0xFFFFC340  bin: 1100-0011-0100-0000
    What you are seeing is the IEEE bit pattern of the floating point value -3.625 expressed as a decimal integer, as hex, and as binary. The decimal integer and hex representations have been expanded to 32 bits, so you should only pay attention to the lower 16 bits since we're working in half precision (0xC340 instead of 0xFFFFC340).

You now have a functioning program. All the commands (abs, all, add, etc.) do something. However, only bin will produce correct results at this point.

Completing the Code

Before attempting to write any of the functions in iFloat.c, study the documentation found in the Files tab. You will be especially interested in the documentation for iFloat.h. Plan what you need to do before writing code.

The best way to complete the code is to follow a write/compile/test sequence. Do not attempt to write everything at once. Rather choose one function and do the following steps.

  1. Write some/all of one function in iFloat.c using your favorite editor.
  2. Save your changes and recompile iFloat.c using make. You will find it convenient to work with both a terminal and editor window at the same time.
  3. Repeat steps 1 and 2 until there are no errors or warnings.
  4. Test the function you have been working on. Do not attempt to move on until you complete and thoroughly test a function.
  5. Repeat steps 1 thru 4 for the remaining functions.

You should work on your functions in the following suggested order. A sample solution prepared by the author contained the shown approximate line counts (including empty lines and debugging statements):

Your code may be a little longer, but in every case, these methods are quite simple. If you find any of your solutions is much longer than stated, you will want to think about how you are approaching the problem.

Basic Bit manipulation

The binary and operation (&) will be used to extract the value of a bit and to set a bit to 0. This relies on the fact that any bit anded with 1 results in the original bit. Also, any bit anded with 0 results in 0. The binary or operation (|) is used to set a bit to 1. This relies on the the fact that any bit ored with 1 results in a 1.

You will create masks. A mask is a bit pattern that contains 0's and 1's in appropriate places so that when the binary and/or operation is performed, the result has extracted/modified the bits of interest. For example, suppose we want to extract bits 2, 3, and 4 from the 32-bit binary number 1001 0001 1010 1100 0011 1001 1110 1101 and clear all other bits. Remember that bit 0 is the rightmost bit. We can achieve this by anding this number with the mask below:

   value: 1001 0001 1010 1100 0011 1001 1111 0101 & (AND)
    mask: 0000 0000 0000 0000 0000 0000 0001 1100
  result: 0000 0000 0000 0000 0000 0000 0001 0100 = 0x14

Notice I designed this mask by placing ones in the bits I am interested in extracting while placing zeroes elsewhere. In C, this could be written as

  int value = 0x91AC39F5;
  int mask = 0x1C;
  int result = value & mask;
  printf("0x%X\n", result); // This will print 0x14

Now, suppose we want to clear bits 5, 6, and 7 from the same number while leaving the rest of the bits untouched. We can achieve this by anding this number with the mask below:

   value: 1001 0001 1010 1100 0011 1001 1111 0101 & (AND)
    mask: 1111 1111 1111 1111 1111 1111 0001 1111
  result: 1001 0001 1010 1100 0011 1001 0001 0101 = 0x91AC2915

Notice I designed the mask by placing zeroes in the bits I am interested in clearing while placing ones elsewhere. In C, this could be written as

  int value = 0x91AC39F5;
  int mask = 0xFFFFFF1F;
  int result = value & mask;
  printf("0x%X\n", result); // This will print 0x91AC3915

Now, suppose we want to set bits 28, 29, and 30 in the same number while leaving the rest of the bits untouched. We can achieve this by oring this number with the mask below:

   value: 1001 0001 1010 1100 0011 1001 1111 0101 | (OR)
    mask: 0111 0000 0000 0000 0000 0000 0000 0000
  result: 1111 0001 1010 1100 0011 1001 1111 0101 = 0xF1AC39F5

Notice I designed the mask by placing ones in the bits I am interested in setting while placing zeroes elsewhere. In C, this could be written as

  int value = 0x91AC39F5;
  int mask = 0x70000000;
  int result = value | mask;
  printf("0x%X\n", result); // This will print 0xF1AC39F5

Creating masks is an art that is mastered through practice. Sometimes, masks can be hard-coded (as above). Other times, masks depend on other variables. For example, you may not know in advance which bits you need to modify/extract. In these cases, you will need to create masks by using other operations such as negation, subtraction, and shifts (this is where creativity and hard work comes in).

In this program, you will need to create masks to extract the sign/exponent/mantissa fields out of a IEEE-754-formatted number and use shift operations to convert them to values you can use. When you have computed the answer, you will use shift and bitwise operations to reassemble the parts into the correct format.

16-Bit IEEE Floating Point Numbers (Half Precision)

In 16-bit precision, a floating point number has the following format:

For example, suppose we want to represent -3.625 as a 16-bit IEEE floating point number. First, we obtain the fixed point representation:

-3.62510 = -11.1012

Then, we normalize the number:

-11.1012 = -1.11012 * 21

From this, we can deduce the following:

Putting all three fields together, we get:

1 10000 1101000000 = 1100 0011 0100 0000

The iFloat_t Type

You may have noticed that all of the functions that you must complete accept arguments of type iFloat_t. In this program, the iFloat_t type is simply an alias of the built-in short type which can represent a 16-bit integer number. For example, suppose you run your program using:

    ./testFloat sign 125.25

Think of the iFloat_t argument in the floatGetSign function as containing the 16-bit representation of 125.25. In other words, the translation from decimal to IEEE has already been done for you. The fact that 125.25 is stored as a 16-bit integer number means that you can perform bitwise operations on it in order to extract the sign information.

Floating Point Addition

The single function floatAdd() is the only complex function in this assignment. Many of the things you need to do can be done by calling the support methods you have already written and thoroughly tested.

The general algorithm for floating point addition is as follows:

  1. Extract the exponent and value for each of the numbers. The value is the mantissa with the implicit 1 made explicit and adjusted for the sign.
  2. Equalize the exponents.
  3. Do an integer addition.
  4. Convert the two's complement back to sign-magnitude.
  5. Normalize the result using shift operations and increment/decrement the exponent appropriately.
  6. Re-assemble all the components of the result into a 16-bit value.

At this point, these steps may seem a little abstract, so let's go through a very detailed example. Suppose you run your program using:

    ./testFloat add -125.75 63.125

Remember that the conversion to IEEE format is done for you. So, at entry into the floatAdd function, we know that:

    x = -125.75  = 1101 0111 1101 1100
    y =   63.125 = 0101 0011 1110 0100

Step 1: Extract the exponent and value

First let's extract the exponent of x:

    EXPx = 0000 0000 0001 0101

Now, the value of x. First, we get the mantissa with the implicit 1 made explicit:

    0000 0111 1101 1100

However, we notice that x is a negative number because the sign bit in the IEEE representation is 1. Hence, we have to adjust the previous number by taking its 2's complement:

    VALx = 1111 1000 0010 0100

We do the same with y. However, since y is positive, we don't need to get the 2's complement:

    EXPy = 0000 0000 0001 0100
    VALy = 0000 0111 1110 0100

Step 2: Equalize the exponents

First, let's motivate this step. Recall addition in scientific notation. For example, suppose you wanted to add the following two numbers:

    1.230000 * 1012 +
    5.457000 * 1015

We cannot add 1.230000 and 5.457000 directly because the exponents are different. We need to make the exponents equal without changing the numbers. To do this, we will take the number with the smaller exponent (1.230000 * 1012) and rewrite it differently:

    0.001230 * 1015 +
    5.457000 * 1015

I incremented the exponent by 3 to make it 15. I also shifted the value (1.230000) to the right that many times (3) in order to compensate for the increase in the exponent (remember that we want our numbers to stay the same). Shifting the value to the right is actually equivalent to moving the point to the left. You may need to work through a few scientific notation examples to convince yourself of this modification. Now we're in a position to add the numbers. The result will have the common exponent:

    0.001230 * 1015 +
    5.457000 * 1015
    5.458230 * 1015

Addition in floating point works the same because the IEEE representation is just scientific notation for binary numbers. Recall from the previous step:

    EXPx = 0000 0000 0001 0101 (this is 21 in decimal)
    EXPy = 0000 0000 0001 0100 (this is 20 in decimal)

In this case, y has the smaller exponent. Hence, we need to shift the value of y to the right by 1 (because the difference in exponents is 21 - 20 = 1). The new value of y will be:

    VALy = 0000 0011 1111 0010

And the common exponent (which I will call the exponent of the result) is 21:

    EXPr = 0000 0000 0001 0101

Note: in this example, we had to shift the positive value to the right. However, in a different scenario, we may have to shift a negative value to the right. In this case, we must ensure that the sign of the number is preserved. Thankfully, when right shifting a signed number, GCC performs an arithmetic right shift which means that the leading bit is replicated as the number is shifted (as opposed to simply padding with 0's). Hence, we don't have to do anything extra as long as we are using a signed type (like iFloat_t) to store the value.

Step 3: Integer addition

Since the exponents have been equalized, we can add the values just like we did with decimal scientific notation in order to obtain the value of the result. We will add the values using integer addition (this is how you would do it in decimal scientific notation as well: you would leave the point in the same position that it appears in the operands and use normal integer arithmetic):

    VALx = 1111 1000 0010 0100 (from step 1) +
    VALy = 0000 0011 1111 0010 (from step 2)
    VALr = 1111 1100 0001 0110

Step 4: Convert the two's complement back to sign-magnitude

After performing integer addition in the previous step, we notice that the result is negative (because the leading bit is 1). This makes sense because adding -125.75 and 63.125 should produce a negative number. We will remember the fact that our final answer will be negative and we will take the 2's complement of VALr in the previous step to make it positive (we'll call this the magnitude of the result):

    SIGNr = 1
    MAGr  = 0000 0011 1110 1010

Note that taking the 2's complement is only necessary if the result of the addition is negative. If it's positive, then we know that the sign of the final answer is 0 and the magnitude of the result is just the result of the addition.

Step 5: Normalize the result

Let's first look at the information that we have available regarding our result:

    SIGNr = 1 (from step 4)
    EXPr  = 0000 0000 0001 0101 (from step 2)
    MAGr  = 0000 0011 1110 1010 (from step 4)

The only problem is that the magnitude is not normalized. Imagine that there is a binary point between bits 9 and 10:

    MAGr  = 0000 00.11 1110 1010

For the magnitude to be normalized, the leftmost 1 must be to the left of the binary point, i.e., on bit position 10 (at this moment, the leftmost 1 is on bit 9). This is analogous to the following situation in decimal scientific notation:

    0.456000 * 103

In order to normalize this number, we need to shift the value to the left (which is equivalent to moving the point to the right) and decrement the exponent to compensate. That way, the leftmost non-zero digit is to the left of the decimal point:

    4.560000 * 102

You may have to do a few examples to convince yourself that this modification works. We will do the same in our binary case. In order to get the leftmost 1 to the left of the binary point, we just have to shift the magnitude by 1 to the left and decrement the exponent by that same amount. Hence, the new exponent and magnitude are:

    EXPr  = 0000 0000 0001 0100
    MAGr  = 0000 0111 1101 0100

Note: in this example, we had to shift to the left. In a different example you may have to shift to the right or not shift at all. You also need to calculate how many times you must shift.

Step 6: Re-assemble all the components of the result into a 16-bit value

We're finally ready to assemble the final result. We have the following components:

    SIGNr = 1 (from step 4)
    EXPr  = 0000 0000 0001 0100 (from step 5)
    MAGr  = 0000 0111 1101 0100 (from step 5)

We already know the layout of a 16-bit IEEE floating point number:

We do have to be careful about clearing the 1 in bit 10 from MAGr because this is the implicit 1. Then, we can combine the components to get our final answer:
    ANSWER = 1 10100 1111010100 = 1101 0011 1101 0100

You can verify that this corresponds to -62.625.


If you want to waste many hours trying to figure out what is wrong with your code, feel free to write the entire addition function all at once. If you want to make better use of your time, follow this advice: develop incrementally. This works quite well for the addition function because it's made up of many small steps. You should only move on to the next step when you're confident that what you have so far works. This is not done by calling your TA and asking them if what you have is right (this would be doing a disservice to yourself). This is done by testing your code. For example, suppose that you have written some code for step 1. Before you start on step 2, print the results of step 1 and run a few examples to see if the output is what you expect. If you don't know what to expect, you shouldn't be coding yet. You need to make sure that you can understand the problem that we're trying to solve in this program.

In order to make debugging easier, Fritz Sieker has provided a function called debug(). You can use this function like a printf. The difference is that the debug() function will print stuff only if you run your program with the -debug option. This way, if you don't want to see the debugging output, you don't have to remove the calls to debug(). You can simply remove the -debug option from the command line. Here's an example of how to use the provided debug() and getBinary() functions to print the binary representation of an argument in the addition function:

    debug("x = %s", getBinary(x));

Here's a run of our solution with the operands shown in the example above and with debugging enabled:

    ./testFloat -debug add -125.75 63.125

    DEBUG iFloat.c[61] floatAdd() 1101-0111-1101-1100: bits of X (IEEE 754)
    DEBUG iFloat.c[62] floatAdd() 0101-0011-1110-0100: bits of Y (IEEE 754)
    DEBUG iFloat.c[72] floatAdd() STEP 1
    DEBUG iFloat.c[73] floatAdd() X sign: 1 exp: 21 val: 1111-1000-0010-0100
    DEBUG iFloat.c[74] floatAdd() Y sign: 0 exp: 20 val: 0000-0111-1110-0100
    DEBUG iFloat.c[88] floatAdd() Value of Y shifted by 1
    DEBUG iFloat.c[93] floatAdd() STEP 2
    DEBUG iFloat.c[94] floatAdd() 1111-1000-0010-0100: Value of X after equalizing exponents
    DEBUG iFloat.c[95] floatAdd() 0000-0011-1111-0010: Value of Y after equalizing exponents
    DEBUG iFloat.c[101] floatAdd() STEP 3
    DEBUG iFloat.c[102] floatAdd() 1111-1100-0001-0110: Value of result after addition
    DEBUG iFloat.c[112] floatAdd() STEP 4
    DEBUG iFloat.c[113] floatAdd() 0000-0011-1110-1010: Magnitude of result after converting from 2's complement to sign-magnitude
    DEBUG iFloat.c[126] floatAdd() STEP 5
    DEBUG iFloat.c[127] floatAdd() 0000-0000-0001-0100: Exponent of result after normalizing
    DEBUG iFloat.c[128] floatAdd() 0000-0111-1101-0100: Magnitude of result after normalizing
    DEBUG iFloat.c[134] floatAdd() STEP 6
    DEBUG iFloat.c[135] floatAdd() 1101-0011-1101-0100: Answer (IEEE 754)
    -125.75 + 63.125 --> -62.625000

Note that the autograder will run your program without the -debug option, so it doesn't matter if you follow the format above for debugging. However, if you choose to use printf statements instead, then you have to make sure you remove them before you submit your final solution. Otherwise, your grade will be affected. Here's the output of our solution with debugging disabled:

    ./testFloat add -125.75 63.125

    -125.75 + 63.125 --> -62.625000

Important Notes

We will not consider negative zero (-0.0) to be a valid value. This would correspond to bit 15 equal to 1 and the rest of the bits equal to 0. Therefore, we will not pass this value to your functions. We also expect your floatAbs, floatNegate, floatAdd, and floatSub functions not to return a negative zero.

The only IEEE-754 special case we expect you to handle is 0.0 (when the bit pattern is all zeroes). For example, your code should be able to add/subtract any number and 0.0. It should also return 0.0 correctly when the result of the addition/subtraction is 0.0. You should also negate 0.0 correctly (the negation of 0.0 is not -0.0!).

None of your functions should print anything when your program is run without the -debug option. You should only see the output from testFloat.c. If you print extra stuff, the autograder won't be able to match your output against the expected output.

Finally, you may not make use of the functions declared in the convert.h file.

Grading Criteria

Checking in Your Code

You will submit the single file iFloat.c. You can either submit it through the website or use the checkin program:

    ~cs270/bin/checkin P2 iFloat.c

Relax, you are done with your assignment!