CS270
Colorado State University

======================
Floating Point Numbers
======================

N = d_n d_{n-1} d_{n-2} ... d_0 . d_{-1} d_{-2} ... d_{-m}

N = Sum_{i=-m}^{n} d_i * r^i

if (r==10) then powers of 10 are:
    whole numbers: 1, 10, 100, ...
    fractions:     1/10, 1/100, 1/1000, ...

if (r==2) then powers of 2 are:
    whole numbers: 1, 2, 4, ...
    fractions:     1/2, 1/4, 1/8, ...

Conversion from decimal fixed-point to binary fixed-point

    13.5625_{10} = ____________________ _{2}

-------------------

IEEE Floating Point

C/C++ type: float (32 bits)

    Sign bit:      1
    Exponent bits: 8
    Mantissa bits: 23
    Precision:     ~7 decimal digits
    Range:         ~10^{+/- 38}

Normalization

Representation of a floating-point number such that the most significant
(non-zero) digit is immediately to the right of the radix point.
  - used to ensure maximum precision of stored values

Numbers are normalized by shifting the radix point left or right, one digit
at a time, until the most significant digit is immediately to the right of
the radix point. Each left shift of the radix point increases the exponent
by one; each right shift decreases it by one.

    1101.1001 normalized =

    0.000101101001 normalized =

Hidden Bit Operation

Normalizing a binary number will always result in a 1 digit immediately to
the right of the binary point.
  - take advantage of this fact by getting rid of this 1 bit
    (we'll put it back when we reverse the conversion)
  - gives us an extra bit of precision

    1101.1001 normalized + hidden bit operation =

Exponent Representation

Single precision
Excess-127/Bias-127 notation

    Real       Decimal   Binary      Hex
    Exponent   Value     Value       Value
    -----------------------------------------
    special    255       1111 1111   FF      +/- infinity and NaN
    +127       254       1111 1110   FE
    +126       253       1111 1101   FD
    +125       252       1111 1100   FC
    ...        ...       ...         ...
    +3         130       1000 0010   82
    +2         129       1000 0001   81
    +1         128       1000 0000   80
    0          127       0111 1111   7F
    -1         126       0111 1110   7E
    -2         125       0111 1101   7D
    -3         124       0111 1100   7C
    ...        ...       ...         ...
    -124       3         0000 0011   03
    -125       2         0000 0010   02
    -126       1         0000 0001   01
    special    0         0000 0000   00      zero and sub-normal #s

Now we are ready to put all the pieces together:

    Sign bit      = sign of original FP number
    Mantissa bits = mantissa/fraction bits after normalization and
                    hidden bit operation
    Exponent bits = binary exponent after normalization and
                    hidden bit operation + 127

single-precision IEEE FP representation of 13.5625 (decimal) =

    1) convert to binary fixed-point
    2) normalize and hidden bit operation
    3) store mantissa/fraction bits and bias-127 exponent bits
       (real binary exponent + 127)

We will represent the resulting internal bit pattern for floating point
numbers in hex.

Now let's start with an internal bit pattern representing a
single-precision IEEE FP number

    0xC1590000 =

    FP value = +/- 1.mantissa * 2^(stored exp - 127)
                   ^
                   |
                   |
                   hidden bit

More examples of decimal FP to single-precision IEEE FP representation

    2.5

    -2

    0.625

Some examples of IEEE FP representation to decimal FP

    0x80000000

    0xC5000000

Special Cases

    if ( stored exp == 0 && mantissa == 0 )
        value = 0 (the sign bit gives +0 or -0)

    if ( stored exp == 0 && mantissa != 0 )
        value = 0.mantissa * 2^(stored exp - 126)
        (that is, 0.mantissa * 2^(-126), since stored exp is 0; no hidden bit)

    if ( stored exp == 255 && mantissa == 0 )
        value = + or - infinity

    if ( stored exp == 255 && mantissa != 0 )
        value = NaN

Smallest Positive Value

    0x00000001 = ~1.4E-45

Largest Positive Value

    0x7F7FFFFF = ~3.4E38

------------------------------------------------------------------------

Questions you should be able to answer at the end of this unit

  - Convert numbers between decimal floating point and IEEE floating point.
  - What are the smallest and largest exponents in IEEE 32-bit floating point?

------------------------

Copied and revised with permission from Rick Ord's CS 30 notes.
mstrout@cs.colostate.edu, 9/3/08
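
------------------------

Checking hand conversions

The conversions in these notes can be checked by machine. The following is a minimal Python sketch, not part of the original notes. The helper names float_to_bits and decode_bits are made up for illustration. It assumes Python's standard struct module, whose "f" format packs a value as a 32-bit IEEE single. decode_bits applies the sign / stored exponent / mantissa rules and the special cases listed in these notes.

```python
import struct

BIAS = 127  # excess-127 bias used for single-precision exponents


def float_to_bits(x):
    """Return the 32-bit IEEE single-precision pattern of x as an int.

    Hypothetical helper: packs x as a big-endian float and reads the same
    four bytes back as an unsigned 32-bit integer.
    """
    return struct.unpack(">I", struct.pack(">f", x))[0]


def decode_bits(bits):
    """Decode a 32-bit pattern by hand, following the rules in the notes."""
    sign = -1.0 if (bits >> 31) & 1 else 1.0
    stored_exp = (bits >> 23) & 0xFF      # 8 exponent bits
    mantissa = bits & 0x7FFFFF            # 23 mantissa bits
    fraction = mantissa / 2**23           # the stored bits read as 0.mantissa

    if stored_exp == 255:                 # special: infinity or NaN
        return sign * float("inf") if mantissa == 0 else float("nan")
    if stored_exp == 0:                   # special: zero and sub-normals
        return sign * fraction * 2.0 ** (-126)   # no hidden bit here
    # Normal case: put the hidden 1 bit back in front of the fraction.
    return sign * (1.0 + fraction) * 2.0 ** (stored_exp - BIAS)


if __name__ == "__main__":
    # Print patterns in hex, the way the notes write them.
    for value in (13.5625, 2.5, -2.0, 0.625):
        print(value, "->", "0x%08X" % float_to_bits(value))
    for pattern in (0xC1590000, 0x80000000, 0xC5000000):
        print("0x%08X" % pattern, "->", decode_bits(pattern))
```

Work each example by hand first, then run the script to compare. decode_bits returns a Python float (a double), so every single-precision value, including the smallest sub-normal, is represented exactly.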