Last time: Logical ops (Sec. 2.5), Sort Ex. (Sec 2.13)
Today: Floating point Sec. 3.5 pp. 242-250, 259-266
Next time: Sections 2.4, 2.5 (instruction formats; left off the syllabus by mistake)
Floating Point Numbers in C and Java:
We know floating point numbers from C/Java. From the app standpoint:
float x; 32-bit floating point, only 6-7 decimal digits of significance, avoid if possible
double x; 64-bit floating point, about 15 decimal digits of significance, usually plenty
int x; (32 bits) has 9-decimal-digit capacity. Max positive = 2G-1, approx 2 billion. Max negative = -2G. Can be a problem.
So double can hold whole numbers beyond what int can, sometimes a useful fact.
Java long x; (C long long x): (64 bits) has 18-decimal digit capacity.
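A quick C check of these limits (a minimal sketch using the standard headers; the exact float output is platform-dependent, but the limits shown are the usual 32/64-bit ones):

    #include <stdio.h>
    #include <limits.h>

    int main(void) {
        printf("INT_MAX = %d\n", INT_MAX);   /* 2G-1 = 2147483647, about 2 billion */
        printf("INT_MIN = %d\n", INT_MIN);   /* -2G = -2147483648 */

        /* double holds whole numbers exactly up to 2^53, far past INT_MAX */
        double d = 9007199254740992.0;       /* 2^53, still exact in a double */
        printf("2^53 as double = %.0f\n", d);

        /* float runs out of digits past 6-7 significant decimal digits */
        float f = 123456789.0f;
        printf("123456789 as float  = %.1f\n", f);   /* prints a nearby value, not exact */
        printf("123456789 as double = %.1f\n", 123456789.0);
        return 0;
    }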
Format of Floating Point:
· 1 sign bit
· exponent bits: 8 for single precision, 11 for double precision (IEEE 754)
· fraction bits: 23 for single, 52 for double
· Total bits: 32 vs 64
Number of bits in fraction determines number of significant decimal digits:
2^23 different fraction values: 2^23 = 2^3 * 2^20 = 8M = 8*10^6 approx., nearly 10^7, so we say 6-7 digits for float
2^52 = (2^10)^5 * 2^2, and 2^10 is approx. 10^3, so 2^52 = (10^3)^5 * 4 = 4*10^15 approx., giving 15 digits for double
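The standard header <float.h> publishes the same digit counts directly (a small sketch; FLT_DIG and DBL_DIG are the guaranteed decimal digits, which is where the 6-7 and 15 figures come from):

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* decimal digits guaranteed to survive a round trip through each type */
        printf("float:  %d significant decimal digits (23-bit fraction)\n", FLT_DIG);  /* 6 for IEEE single */
        printf("double: %d significant decimal digits (52-bit fraction)\n", DBL_DIG);  /* 15 for IEEE double */

        /* the 2^23 and 2^52 counts worked out above */
        printf("2^23 = %ld (about 8 million, nearly 10^7)\n", 1L << 23);
        printf("2^52 = %lld (about 4 * 10^15)\n", 1LL << 52);
        return 0;
    }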
Idea of Normalization, in decimal: 23.456 = 2.3456 x 10^1, 44 x 10^6 = 4.4 x 10^7, .0012 = 1.2 x 10^-3. This is called normalized scientific notation.
But for floating point representations, we need to do this in binary.
Ex: 23 = 0x17 = 00010111, so normalized 1.0111 x 2^4
4M = 1.000 x 2^22
4.25M = (4 + ¼) x 2^20 = (1 + 1/16) x 2^22 = 1.0001 x 2^22
Note all normalized binary reps start with 1, so the 1 is suppressed (0 is a special case).
For 23, the actual fraction stored in the rep is 0111000..., that is, the leading 1 is dropped.
The exponent is tricky too: it is “biased” by a certain number (127 for single precision, 1023 for double), i.e. the stored exponent = real exponent + bias, so that floating point numbers sort correctly by plain binary comparison.
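To see normalization, the hidden 1, and the bias working together, here is a minimal C sketch (make_float is a hypothetical helper, not a library function) that packs a sign, a real exponent, and 23 fraction bits into a word and reinterprets the bits as a float; it rebuilds the 4M and 4.25M examples above:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* hypothetical helper: pack sign, real exponent, and 23 fraction bits into a float */
    static float make_float(uint32_t sign, int real_exp, uint32_t frac23) {
        uint32_t bits = (sign << 31) | ((uint32_t)(real_exp + 127) << 23) | frac23;
        float f;
        memcpy(&f, &bits, sizeof f);   /* reinterpret the 32-bit pattern as a float */
        return f;
    }

    int main(void) {
        /* 4M = 1.000 x 2^22: fraction bits all zero */
        printf("%.1f\n", make_float(0, 22, 0));          /* 4194304.0 */
        /* 4.25M = 1.0001 x 2^22: fraction bits 0001 then zeros */
        printf("%.1f\n", make_float(0, 22, 0x080000));   /* 4456448.0 */
        return 0;
    }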
Ex: 23.0 in single precision
sign = 0
exponent = 127 + 4 = 131 = 10000011 (binary, in 8 bits) <-- 127 is the “bias” added to 4, the real exponent
fraction = 01110000... total of 23 bits
Ex: 23.0 in double precision
sign = 0
exponent = 1023 + 4 = 1027 = 10000000011 (binary, in 11 bits)
fraction = 01110000... total of 52 bits
Negative number example (also a negative exponent) from the book: -0.75 = -(1/2 + ¼) = binary -0.11 = -1.1 x 2^-1
sign = 1
exponent = 127 - 1 = 126 = 01111110 for single prec; 1023 - 1 = 1022 for double, 2 down from 1024, i.e. 01111111110
fraction = 1000... either way
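The hand results above can be checked by copying the bits of a float into an integer and splitting out the three fields (a minimal sketch; the layout is the single-precision one described above, and show_fields is just an illustrative name):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* print sign, biased exponent, and fraction field of a single-precision value */
    static void show_fields(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        printf("%g: sign=%u  exponent=%u  fraction=0x%06x\n",
               f, (unsigned)(bits >> 31), (unsigned)((bits >> 23) & 0xFF),
               (unsigned)(bits & 0x7FFFFF));
    }

    int main(void) {
        show_fields(23.0f);    /* sign=0, exponent=131, fraction=0x380000 (0111 then zeros) */
        show_fields(-0.75f);   /* sign=1, exponent=126, fraction=0x400000 (1000 then zeros) */
        return 0;
    }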
Special cases: 0 = 0...0; infinity and -infinity (exponent field all 1s, fraction 0); NaN (exponent field all 1s, fraction nonzero).
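A short C sketch of the special cases (INFINITY and NAN come from the standard <math.h>; with IEEE arithmetic, dividing by zero produces an infinity or NaN instead of trapping):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        float zero = 0.0f;
        printf("1/0  = %f\n",  1.0f / zero);    /* inf */
        printf("-1/0 = %f\n", -1.0f / zero);    /* -inf */
        printf("0/0  = %f\n",  zero / zero);    /* nan */
        printf("INFINITY > 1e38? %d\n", INFINITY > 1e38f);  /* 1: bigger than any finite float */
        printf("NAN == NAN?      %d\n", NAN == NAN);        /* 0: NaN compares unequal to everything */
        return 0;
    }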
Floating point instructions, pp. 259-266
add.s, add.d
sub.s, sub.d
mul.s, mul.d
div.s, div.d
comparison, branch
FP Registers: $f0, $f1, ..., $f31
Load: lwc1 $f4, 8($a0) (“load word coprocessor 1”, 32 bits)
Store: swc1 $f4, 0($s1)
Double prec: need 2 loads, 2 stores; the assembler provides pseudo-instructions l.d, s.d:
l.d $f4, 8($a0) # loads $f4 and $f5 from 8 bytes starting at 8($a0)
s.d $f4, 0($s1)
Even-numbered FP registers can be used for double precision, and each double uses 2 registers:
add.d $f2, $f4, $f6 : adds $f4 to $f6 and puts the result in $f2, also using $f3, $f5, and $f7
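The reason a double needs two loads, two stores, and a register pair is simply that it is two 32-bit words. A small C sketch (the word order shown is for a little-endian host; a big-endian MIPS would swap the halves):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        double d = 23.0;
        uint32_t words[2];
        memcpy(words, &d, sizeof d);    /* split the 64-bit double into two 32-bit words */
        /* on a little-endian host this prints 0x00000000 then 0x40370000 for 23.0 */
        printf("low word  = 0x%08x\n", (unsigned)words[0]);
        printf("high word = 0x%08x\n", (unsigned)words[1]);
        return 0;
    }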
Example of Code using Floating Point variables
Floating point arguments, return values
Look at handout: argument in $f12, result in $f0, as suggested in the book
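At the C level, a routine like the handout's looks like a function that takes and returns a float (this is a hypothetical stand-in, not the handout's code); compiled for MIPS, x arrives in $f12 and the result is left in $f0 as noted above:

    #include <stdio.h>

    /* hypothetical example: single float argument, single float result */
    float half(float x) {
        return x / 2.0f;    /* argument passed in $f12, result returned in $f0 on MIPS */
    }

    int main(void) {
        printf("%f\n", half(23.0f));    /* 11.5 */
        return 0;
    }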