CS641 Class 6

Last time: Logical ops (Sec. 2.5), Sort Ex. (Sec. 2.13)

Today: Floating point Sec. 3.5 pp. 242-250, 259-266

Next time: Sections 2.4, 2.5 (instruction formats; accidentally left off the syllabus)

 

Floating Point Numbers in C and Java:

We know floating point numbers from C/Java.  From the app standpoint:

float x;  32-bit floating point, only 6-7-decimal-digit significance, avoid if possible

double x;  64-bit floating point, has 15-decimal-digit significance, usually plenty

int x;  (32 bits) has 9-decimal-digit capacity. Max positive = 2G-1, approx 2 billion.  Max negative = -2G.  Can be a problem.

So double can hold whole numbers beyond what int can, sometimes a useful fact.

Java long x;  (C long long  x):  (64 bits) has 18-decimal digit capacity.
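
These sizes are easy to check directly; a minimal C sketch (the values are chosen only for illustration):

#include <stdio.h>
#include <limits.h>

int main(void) {
    /* float keeps only about 6-7 significant decimal digits */
    float  f = 123456789.0f;       /* 9 significant digits requested */
    double d = 123456789.0;        /* double holds all of them */
    printf("float : %.1f\n", f);   /* prints a nearby value, not 123456789.0 */
    printf("double: %.1f\n", d);   /* prints 123456789.0 exactly */

    printf("INT_MAX = %d\n", INT_MAX);   /* 2147483647, i.e. 2G-1 */

    /* double holds whole numbers well beyond int range (exactly, up to 2^53) */
    double big = 4000000000.0;     /* 4 billion: overflows int, fine in double */
    printf("big = %.1f\n", big);
    return 0;
}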

 

Format of Floating Point:

·         1 sign bit

·         exponent bits: 8 for single precision, 11 for double precision (IEEE 754)

·         fraction bits: 23 for single, 52 for double

·         Total bits: 32 vs 64

Number of bits in fraction determines number of significant decimal digits:

2^23 different fraction values: 2^23 = 2^20*2^3 = 8M = 8*10^6 approx., nearly 10^7, so we say 6-7 digits for float

2^52 = (2^10)^5*2^2 = approx. (10^3)^5*4 = 4*10^15, so 15 digits for double
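
The standard header <float.h> records these same counts, so the estimates above can be checked directly; a small C sketch:

#include <stdio.h>
#include <float.h>

int main(void) {
    /* significand bits, counting the hidden leading 1: 24 and 53 under IEEE 754 */
    printf("float  significand bits: %d\n", FLT_MANT_DIG);
    printf("double significand bits: %d\n", DBL_MANT_DIG);
    /* guaranteed decimal digits: typically 6 for float, 15 for double */
    printf("float  decimal digits: %d\n", FLT_DIG);
    printf("double decimal digits: %d\n", DBL_DIG);
    return 0;
}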

Idea of Normalization, in decimal: 23.456 = 2.3456 x 10^1, 44 x 10^6 = 4.4 x 10^7, .0012 = 1.2 x 10^-3. This is called normalized scientific notation.

But for floating point representations, we need to do this in binary.  

Ex: 23 = 0x17 = 00010111, so normalized 1.0111 x 2^4

4M = 1.000 x 2^22

4.25M = (4 + 1/4) x 2^20 = (1 + 1/16) x 2^22 = 1.0001 x 2^22

Note all normalized binary reps start with 1, so the 1 is suppressed (0 is a special case).

For 23, the actual fraction stored in the rep is 0111000..., that is, the leading 1 is dropped.

The exponent is tricky too: it is “biased” by a certain number (127 for single precision, 1023 for double) so that floating point numbers sort correctly by binary comparison of their bit patterns.
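
The sorting claim can be seen by reading the bit patterns of positive floats back as unsigned integers: because the biased exponent sits in the high-order bits, a bigger float has a bigger pattern. A minimal C sketch, assuming 32-bit unsigned int and IEEE 754 floats:

#include <stdio.h>
#include <string.h>

/* raw 32-bit pattern of a float (assumes unsigned int is 32 bits) */
static unsigned int bits_of(float x) {
    unsigned int u;
    memcpy(&u, &x, sizeof u);   /* copy the bytes, no numeric conversion */
    return u;
}

int main(void) {
    printf("0.5  -> 0x%08X\n", bits_of(0.5f));    /* 0x3F000000 */
    printf("1.0  -> 0x%08X\n", bits_of(1.0f));    /* 0x3F800000 */
    printf("23.0 -> 0x%08X\n", bits_of(23.0f));   /* 0x41B80000 */
    /* the patterns increase just as the numbers do */
    return 0;
}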

Ex: 23.0 in single precision

sign = 0

exponent = 127 + 4 = 131 = 10000011 (in 8 bits)   <-- 127 is the “bias” added to the 4, the real exponent

fraction = 01110000... total of 23 bits

Ex: 23.0 in double precision

sign = 0

exponent = 1023 + 4 = 1027 = 10000000011 (in 11 bits)

fraction = 01110000... total of 52 bits
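
Both encodings can be checked by pulling the fields back out of the stored bits. A minimal C sketch for single precision; the same idea with an 11-bit exponent and 52-bit fraction gives the 64-bit pattern 0x4037000000000000 for 23.0 in double precision:

#include <stdio.h>
#include <string.h>

int main(void) {
    float x = 23.0f;
    unsigned int u;
    memcpy(&u, &x, sizeof u);                     /* raw pattern: 0x41B80000 */

    unsigned int sign     = u >> 31;              /* 1 bit  */
    unsigned int exponent = (u >> 23) & 0xFF;     /* 8 bits */
    unsigned int fraction = u & 0x7FFFFF;         /* 23 bits */

    printf("pattern  = 0x%08X\n", u);
    printf("sign     = %u\n", sign);              /* 0 */
    printf("exponent = %u\n", exponent);          /* 131 = 127 + 4 */
    printf("fraction = 0x%06X\n", fraction);      /* 0x380000 = 0111 then zeros */
    return 0;
}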

Negative number example (also a negative exponent) from the book: -0.75 = -(1/2 + 1/4) = binary -0.11 = -1.1 x 2^-1

sign = 1

exponent = 127 - 1 = 126 = 01111110 for single precision; 1023 - 1 = 1022 = 01111111110 for double, 2 down from 1024

fraction = 1000... either way
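
Going the other direction, the three fields for -0.75 can be assembled by hand and reinterpreted as a float; a minimal single-precision C sketch:

#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned int sign     = 1;
    unsigned int exponent = 126;         /* 127 - 1 */
    unsigned int fraction = 0x400000;    /* 1000... in 23 bits */
    unsigned int u = (sign << 31) | (exponent << 23) | fraction;

    float x;
    memcpy(&x, &u, sizeof x);            /* reinterpret the bits as a float */
    printf("0x%08X decodes to %f\n", u, x);   /* 0xBF400000 -> -0.750000 */
    return 0;
}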

Special cases: 0 = 0...0, +infinity, -infinity, NaN (the last three use an all-ones exponent field; NaN also has a nonzero fraction).
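
These special values can also be produced and inspected from C; a small sketch (exact NaN fraction bits vary by machine, so only the field layout is commented):

#include <stdio.h>
#include <string.h>
#include <math.h>

static unsigned int bits_of(float x) {
    unsigned int u;
    memcpy(&u, &x, sizeof u);
    return u;
}

int main(void) {
    float zero = 0.0f;                 /* all bits zero */
    float pinf = INFINITY;             /* exponent all ones, fraction 0 */
    float ninf = -INFINITY;
    float qnan = NAN;                  /* exponent all ones, fraction nonzero */

    printf("0.0  -> 0x%08X\n", bits_of(zero));   /* 0x00000000 */
    printf("+inf -> 0x%08X\n", bits_of(pinf));   /* 0x7F800000 */
    printf("-inf -> 0x%08X\n", bits_of(ninf));   /* 0xFF800000 */
    printf("NaN  -> 0x%08X  isnan=%d\n", bits_of(qnan), isnan(qnan));
    return 0;
}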

Floating point instructions, pp. 259-266

add.s, add.d

sub.s, sub.d

mul.s, mul.d

div.s, div.d

comparison, branch

FP Registers:  $f0, $f1, ..., $f31

Load:   lwc1 $f4, 8($a0)   (load word coprocessor 1, 32 bits)

Store:  swc1 $f4, 0($s1)

Double precision needs 2 loads, 2 stores; the assembler provides pseudo-instructions l.d, s.d:

l.d $f4, 8($a0)  # loads $f4 and $f5 from 8 bytes starting at 8($a0)

s.d $f4, 0($s1)  # stores $f4 and $f5 into 8 bytes starting at 0($s1)

Even-numbered registers are the ones used for double precision, and each double uses 2 registers:

add.d $f2, $f4, $f6 : adds $f4 to $f6 and puts the result in $f2, using also $f3, $f5, and $f7
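
As a rough picture of where these instructions show up, here is a small C function that does one double-precision load, add, and store; the MIPS in the comments is one plausible hand translation (assuming a arrives in $a0 and s in $a1), not compiler output or anything from the handout:

/* add the doubles at a[0] and a[1], store the sum through s */
void add_pair(double *a, double *s) {
    /* with a in $a0 and s in $a1, one hand translation might be:
         l.d   $f4, 0($a0)      # a[0] into $f4/$f5
         l.d   $f6, 8($a0)      # a[1] into $f6/$f7
         add.d $f2, $f4, $f6    # sum into $f2/$f3
         s.d   $f2, 0($a1)      # store through s
         jr    $ra
    */
    *s = a[0] + a[1];
}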

Example of Code using Floating Point variables

Floating point arguments, return values

Look at handout:  argument in $f12, result in $f0, as suggested in the book
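
The handout is not reproduced here; the sketch below is a hypothetical one-argument function in C, just to show the shape of the convention: the caller puts the (double) argument in $f12/$f13 and the callee leaves the result in $f0/$f1.

/* hypothetical example, not the handout */
double half(double x) {
    /* x arrives in $f12/$f13; the returned value must end up in $f0/$f1,
       e.g. with 2.0 loaded into $f2/$f3:  div.d $f0, $f12, $f2 */
    return x / 2.0;
}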