Floating point arithmetic derives its name from something that happens when you use exponential notation.
Consider the number 123: it can be written using exponential notation as:
Notice how the decimal point "floats" within the number as the exponent is changed.
This phenomenon gives floating point numbers their name. Only two of the representations of the number 123
above are in any kind of standard form. The first representation, 1.23 * 10^2, is in a form called
"scientific notation", and is distinguished by the normalization of the significand:
etc. All of these representations of the number 123 are numerically equivalent.
They differ only in their "normalization": where the decimal point appears in the first number.
In each case, the number before the multiplication operator ("*") represents the significant figures in the
number (which distinguish it
from other numbers with the same normalization and exponent); we will call this number the "significand"
(also called the "mantissa" in other texts, which call the exponent the "characteristic").
In scientific notation, the significand is always a number
greater than or equal to 1 and less than 10.
Standard computer normalization for floating point numbers follows the fourth form in the list above:
the significand is greater than or equal to .1, and is always less than 1.
Of course, in a binary computer, all numbers are stored in base 2 instead of base 10; for this reason, the
normalization of a binary
floating point number simply requires that there be no leading zeroes after the binary point (just as the decimal
point separates the 10^0 place from the 10^-1 place, the binary point separates
the 2^0 place from the 2^-1 place). We will continue to use the decimal number system
for our numerical examples, but the impact of the computer's use of the binary number system will be felt as we
discuss the way those numbers are stored in the computer.
For this reason, we will discuss both the IEEE standards as well as the floating point formats implemented in
the very common Intel chips (such as the 80387, 80486 and the Pentium series). Each of these formats has a name
like "single precision" or "double precision", and specifies the numbers of bits
which are used to store both the
exponent and the significand. We will define the notion of "precision" in the following way: if the
significand is stored in n bits, it can represent a decimal number between 0 and 2^n - 1 (since a
significand is stored as an unsigned integer). If we find the largest number "m"
such that 10^m - 1
is less than or equal to 2^n - 1, m will be the precision. Consider the following:
The following table describes the IEEE standard formats as well as those used in common Intel processors:
Exponents are commonly stored in these formats as unsigned integers; however, an exponent can be negative as
well as positive, and so we must have some technique for representing negative exponents using unsigned integers.
This technique is called "biasing": a positive number is added to the exponent before it is stored in the
floating point number. The stored exponent is then called a "biased exponent". If the exponent contains
8 bits, the bias number 127 is added to the exponent before it is stored so that, for example, an exponent of
1 is stored as 128. Since the unsigned exponent can represent numbers between 0 and 255, it should be theoretically
possible to store exponents whose values range from -127 to +128 (-127 would be stored as the biased exponent value
0, and +128 would be stored as the biased value 255). In practice, the IEEE specification reserves
the values 0 and 255, which means that an 8 bit exponent can represent exponent values between -126 and +127.
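As a minimal sketch of biasing (not part of the original text), the following Python fragment extracts the stored exponent from a number packed in IEEE single precision. The name stored_exponent is our own invention, and the sketch assumes the IEEE convention in which the significand is normalized to lie between 1 and 2, so that the number 2 has an exponent of 1.

```python
import struct

BIAS = 127  # bias used with an 8 bit exponent


def stored_exponent(x):
    """Return the 8 bit biased exponent field of x packed as IEEE single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # Shift past the 23 significand bits, then mask off the sign bit.
    return (bits >> 23) & 0xFF


for x in (0.5, 1.0, 2.0):
    biased = stored_exponent(x)
    print(x, "stored exponent:", biased, "actual exponent:", biased - BIAS)
```

The number 2 has an exponent of 1 under this convention, and the stored field is 128, matching the description above.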
If the stored (biased) exponent has the value 0, and the significand is 0 as well, the value of the floating point
number is exactly 0. A floating point number with a stored exponent of 0 and a nonzero significand is of
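course unnormalized. If the stored exponent has the value 255 (all ones), the floating point number has one of
two special meanings: if the significand is 0, the number represents infinity (positive or negative, according to
the sign bit); if the significand is nonzero, the number is a NaN ("not a number").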
For our first example, consider the sum
Representation errors can also occur during multiplication. Consider the product 125 * 21. This product is
represented in normalized form as
It is therefore useful to know the range of exponents which your computer can represent. As an example,
the Intel double precision format supports exponents in the range -1,022 to +1,023. Since these are
exponents of 2, the range of numbers which can be represented as floating point doubles in an Intel
CPU is 2^-1,022 to 2^1,023, which is
(approximately) decimal 2.225 * 10^-308 to
8.988 * 10^307. Any result whose exponent is larger than this range allows will result in an overflow
error (a result whose exponent is too small underflows instead).
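The limits quoted above can be checked with a short Python sketch. It assumes a Python implementation whose float type is an IEEE double (true of the common ones), and the helper name overflows is hypothetical.

```python
import sys

smallest = 2.0 ** -1022       # smallest normalized double
largest_power = 2.0 ** 1023   # largest power of 2 that fits

print(f"{smallest:.3e}")                # 2.225e-308
print(f"{largest_power:.3e}")           # 8.988e+307
print(smallest == sys.float_info.min)   # True


def overflows(exponent):
    """Report whether 2.0 raised to the given power overflows a double."""
    try:
        2.0 ** exponent
    except OverflowError:
        return True
    return False


print(overflows(1023), overflows(1024))
```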
It is worth noting that floating point operations are much slower than their corresponding fixed point
counterparts. For example, on a 1.4 GHz Pentium 4 CPU, two 32 bit fixed point numbers can be added in
about 0.71 billionths of a second (0.71 nanoseconds, roughly one clock cycle). A fixed point multiply of
two 32 bit numbers may take 4 or 5 times longer. By comparison, floating point
operations may take tens or even hundreds of times longer to perform.
The next section is devoted to the representation of characters in the binary computer.
©2002, Kenneth R. Koehler. All Rights Reserved. This document may be freely
reproduced provided that this
copyright notice is included.
Floating Point Formats
Over the years, floating point formats in computers have not exactly been standardized. While the IEEE
(Institute of Electrical and Electronics Engineers) has developed standards in this area, they have not been
universally adopted. This is due in large part to the issue of "backwards compatibility": when a hardware
manufacturer designs a new computer chip, they usually design it so that programs which ran on their old
chips will continue to run in the same way on the new one. Since there was no standardization in floating point
formats when
the first floating point processing chips (often called "coprocessors" or "FPU"s: "Floating Point Units") were
designed, there was no rush among computer designers to conform to the IEEE floating point standards
(although the situation has improved with time).
From the last example, it is easy to see that a 20 bit significand provides just over 6 decimal digits of
precision. In the other examples, there is more precision than we have indicated. For example, a 16 bit
significand is certainly sufficient to represent many decimal numbers with more than 4 digits; however, not all
5 digit decimal numbers can be represented in 16 bits, and so the precision of a 16 bit significand is said to
be "> 4" (but less than 5). Some texts attempt to more accurately describe the precision using fractions, but
we do not feel the need to do so.
    2^4 - 1  = 15           10^1 - 1 = 9
    2^8 - 1  = 255          10^2 - 1 = 99
    2^12 - 1 = 4,095        10^3 - 1 = 999
    2^16 - 1 = 65,535       10^4 - 1 = 9,999
    2^20 - 1 = 1,048,575    10^6 - 1 = 999,999
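The definition of precision used in this section (the largest m such that 10^m - 1 is less than or equal to 2^n - 1) is easy to compute directly. The following Python sketch is our own illustration; the function name decimal_precision is hypothetical.

```python
def decimal_precision(n_bits):
    """Largest m such that 10**m - 1 <= 2**n_bits - 1."""
    largest = 2 ** n_bits - 1
    m = 0
    # Keep raising m while every (m + 1) digit decimal number still fits.
    while 10 ** (m + 1) - 1 <= largest:
        m += 1
    return m


for n in (4, 8, 12, 16, 20, 23, 52, 64):
    print(n, "bits:", decimal_precision(n), "decimal digits")
```

For 4, 8, 12, 16 and 20 bits this gives 1, 2, 3, 4 and 6 digits, in agreement with the comparisons listed above.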
Note first that all of the formats reserve one bit to store the sign of the number; this is necessary
because the significand is stored as an unsigned fraction in all of these formats (often the first bit of the
significand is not even stored, because it is always 1 in a properly normalized floating point number). The
rows describing the IEEE extended formats specify the minimum number of bits which the exponent and significand must
have in order to satisfy the standard. The Intel "internal" format is an extended precision format used inside
the CPU chip, which allows consecutive floating point operations to be performed with greater precision than
that which will eventually be stored.
    Precision              Sign          Exponent      Significand   Total Length   Decimal digits
                           (# of bits)   (# of bits)   (# of bits)   (in bits)      of precision
    IEEE / Intel single    1             8             23            32             > 6
    IEEE single extended   1             >= 11         >= 32         >= 44          > 9
    IEEE / Intel double    1             11            52            64             > 15
    IEEE double extended   1             >= 15         >= 64         >= 80          > 19
    Intel internal         1             15            64            80             > 19
In general, if n bits are used to store the exponent, the bias value is 2^(n-1) - 1, the range of
exponents which can be represented is from
-2^(n-1) + 2 to +2^(n-1) - 1, and
a biased exponent of 2^n - 1 indicates either infinity or a NaN (as above). The
Intel double precision floating point format, which has an 11 bit exponent field, uses a bias of
2^10 - 1 = 1,023,
and can represent exponents from
-2^10 + 2 = -1,022 to +2^10 - 1 = +1,023.
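These general formulas can be written out as a small Python sketch. The function name exponent_layout and the order of the returned values are our own choices for illustration.

```python
def exponent_layout(n_bits):
    """Return (bias, smallest exponent, largest exponent, reserved biased value)."""
    bias = 2 ** (n_bits - 1) - 1
    smallest = -(2 ** (n_bits - 1)) + 2   # biased value 1
    largest = 2 ** (n_bits - 1) - 1       # biased value 2**n_bits - 2
    reserved = 2 ** n_bits - 1            # all ones: infinity or NaN
    return bias, smallest, largest, reserved


for n in (5, 8, 11):
    print(n, "bit exponent:", exponent_layout(n))
```

With 8 bits this reproduces the bias of 127 and the range -126 to +127; with 11 bits it gives the bias of 1,023 and the range -1,022 to +1,023.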
Note that while all of our examples use the decimal number system (for your convenience), the computer uses
binary as the base for the exponents as well (although in the past, some computers used 16 as the base for the
exponents). So, for example, the number 1 has a normalized binary floating point value of
.1_2 * 2^1 (with a 1 in the 2^-1 place, this is equivalent to 1/2 * 2);
the number 3 has the normalized binary floating point value of
.11_2 * 2^2 (with a 1 in the 2^-1 place and a 1 in the 2^-2
place, this is equivalent to (1/2 + 1/4) * 4), etc.
In contrast, of course, you would represent these numbers in decimal as
.1 * 10^1
and
.3 * 10^1, respectively.
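Python's standard math.frexp function happens to use the same normalization as this section: it splits a number into a significand that is at least 1/2 and less than 1, and a power of 2. The sketch below uses it to reproduce the two binary examples; the helper binary_fraction is our own and simply prints the significand's binary digits.

```python
import math


def binary_fraction(m, max_bits=10):
    """Write a fraction 0 <= m < 1 as binary digits after the binary point."""
    digits = ""
    while m and len(digits) < max_bits:
        m *= 2
        digit = int(m)       # the next binary digit is the integer part
        digits += str(digit)
        m -= digit
    return "." + digits


for value in (1.0, 3.0):
    significand, exponent = math.frexp(value)
    print(value, "=", binary_fraction(significand), "(base 2) * 2^" + str(exponent))
```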
Examples
In order to illustrate some of the details of floating point arithmetic, we will consider an imaginary floating
point format in which the exponent is stored in 5 bits, the significand is stored in 10 bits, and 1 bit is used to
store the sign of the number. Using exponent biasing and reserving the values 0 and 31 (2^5 - 1), our
bias value will be 15 and our exponent will therefore be able to represent the values -14 to 15. Since the
significand is stored in 10 bits, and 2^10 - 1 = 1,023, we see that our imaginary format provides us
with three decimal digits of precision (since all of the numbers from 0 to 999 fit in 10 bits, but not all
those from 0 to 9,999 fit). We will do all of our examples using decimal, but always keep in mind that
the computer always uses binary!
122 + 12.
We first normalize these numbers as
.122 * 10^3 and .12 * 10^2.
But already there is a complication: we can't simply add two decimal numbers which are multiplied
by different exponents! That is, the answers
.242 * 10^3 or .242 * 10^2
are obviously incorrect! To solve this problem, the number with the smaller exponent must be denormalized
before the addition can take place:
.12 * 10^2 becomes .012 * 10^3.
Now it is clear that we can simply add the decimal numbers, since a * 10^x + b * 10^x =
(a + b) * 10^x, and we get the answer
.134 * 10^3
which is of course 134. If the relative sizes of the two numbers are too different, we may have one of two
errors. As an example of the first type of error, consider the sum 1220 + 14. In our hypothetical computer,
these numbers are normalized as
.122 * 10^4 and .14 * 10^2.
But when we denormalize the smaller number before adding, the fact that we have only 3 decimal digits of
precision causes a truncation error:
.14 * 10^2 becomes .001 * 10^4;
that is, the second significant digit was lost: it was denormalized out of existence. In fact, if we consider the
sum 1220 + 1.4, we see that the second operand (1.4) is denormalized to zero:
.14 * 10^1 becomes .000 * 10^4!
This is called an "underflow" error; it and the truncation errors are called
"representation errors".
You are already familiar with representation errors: some numbers
have no finite representation in the decimal number system, such as 1/3 (which cannot be written down
as a finite string of numbers to the right of the decimal point).
For the same reasons, many numbers have no finite representation
in binary. This includes all so-called "non-terminating" numbers in decimal, as well as any fraction with a
power of 5 in the denominator (which may have finite representations in decimal; e.g., 1/5 = .2).
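One way to see which fractions have a finite binary representation is to compare the value the computer actually stores with the exact fraction. The Python sketch below does this with the standard fractions module; the function name is_exact_in_binary is hypothetical, and the sketch assumes that Python floats are binary floating point doubles.

```python
from fractions import Fraction


def is_exact_in_binary(numerator, denominator):
    """True if numerator/denominator is stored without representation error."""
    stored = Fraction(numerator / denominator)   # exact value of the stored double
    return stored == Fraction(numerator, denominator)


for num, den in ((1, 2), (3, 8), (1, 5), (1, 3)):
    print(f"{num}/{den}:", is_exact_in_binary(num, den))
```

Here 1/2 and 3/8 are stored exactly, while 1/5 (finite in decimal) and 1/3 (finite in neither base) are not.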
And of course there is representation error any time you need more precision than the computer provides:
1273 is represented on our hypothetical computer as
.127 * 10^4,
because, again, we only have 3 decimal digits of precision!
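The addition examples in this section, including the truncation and underflow errors, can be reproduced with a small simulation of our hypothetical 3 digit computer. This is only a sketch under stated assumptions: it uses Python's decimal module, handles positive operands only, and the names normalize, truncate and add are ours rather than part of any real floating point library.

```python
from decimal import Decimal, ROUND_DOWN

DIGITS = 3  # decimal digits of precision in the imaginary format


def truncate(significand):
    """Drop every digit beyond DIGITS places after the decimal point."""
    return significand.quantize(Decimal(1).scaleb(-DIGITS), rounding=ROUND_DOWN)


def normalize(value):
    """Return (significand, exponent) with .1 <= significand < 1."""
    value = Decimal(value)
    if value == 0:
        return Decimal(0), 0
    exponent = value.adjusted() + 1
    return truncate(value.scaleb(-exponent)), exponent


def add(a, b):
    (sa, ea), (sb, eb) = normalize(a), normalize(b)
    exponent = max(ea, eb)
    # Denormalize the operand with the smaller exponent, losing digits if needed.
    sa = truncate(sa.scaleb(ea - exponent))
    sb = truncate(sb.scaleb(eb - exponent))
    return normalize((sa + sb).scaleb(exponent))


for a, b in (("122", "12"), ("1220", "14"), ("1220", "1.4")):
    significand, exponent = add(a, b)
    print(f"{a} + {b} = {significand} * 10^{exponent}")
print("1273 is stored as", normalize("1273"))
```

The second sum comes out as .123 * 10^4 (one digit of 14 was truncated), and the third as .122 * 10^4 (the operand 1.4 was denormalized to zero).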
.125 * 10^3 * .21 * 10^2.
Now the convenience of exponential notation for multiplication has not been lost on the computer
architects (this is why they choose exponential representations in the first place!). Since
x * 10^a * y * 10^b = (x * y) * 10^(a + b),
the product is computed by us as
(.125 * .21) * 10^5 = .02625 * 10^5
but because of the finite precision inherent in the computer (which here has only 3 digits of precision),
the result is truncated (before normalization!) to
.026 * 10^5
and is then normalized to
.26 * 10^4.
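The multiplication just described (multiply the significands, add the exponents, truncate, then normalize) can be sketched in Python as follows. As before, this is an illustration of our hypothetical 3 digit computer built on the decimal module, not a description of real hardware, and the function names are our own.

```python
from decimal import Decimal, ROUND_DOWN

DIGITS = 3  # decimal digits of precision in the imaginary format


def truncate(significand):
    """Drop every digit beyond DIGITS places after the decimal point."""
    return significand.quantize(Decimal(1).scaleb(-DIGITS), rounding=ROUND_DOWN)


def normalize(value):
    """Return (significand, exponent) with .1 <= significand < 1."""
    value = Decimal(value)
    if value == 0:
        return Decimal(0), 0
    exponent = value.adjusted() + 1
    return truncate(value.scaleb(-exponent)), exponent


def multiply(a, b):
    (sa, ea), (sb, eb) = normalize(a), normalize(b)
    product = truncate(sa * sb)   # truncated before normalization
    return normalize(product.scaleb(ea + eb))


significand, exponent = multiply("125", "21")
print(f"125 * 21 = {significand} * 10^{exponent}")
```

The exact product is 2,625, but the simulated result is .26 * 10^4, or 2,600.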
In general, in order to perform any floating point arithmetic operation, the computer must:
The step which actually performs the operation can result in another kind of error: overflows
can occur in floating point arithmetic as well as in fixed, but they are
detected in the exponent rather than the significand. If we attempt to multiply 2 * 10^7 times
7 * 10^7, the normalized product is
.2 * 10^8 * .7 * 10^8 = .14 * 10^16;
but our imaginary computer can only represent numbers with exponents up to 15!
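A sketch of how such an overflow might be detected in the exponent is given below. It takes operands that are already normalized as (significand, exponent) pairs, omits the truncation step for brevity, and uses the hypothetical name multiply_checked; the limit of 15 is the largest exponent of our imaginary format.

```python
from decimal import Decimal

MAX_EXPONENT = 15  # largest exponent in the imaginary format


def multiply_checked(a, b):
    """Multiply two normalized (significand, exponent) pairs, checking the exponent."""
    (sa, ea), (sb, eb) = a, b
    significand, exponent = sa * sb, ea + eb
    # Renormalize: shift left until there is no leading zero after the point.
    while significand != 0 and abs(significand) < Decimal("0.1"):
        significand *= 10
        exponent -= 1
    if exponent > MAX_EXPONENT:
        raise OverflowError(f"exponent {exponent} exceeds {MAX_EXPONENT}")
    return significand, exponent


try:
    multiply_checked((Decimal("0.2"), 8), (Decimal("0.7"), 8))
except OverflowError as error:
    print("overflow:", error)
```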