Introduction
In this chapter we consider how numbers are represented on a computer, largely with respect to the errors that occur when basic arithmetic operations are performed on them. We are most interested here in so-called rounding errors (also called roundoff errors). Floating-point computation is emphasized because most numerical computation is performed with floating-point numbers, especially when numerical methods are implemented in high-level programming languages such as C, Pascal, FORTRAN, and C++. However, an understanding of floating-point representation requires some understanding of fixed-point schemes, and so the fixed-point case is considered first. In addition, fixed-point schemes are used to represent integer data (i.e., subsets of Z), and so the fixed-point representation is important in its own right. For example, the exponent in a floating-point number is an integer.
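As a small illustration of a rounding error, the following Python sketch adds two floating-point numbers and measures how far the computed sum lies from the exact sum of the stored operands. It is not taken from the text: the helper name rounding_error is hypothetical, and the sketch assumes that Python floats are IEEE 754 double-precision numbers, as they are on common platforms.

```python
from fractions import Fraction


def rounding_error(a, b):
    """Return (computed sum) - (exact sum of the stored values) as an exact rational.

    Hypothetical helper for illustration only. Fraction(x) converts a float
    to the exact rational number that the float actually stores.
    """
    computed = Fraction(a + b)           # the sum after rounding to a float
    exact = Fraction(a) + Fraction(b)    # the sum with no rounding at all
    return computed - exact


err = rounding_error(0.1, 0.2)
print(0.1 + 0.2 == 0.3)   # False: the computed sum is not the float nearest 0.3
print(err != 0)           # True: the addition itself introduced a rounding error
print(float(err))         # size of that error, small compared with the sum
```

Note that two effects are visible here. The decimal values 0.1 and 0.2 are already rounded when they are stored, and the addition then rounds once more; the helper isolates only the second effect.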
Fixed-Point Representations
We now consider fixed-point fractions, because the mantissa in a floating-point number is a fixed-point fraction.
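The claim that a floating-point number is built from a fixed-point fraction and an integer exponent can be checked with a short Python sketch. The function name split_float is hypothetical, and the sketch assumes IEEE 754 double precision with a 53-bit mantissa. The normalization used by math.frexp (a fraction with magnitude in [0.5, 1)) is Python's convention and may differ from the one adopted later in this chapter.

```python
import math


def split_float(x):
    """Split x into (n, e) such that x == n * 2**(e - 53), with n and e integers.

    Hypothetical helper for illustration; assumes a 53-bit mantissa.
    """
    m, e = math.frexp(x)    # x == m * 2**e, with 0.5 <= |m| < 1 for nonzero finite x
    n = int(m * 2**53)      # scaling by a power of two is exact, so n is the
                            # mantissa's bit pattern read as an integer
    return n, e


m, e = math.frexp(0.15625)
print(m, e)                 # 0.625 -2, since 0.15625 == 0.625 * 2**-2

n, e = split_float(0.1)
print(math.ldexp(n, e - 53) == 0.1)   # True: fraction and exponent recover x exactly
```

Because m times 2**53 is always an integer, the mantissa is a fraction with a fixed number of bits after the binary point, which is what makes it a fixed-point fraction.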
We assume that fractions are t + 1 digits long. If the number is in binary, then we usually say “t + 1 bits” long instead. Suppose, then, that x is a (t + 1)-bit fraction. We shall write it in the form