Basics of Floating–Point Quantization

Bernard Widrow; István Kollár

doi:10.1017/CBO9780511754661.014

12 - Basics of Floating–Point Quantization

from Part III - Floating–Point Quantization

Published online by Cambridge University Press: 06 July 2010

Bernard Widrow, Stanford University, California
István Kollár, Budapest University of Technology and Economics

Summary

Representation of physical quantities in terms of floating–point numbers allows one to cover a very wide dynamic range with a relatively small number of digits. Given this type of representation, roundoff errors are roughly proportional to the amplitude of the represented quantity. In contrast, roundoff errors with uniform quantization are bounded by ±q/2, where q is the quantization step size, and are not in any way proportional to the represented quantity.
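The contrast between the two kinds of roundoff error can be checked numerically. The following is a minimal sketch, not the authors' model: `uniform_quantize` and `float_quantize` are hypothetical helper names, the step size `q = 0.25` and the mantissa width `p = 4` bits are arbitrary choices for illustration, and the exponent range is assumed to be unlimited.

```python
import math

def uniform_quantize(x, q):
    """Round x to the nearest multiple of the step size q."""
    return q * round(x / q)

def float_quantize(x, p):
    """Round x to p significant bits (assumed: no exponent limits)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** p
    return math.ldexp(round(m * scale) / scale, e)

q, p = 0.25, 4                        # illustrative values, not from the text
for x in (0.3, 3.0, 300.0, 30000.0):
    err_uniform = abs(uniform_quantize(x, q) - x)
    err_float = abs(float_quantize(x, p) - x)
    # Uniform error stays below q/2; floating-point error grows with x,
    # but its ratio to x stays below 2**-p.
    print(f"x={x:>8}: uniform error={err_uniform:.4f}, "
          f"float error={err_float:.4f}, float relative={err_float / x:.4f}")
```

In this sketch the uniform quantizer's absolute error never exceeds q/2 regardless of the input size, while the floating–point quantizer's absolute error scales with the input and only its relative error stays bounded.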

Floating–point is in most cases so advantageous over fixed–point number representation that it is rapidly becoming ubiquitous. The movement toward usage of floating–point numbers is accelerating as the speed of floating–point calculation is increasing and the cost of implementation is going down. For this reason, it is essential to have a method of analysis for floating–point quantization and floating–point arithmetic.

THE FLOATING–POINT QUANTIZER

Binary numbers have become accepted as the basis for all digital computation. We therefore describe floating–point representation in terms of binary numbers. Other number bases, such as base 10 or base 16, are also possible, but modern digital hardware is built on the binary base.

The numbers in the table of Fig. 12.1 are chosen to provide a simple example. We begin by counting with nonnegative binary floating–point numbers as illustrated in Fig. 12.1. The counting starts with the number 0, represented here by 00000. Each number is multiplied by 2^E, where E is an exponent. Initially, let E = 0. Continuing the count, the next number is 1, represented by 00001, and so forth.
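The start of this count can be sketched in a few lines. This is only an illustration under stated assumptions: `fp_value` is a hypothetical helper, the five-digit width is taken from the patterns 00000 and 00001 quoted above, and nothing here reproduces the table of Fig. 12.1, which is not part of this excerpt.

```python
def fp_value(mantissa_bits, exponent=0):
    """Value of a binary mantissa pattern multiplied by 2**exponent."""
    return int(mantissa_bits, 2) * 2 ** exponent

# Count upward from zero with E = 0, as described in the text.
for n in range(4):
    bits = format(n, "05b")           # five binary digits, e.g. "00001"
    print(bits, "->", fp_value(bits))
```

With E = 0 the patterns 00000 and 00001 evaluate to 0 and 1, matching the text; passing a different `exponent` scales the same pattern by the corresponding power of two.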

Type: Chapter
Information: Quantization Noise: Roundoff Error in Digital Computation, Signal Processing, Control, and Communications, pp. 257-306

DOI: https://doi.org/10.1017/CBO9780511754661.014

Publisher: Cambridge University Press

Print publication year: 2008
