
Book contents
- Frontmatter
- Contents
- Preface
- Acknowledgments
- Glossary of Symbols
- Acronyms and Abbreviations
- Part I Background
- Part II Uniform Quantization
- Part III Floating–Point Quantization
- 12 Basics of Floating–Point Quantization
- 13 More on Floating–Point Quantization
- 14 Cascades of Fixed–Point and Floating–Point Quantizers
- Part IV Quantization in Signal Processing, Feedback Control, and Computations
- Part V Applications of Quantization Noise Theory
- Part VI Quantization of System Parameters
- APPENDICES
- Bibliography
- Index
12 - Basics of Floating–Point Quantization
from Part III - Floating–Point Quantization
Published online by Cambridge University Press: 06 July 2010
Summary
Representation of physical quantities in terms of floating–point numbers allows one to cover a very wide dynamic range with a relatively small number of digits. Given this type of representation, roundoff errors are roughly proportional to the amplitude of the represented quantity. In contrast, roundoff errors with uniform quantization lie between −q/2 and +q/2, where q is the quantization step size, and are not in any way proportional to the represented quantity.
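The contrast between the two error behaviors can be checked numerically. The following is a minimal sketch, not the book's own model: `uniform_quantize` and `float_quantize` are hypothetical helper names, the step size `q = 0.01` and the 8-bit mantissa are arbitrary choices, and the floating–point helper assumes an unbounded exponent range.

```python
import math

def uniform_quantize(x, q):
    """Round x to the nearest multiple of the step size q."""
    return q * round(x / q)

def float_quantize(x, p):
    """Round x to a p-bit mantissa (assumption: exponent range is unbounded)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** p
    return math.ldexp(round(m * scale) / scale, e)

if __name__ == "__main__":
    q, p = 0.01, 8                  # illustrative values only
    for x in (0.123456, 12.3456, 1234.56):
        err_u = abs(uniform_quantize(x, q) - x)
        err_f = abs(float_quantize(x, p) - x)
        # Uniform error stays within q/2; floating-point error scales with |x|.
        print(f"x={x:<10} uniform error={err_u:.2e}  float error={err_f:.2e}")
```

With these helpers, the uniform error never exceeds q/2 regardless of the input, while the floating–point error is bounded by a fixed fraction of the input magnitude (here 2^-p), so it grows as the input grows.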
Floating–point representation is in most cases so advantageous compared with fixed–point representation that it is rapidly becoming ubiquitous. The move toward floating–point numbers is accelerating as the speed of floating–point calculation increases and the cost of implementation goes down. For this reason, it is essential to have a method of analysis for floating–point quantization and floating–point arithmetic.
THE FLOATING–POINT QUANTIZER
Binary numbers have become accepted as the basis for all digital computation. We therefore describe floating–point representation in terms of binary numbers. Other number bases, such as base 10 or base 16, are entirely possible, but modern digital hardware is built on the binary base.
The numbers in the table of Fig. 12.1 are chosen to provide a simple example. We begin by counting with nonnegative binary floating–point numbers as illustrated in Fig. 12.1. The counting starts with the number 0, represented here by 00000. Each number is multiplied by 2^E, where E is an exponent. Initially, let E = 0. Continuing the count, the next number is 1, represented by 00001, and so forth.
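The start of this count can be sketched in a few lines. This is only an illustration under stated assumptions: `fp_value` is a hypothetical helper, and the mantissa width of 5 is inferred from the five-digit patterns 00000 and 00001 quoted above. Fig. 12.1 itself is not reproduced here.

```python
def fp_value(mantissa_bits, exponent):
    """Value of a nonnegative binary mantissa string multiplied by 2**exponent."""
    return int(mantissa_bits, 2) * 2.0 ** exponent

MANTISSA_WIDTH = 5   # assumption: taken from the patterns 00000, 00001 in the text

if __name__ == "__main__":
    E = 0            # initial exponent, as in the text
    for count in range(4):
        bits = format(count, f"0{MANTISSA_WIDTH}b")   # e.g. 1 -> "00001"
        print(bits, "x 2^%d =" % E, fp_value(bits, E))
```

Calling `fp_value` with a larger exponent on the same mantissa pattern scales the value, and therefore the spacing between adjacent representable numbers, by the corresponding power of two.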
- Quantization Noise: Roundoff Error in Digital Computation, Signal Processing, Control, and Communications, pp. 257-306. Publisher: Cambridge University Press. Print publication year: 2008.