Introduction

With the advent of digital computation, one of the first applications of programs was to solve problems within the scientific computing field. Scientific computing focuses on operations with real numbers. Unfortunately, perfectly representing any arbitrary real number within a machine architecture is impossible. As a result, numerous real number representation schemes have been proposed and implemented to ensure such representations are as accurate as possible.

Floating-point numbers are a data type that represents real numbers in machine architecture. Understanding the detailed mechanics of floating-point numbers is not necessary for unit testing. However, it is vital to understand the potential pitfalls when dealing with floating-point numbers, so that when one of our floating-point assertions doesn’t pass, we will know how to correct it.

The non-precise nature of floating-point types

The floating-point numeric types all represent real numbers. Real numbers are those that can assume any value along a continuous number line. A real number can lie anywhere between negative and positive infinity: its whole part may be arbitrarily large, and its fractional part may be arbitrarily long. Examples include the following:

1

5.5

42

6.787676410059283745

7.25

$\sqrt{2}$

$\pi = 3.14159...$

$e = 2.71828...$

The last two numbers are π and Euler’s number, respectively. If real numbers can assume any value and floating-point numbers are machine representations of real numbers, it is impossible to guarantee exact storage of an arbitrary real number in memory. This is because real numbers, by definition, can have arbitrarily large whole parts and infinitely long fractional parts, whereas computer memory is physical hardware (RAM) with finite capacity. Furthermore, even a number with very few decimal digits, such as 0.1, cannot be stored exactly as a binary floating-point number.

Several different representations of real numbers have been proposed. However, the most widely used is the floating-point representation. Floating-point representations have a base $\beta$ (which is always assumed to be even) and a precision $p$. If $\beta = 10$ and $p = 3$, the number 0.1 is represented as $1.00 \times 10^{-1}$. If $\beta = 2$ and $p = 24$, the decimal number 0.1 cannot be represented exactly, but is approximately $1.10011001100110011001101 \times 2^{-4}$.

– “What every computer scientist should know about floating-point arithmetic”, ACM Computing Surveys (CSUR), 23(1), 5-48, Goldberg, D. (1991)

This is a problem because floating-point numeric types are stored as binary floating-point numbers, and a decimal fraction such as 0.1 has no finite binary expansion. For further reading, you may read this excellent Educative interactive lesson on the topic.
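The quoted approximation can be checked directly. The following is a minimal sketch in Python, which is an assumption since the lesson does not name a language here. It prints the exact value stored for 0.1 in a 64-bit float and compares Goldberg's 24-bit significand with what a 32-bit float holds. The variable names are illustrative only.

```python
import struct
from decimal import Decimal
from fractions import Fraction

# Exact decimal expansion of the value stored for the literal 0.1 (64-bit float).
stored_double = Decimal(0.1)

# Goldberg's 24-bit significand for 0.1 (beta = 2, p = 24), scaled by 2**-4.
significand = int("110011001100110011001101", 2)
goldberg_approx = Fraction(significand, 2**23) * Fraction(1, 2**4)

# Round 0.1 to a 32-bit float, then read that value back as an exact fraction.
stored_single = Fraction(struct.unpack("f", struct.pack("f", 0.1))[0])

print(stored_double)                     # the digits after 0.1 are the storage error
print(stored_single == goldberg_approx)  # True
print(Fraction(0.1) == Fraction(1, 10))  # False
```

Neither the 32-bit nor the 64-bit value equals one tenth. Each is simply the nearest representable binary fraction at its precision.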

Using real numbers

The problem of representing real numbers as floating-point types has implications for code used in scientific computing, which deals with real numbers all the time. It’s important to note, however, that just because application code deals with decimal points does not imply it needs floating-point numbers. One may easily model a bank balance using a decimal type, because a bank balance has a finite whole part and exactly two digits in its fractional part.
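As a rough illustration of the bank balance case, the sketch below uses Python's standard `decimal` module as a stand-in for "the decimal type". That choice is an assumption, and the deposit amounts are hypothetical. It adds three deposits of 0.10 using both a binary float and a decimal value.

```python
from decimal import Decimal

# Hypothetical balance and deposits, for illustration only.
balance_float = 0.0
balance_decimal = Decimal("0.00")

for _ in range(3):
    balance_float += 0.10               # binary floating-point addition
    balance_decimal += Decimal("0.10")  # exact base-10 addition

print(balance_float == 0.30)               # False
print(balance_decimal == Decimal("0.30"))  # True
```

The decimal values are constructed from strings. Writing `Decimal(0.10)` would carry the binary rounding error into the decimal value.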

When floating-point numeric types create problems

Two problems may arise when working with floating-point numbers. These are outlined below.

Operations on floating-point numeric types

Each mathematical operation on floating-point numeric types can introduce a small rounding error, and chaining operations may compound these into more significant errors. These are called unstable rounding errors: the rounding error is magnified with each calculation step or iteration of an algorithm.
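One simple way to observe error growth is repeated addition. This is only a sketch under assumptions: the helper `accumulate` is a hypothetical name, and plain accumulation is a mild example compared with the unstable algorithms the text has in mind. It compares the error after 10 additions of 0.1 with the error after one million.

```python
def accumulate(step: float, count: int) -> float:
    """Add `step` to a running total `count` times using plain float addition."""
    total = 0.0
    for _ in range(count):
        total += step
    return total

# Absolute error against the mathematically exact totals.
error_small = abs(accumulate(0.1, 10) - 1.0)
error_large = abs(accumulate(0.1, 1_000_000) - 100_000.0)

print(error_small)
print(error_large)
print(error_large > error_small)  # True
```

The error after a million steps is many orders of magnitude larger than the error after ten, even though every individual addition is rounded correctly.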

Change in the execution flow

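If the result of floating-point calculations is input into conditional branches, a program’s execution path may be unintentionally altered. For example, suppose we sum ten increments of 0.1 and then branch on whether the total equals 1. Mathematically, the total should equal 1, which would trigger the first conditional statement. Because of rounding error, the floating-point total differs slightly from 1, so the other branch runs instead.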
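A minimal sketch of that scenario follows, written in Python as an assumption about the language. The names `branch` and `branch_tolerant` are illustrative. The second comparison uses `math.isclose` to show one common way to make such a check robust, for instance in a unit test assertion.

```python
import math

# Sum ten increments of 0.1.
total = 0.0
for _ in range(10):
    total += 0.1

# Exact comparison: the first branch is the one we intuitively expect to run.
if total == 1.0:
    branch = "equal to one"
else:
    branch = "not equal to one"

# Tolerance-based comparison restores the intended execution path.
if math.isclose(total, 1.0, rel_tol=1e-9):
    branch_tolerant = "equal to one"
else:
    branch_tolerant = "not equal to one"

print(total == 1.0)     # False
print(branch)           # not equal to one
print(branch_tolerant)  # equal to one
```

The tolerance of `1e-9` is an arbitrary choice for this sketch. A suitable value depends on the magnitude of the numbers involved and the accuracy the application needs.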
