What are floating-point binary numbers?

Computers store real numbers as patterns of zeros and ones using a floating-point representation. In virtually all modern computer systems, floating-point numbers follow the IEEE 754 standard.

According to the IEEE 754 standard, a floating-point number has three components:

  • Base: This is also known as radix. Its value is 2 for binary and 10 for decimal. Here, we'll restrict ourselves to binary.

  • Precision: This is the number of significant digits. These digits are bits in the case of binary.

  • Exponent range: This is specified by the number of exponent bits.
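As a quick way to see these three parameters in practice, Python's sys.float_info exposes the base, precision, and exponent range of the interpreter's built-in float, which is an IEEE 754 double on virtually all platforms (this is an illustrative aside, not part of the lesson's own code):

```python
import sys

# Python's built-in float is an IEEE 754 double on virtually all platforms.
# sys.float_info exposes its base, precision, and exponent range.
info = sys.float_info
print(info.radix)     # base (radix): 2
print(info.mant_dig)  # precision in bits: 53 (52 stored + 1 implicit)
print(info.max_exp)   # exponent upper bound
print(info.min_exp)   # exponent lower bound
```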

Half-precision (16 bits)

In half-precision, we have 16 bits, and they are divided in the following way:

  • Sign bit (1): A sign bit represents whether a number is positive (0) or negative (1).

  • Exponent bits (5): Exponent bits represent the power of 2 by which the mantissa will be multiplied.

  • Mantissa bits (10): Mantissa bits are the fractional part of the binary number expressed in scientific notation. Since 1 will always be the leading digit when a nonzero binary number is written in scientific notation, we do not need to store it. That's why we only concern ourselves with the fractional part.

Note: Some authors use the term fraction bits instead of mantissa bits.

Exponents can be positive or negative, so part of the exponent range is reserved for negative values. We can calculate the exponent range with the following formulas:

e_max = 2^(exp bits - 1) - 1 and e_min = 1 - e_max

This means that for half-precision, e_max will be 2^4 - 1 = 15 and e_min will be -14.
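This relationship is easy to check in code. The sketch below, with a hypothetical helper name exponent_range, computes the range from the number of exponent bits:

```python
def exponent_range(exp_bits):
    """Return (e_min, e_max) for a given number of exponent bits,
    using e_max = 2**(exp_bits - 1) - 1 and e_min = 1 - e_max."""
    e_max = 2 ** (exp_bits - 1) - 1
    e_min = 1 - e_max
    return e_min, e_max

print(exponent_range(5))  # half precision:   (-14, 15)
print(exponent_range(8))  # single precision: (-126, 127)
```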

[Figure: Half-precision bit layout]

Example


Let's take the number 4.25 and convert it into half-precision.

  • Step 1: In the first step, we convert the integer part (4) and the fractional part (0.25) to binary separately and write them together. 4 is 100 in binary, and 0.25 is .01, so we get 100.01.

  • Step 2: Now, we need to represent this number in scientific notation, which is 1.0001 × 2^2. Here, we have moved the binary point two steps leftwards. Each step toward the left increases the exponent by one, so our exponent is 2 (a factor of 2^2 = 4).

For steps towards the right, we divide by 2.

  • Step 3: This exponent needs to be stored in the exponent bits. To do so, we add the bias to the exponent. The bias can be calculated as 2^(exp bits - 1) - 1, which is 2^4 - 1 = 15 for half-precision. The stored exponent is therefore 15 + 2 = 17, or 10001 in binary.

  • Step 4: Now, it's time to put it all together. Since the number is positive, the sign bit is 0. The mantissa is the fractional part 0001, and we append zeros on the right to fill all 10 mantissa bits: 0001000000.

[Figure: 4.25 represented in half-precision]
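We can verify these steps with Python's struct module, which supports half precision via the 'e' format code (Python 3.6 and later). This is a sketch, with a hypothetical helper name half_bits, for checking our hand conversion of 4.25:

```python
import struct

def half_bits(num):
    """Pack num as an IEEE 754 half-precision float ('e' format)
    and return its 16 bits as a string of '0' and '1' characters."""
    (raw,) = struct.unpack('!H', struct.pack('!e', num))
    return f'{raw:016b}'

bits = half_bits(4.25)
# Split into the three fields: 1 sign bit, 5 exponent bits, 10 mantissa bits.
sign, exponent, mantissa = bits[0], bits[1:6], bits[6:]
print(sign, exponent, mantissa)  # 0 10001 0001000000
```

The stored exponent 10001 is 17, i.e., our exponent 2 plus the bias 15, matching the steps above.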

Single precision (32 bits)

Half precision offers limited range and precision, so single precision is far more common in general-purpose computing. The concept of conversion, however, is still the same. These are the only changes when we talk about single precision:

  • Sign bit: 1

  • Exponent bits: 8

  • Mantissa bits: 23

The exponent range for single precision is e_max = 127 and e_min = -126.
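To see the single-precision layout in practice, the following sketch (with a hypothetical helper name, float32_fields) splits a number's 32-bit encoding into its sign, exponent, and mantissa fields:

```python
import struct

def float32_fields(num):
    """Split a number's IEEE 754 single-precision encoding into its
    sign (1 bit), exponent (8 bits), and mantissa (23 bits) fields."""
    (raw,) = struct.unpack('!I', struct.pack('!f', num))
    bits = f'{raw:032b}'
    return bits[0], bits[1:9], bits[9:]

sign, exponent, mantissa = float32_fields(4.25)
print(sign)      # '0'
print(exponent)  # '10000001' (biased: 2 + 127 = 129)
print(mantissa)  # '0001' followed by nineteen zeros
```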

Implementation

In Python, we can use the struct module from the standard library for float-to-binary conversion.

import struct

def float_to_binary(num):
    res = ''
    # '!f' packs num as a big-endian IEEE 754 single-precision float (4 bytes)
    for byte in struct.pack('!f', num):
        # format each byte as 8 binary digits, padded with zeros on the left
        res += f'{byte:0>8b}'
    return res

print(float_to_binary(4.25))

Explanation

  • The struct.pack() function converts the given values into a bytes object according to a format string, which is its first argument. Here, the format '!f' packs a single number as a big-endian single-precision float, so the call returns the four bytes of the IEEE 754 encoding of num.

  • Each byte of the result then needs to be formatted. We use an f-string, which allows us to format values in a compact manner. The format spec 0>8b loosely translates to "convert the number into binary and represent it in eight characters, padding with zeros on the left if needed."
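For completeness, the conversion can also be reversed. This sketch (a hypothetical binary_to_float helper, not part of the code above) rebuilds the float from its 32-bit string:

```python
import struct

def binary_to_float(bits):
    """Inverse of float_to_binary: rebuild the float from its 32 bits."""
    raw = int(bits, 2)
    # pack the integer back into 4 bytes, then reinterpret them as a float
    (num,) = struct.unpack('!f', struct.pack('!I', raw))
    return num

print(binary_to_float('01000000100010000000000000000000'))  # 4.25
```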


Copyright ©2024 Educative, Inc. All rights reserved