Computers process floating-point numbers, in the form of zeros and ones, by storing them in floating-point representations. Floating-point numbers are represented in the computer systems using the IEEE 754 standard.
According to the IEEE 754 standard, a floating-point number has three components:
Base: This is also known as radix. Its value is 2 for binary and 10 for decimal. Here, we'll restrict ourselves to binary.
Precision: This is the number of significant digits. These digits are bits in the case of binary.
Exponent range: This is specified by the number of exponent bits.
In half-precision, we have 16 bits, and they are divided in the following way:
Sign bit (1): A sign bit represents whether a number is positive (0) or negative (1)
Exponent bits (5): Exponent bits represent the power of 2 that will be multiplied with the mantissa.
Mantissa bits (10): Mantissa bits are the fractional part of the binary number expressed in scientific notation. Since only
Note: Some authors use the term fraction bits instead of mantissa bits.
We reserve a bit in exponent bits since exponents can be positive or negative. We can calculate the exponent range with the following formula:
It means that for half-precision,
Let's take the number
Step 1: In the first step, we convert the integer part (
Step 2: Now, we need to represent this number in scientific notation, which is
For steps towards the right, we divide by 2.
Step 3: This exponent needs to be represented in exponent bits. We need to add the exponent in the bias. The bias can be calculated by
Step 4: Now, it's time to put it all together. Since the number is positive, the sign bit will be 0. To fill up the space, we append zeros to the mantissa.
Half precision is quite an outdated version and rarely used. However, the concept of conversion is still the same. These are the only changes when we talk about single precision:
Sign bit: 1
Exponent bits: 8
Mantissa bits: 23
The exponent range for single precision is
In Python, we can use the struct
library for the float to binary conversion.
import structdef float_to_binary(num):res = ''for i in struct.pack('!f', num):res += f'{i:0>8b}'return resprint(float_to_binary(4.25))
Line 5: Here, we use the struct.pack()
function. The purpose of this function is to convert a given list of values into their corresponding string representation by using a format string. This is the first argument in the function. Here, instead of a list of values, we only convert one number.
Line 6: The result of the function needs to be properly formatted. Here, we use f-strings that allow us to format strings in a compact manner. The format for this string is 0>8b
, which loosely translates to "convert the number into binary and represent it in eight characters. If needed, append zeros in the front."
Free Resources