Floating-Point Representation in Java

May 4, 2021

Overview

Floating-point representation is used to store numbers with a fractional part in computer memory. The most widely used format is the one defined by the IEEE-754 standard, which splits each number into three basic components.

Sign bit: A single bit that represents the sign of the floating-point number.
Exponent: A field of bits that represents both positive and negative exponents, stored with a bias added.
Mantissa: A field of bits that represents the fractional part of the normalized binary number (the bits after the leading 1).
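As a rough illustration of these three components, the sketch below pulls them out of a float with Float.floatToIntBits, assuming the single-precision field widths of 1, 8, and 23 bits. The class name FloatFields and its helper methods are hypothetical names chosen for this example.

```java
class FloatFields {
    // Top bit of the 32-bit pattern: 0 for positive, 1 for negative.
    static int signBit(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Next 8 bits: the exponent with the bias still included.
    static int exponentField(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Low 23 bits: the fractional part of the normalized binary number.
    static int mantissaField(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float value = -6.5f; // -1.101 (binary) x 2^2
        System.out.println("sign     = " + signBit(value));
        System.out.println("exponent = " + exponentField(value));
        System.out.println("mantissa = " + Integer.toBinaryString(mantissaField(value)));
    }
}
```

For -6.5f the sign is 1 and the exponent field is 129, that is, the real exponent 2 plus the single-precision bias of 127.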

Floating-Point Representation in Java

In Java there are two primitive data types for floating-point numbers:

float: The float data type occupies 32 bits (4 bytes) in memory (one word in the JVM). Its IEEE-754 representation is known as the single-precision representation: 1 sign bit, 8 exponent bits, and 23 mantissa bits. Figure 1 shows the IEEE-754 representation of the float data type in Java.

Figure 1. Standard Single-Precision Representation.

double: The double data type occupies 64 bits (8 bytes) in memory (two words in the JVM). Its IEEE-754 representation is known as the double-precision representation: 1 sign bit, 11 exponent bits, and 52 mantissa bits. Figure 2 shows the IEEE-754 representation of the double data type in Java.

Figure 2. Standard Double-Precision Representation.
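The sizes of the two types can be checked from Java itself. The snippet below is a small sketch that prints the constants Float.SIZE, Float.BYTES, Double.SIZE, and Double.BYTES; the exponent widths of 8 and 11 bits are taken from the IEEE-754 layout, and the class name FieldWidths and the helper mantissaBits are hypothetical names used only here.

```java
class FieldWidths {
    // Mantissa width = total width - 1 sign bit - exponent width.
    static int mantissaBits(int totalBits, int exponentBits) {
        return totalBits - 1 - exponentBits;
    }

    public static void main(String[] args) {
        System.out.println("float : " + Float.SIZE + " bits, " + Float.BYTES + " bytes, "
                + mantissaBits(Float.SIZE, 8) + " mantissa bits");
        System.out.println("double: " + Double.SIZE + " bits, " + Double.BYTES + " bytes, "
                + mantissaBits(Double.SIZE, 11) + " mantissa bits");
    }
}
```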

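The following example explains the IEEE-754 representation using 20.1. The integer part (20) and the fractional part (0.1) are converted to binary, combined, and then normalized so that a single 1 remains before the binary point.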

20: 10100

0.1: 0.00011001100110011001100110011001100110011…………

20.1: 10100.00011001100110011001100110011001100110011…………

20.1: 1.01000001100110011001100110011001100110011………… × 2^4
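The conversion above can be reproduced with a short program. This is a minimal sketch, assuming the usual repeated-doubling method for the fractional part; BigDecimal is used so that 0.1 is handled exactly, and the class name BinaryExpansion and the method fractionBits are hypothetical names for this example.

```java
import java.math.BigDecimal;

class BinaryExpansion {
    // Returns the first `digits` binary digits of a decimal fraction in [0, 1).
    // Each step doubles the fraction; the integer part that appears is the next bit.
    static String fractionBits(String decimalFraction, int digits) {
        BigDecimal x = new BigDecimal(decimalFraction);
        BigDecimal two = BigDecimal.valueOf(2);
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < digits; i++) {
            x = x.multiply(two);
            if (x.compareTo(BigDecimal.ONE) >= 0) {
                bits.append('1');
                x = x.subtract(BigDecimal.ONE);
            } else {
                bits.append('0');
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println("20   = " + Integer.toBinaryString(20));
        System.out.println("0.1  = 0." + fractionBits("0.1", 24) + "...");
        System.out.println("20.1 = " + Integer.toBinaryString(20) + "."
                + fractionBits("0.1", 24) + "...");
        // Exponent of the normalized form 1.xxx x 2^e
        System.out.println("exponent = " + Math.getExponent(20.1));
    }
}
```

The pattern 0011 repeats forever, which is why 0.1 (and therefore 20.1) has no finite binary representation.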

Example: Single-precision representation of 20.1.

Biased exponent: 127 + 4 = 131

131: 10000011

The IEEE-754 single-precision representation (sign bit, 8-bit biased exponent, and the first 23 bits after the binary point):

0 10000011 01000001100110011001100
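To see what this bit pattern stands for, it can be decoded by hand. The sketch below assumes the usual formula for normalized single-precision values, (-1)^sign × (1 + mantissa / 2^23) × 2^(exponent - 127); the class name DecodeSingle and the method decode are hypothetical names for this example.

```java
class DecodeSingle {
    // Decodes a normalized single-precision pattern given as three binary strings.
    static double decode(String sign, String exponent, String mantissa) {
        int s = Integer.parseInt(sign, 2);
        int e = Integer.parseInt(exponent, 2);
        int m = Integer.parseInt(mantissa, 2);
        double significand = 1.0 + m / (double) (1 << 23); // restore the implicit leading 1
        double value = significand * Math.pow(2, e - 127); // remove the bias of 127
        return s == 0 ? value : -value;
    }

    public static void main(String[] args) {
        // The pattern written above, with the mantissa cut off after 23 bits.
        System.out.println(decode("0", "10000011", "01000001100110011001100"));
    }
}
```

Because the mantissa was simply cut off after 23 bits, the decoded value comes out very slightly below 20.1.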

Example: Double-precision representation of 20.1.

Biased exponent: 1023 + 4 = 1027

1027: 10000000011

The IEEE-754 double-precision representation (sign bit, 11-bit biased exponent, and the first 52 bits after the binary point):

0 10000000011 0100000110011001100110011001100110011001100110011001

Note: The sign bit is 0 for positive floating-point numbers and 1 for negative ones. The stored exponent is the actual exponent plus a bias: 127 for single precision and 1023 for double precision.
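The bias arithmetic can be checked with Math.getExponent, which returns the unbiased exponent of a float or double. The following is a small sketch; the class name ExponentBias, the constants FLOAT_BIAS and DOUBLE_BIAS, and the method biasedExponentBits are hypothetical names for this example.

```java
class ExponentBias {
    static final int FLOAT_BIAS = 127;
    static final int DOUBLE_BIAS = 1023;

    // Biased exponent of a normal float, as a binary string (no left padding).
    static String biasedExponentBits(float f) {
        return Integer.toBinaryString(Math.getExponent(f) + FLOAT_BIAS);
    }

    // Biased exponent of a normal double, as a binary string (no left padding).
    static String biasedExponentBits(double d) {
        return Integer.toBinaryString(Math.getExponent(d) + DOUBLE_BIAS);
    }

    public static void main(String[] args) {
        System.out.println(biasedExponentBits(20.1f)); // 4 + 127 = 131 -> 10000011
        System.out.println(biasedExponentBits(20.1));  // 4 + 1023 = 1027 -> 10000000011
    }
}
```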

In Java, however, the single- and double-precision values actually stored in memory are not identical to the bit patterns obtained above. Refer to Figure 3.

Figure 3. Exact and Rounded Representations of Single and Double Precision.

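This difference is caused by rounding. The binary expansion of 20.1 never terminates, so it has to be rounded to fit the available mantissa bits. This behaviour comes from IEEE-754 itself and is not specific to Java.

In this example, the bit immediately after the last mantissa bit is 1, so 1 is added to the last mantissa bit; if it were 0, the mantissa would stay the same. (The general IEEE-754 rule is round to nearest, with exact ties going to the even mantissa.) Refer to Figure 4.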

Figure 4. Exact and Rounded Representations of Single and Double Precision in Detail.
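The stored mantissas can be printed directly and compared with the hand-derived ones. This is a minimal sketch using Float.floatToIntBits and Double.doubleToLongBits; the class name StoredBits and the method mantissa are hypothetical names for this example.

```java
class StoredBits {
    // The 23 mantissa bits of a float, left-padded with zeros.
    static String mantissa(float f) {
        int bits = Float.floatToIntBits(f) & 0x7FFFFF;
        return String.format("%23s", Integer.toBinaryString(bits)).replace(' ', '0');
    }

    // The 52 mantissa bits of a double, left-padded with zeros.
    static String mantissa(double d) {
        long bits = Double.doubleToLongBits(d) & 0xFFFFFFFFFFFFFL;
        return String.format("%52s", Long.toBinaryString(bits)).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println("float, cut off: 01000001100110011001100");
        System.out.println("float, stored : " + mantissa(20.1f));
        System.out.println("double, cut off: 0100000110011001100110011001100110011001100110011001");
        System.out.println("double, stored : " + mantissa(20.1));
    }
}
```

In both cases only the tail of the mantissa changes: the stored float mantissa ends in 1101 instead of 1100, and the stored double mantissa ends in 1010 instead of 1001.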

We can avoid this rounding error by using BigDecimal. Figure 5 shows the difference in output between the double data type and BigDecimal.

Figure 5. Output Difference of double and BigDecimal.
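A comparable difference can be reproduced with a few lines of code. The example below is only a sketch and is not necessarily the program behind the figure; the values 0.1 and 0.2 and the class name DoubleVsBigDecimal are assumptions made for illustration. Note that the BigDecimal values are built from strings, since new BigDecimal(double) would copy the already-rounded binary value.

```java
import java.math.BigDecimal;

class DoubleVsBigDecimal {
    // Sum computed in binary floating point.
    static double sumDouble() {
        return 0.1 + 0.2;
    }

    // Sum computed in exact decimal arithmetic.
    static BigDecimal sumDecimal() {
        return new BigDecimal("0.1").add(new BigDecimal("0.2"));
    }

    public static void main(String[] args) {
        System.out.println(sumDouble());  // 0.30000000000000004
        System.out.println(sumDecimal()); // 0.3
        // Passing a double shows the exact value the double really holds.
        System.out.println(new BigDecimal(20.1));
    }
}
```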

BigDecimal Class

BigDecimal is a Java class, declared in the java.math package, that extends java.lang.Number. It provides methods for arbitrary-precision decimal arithmetic, which avoids the rounding behaviour of the float and double data types. Some of the methods declared in the BigDecimal class are described below.

arg1.doubleValue()

Converts arg1 BigDecimal to a double. Precision may be lost if the value has no exact double representation.

arg1.intValue()

Converts arg1 BigDecimal to an int, discarding any fractional part.

arg1.longValue()

Converts arg1 BigDecimal to a long, discarding any fractional part.

arg1.equals(Object x)

Compares arg1 BigDecimal with the specified Object for equality. Both value and scale must match, so 2.0 and 2.00 are not considered equal.

arg1.add(BigDecimal arg2)

Returns a BigDecimal whose value is (arg1 + arg2).

arg1.byteValueExact()

Converts arg1 BigDecimal to a byte, throwing an ArithmeticException if arg1 has a nonzero fractional part or does not fit in a byte.

arg1.max(BigDecimal arg2)

Returns the maximum of arg1 BigDecimal and arg2.

arg1.min(BigDecimal arg2)

Returns the minimum of arg1 BigDecimal and arg2.

arg1.pow(int n)

Returns a BigDecimal whose value is arg1 raised to the nth power.

arg1.remainder(BigDecimal divisor)

Returns a BigDecimal whose value is (arg1 % divisor).

arg1.divide(BigDecimal divisor)

Returns a BigDecimal whose value is (arg1 / divisor). Throws an ArithmeticException if the exact quotient has a non-terminating decimal expansion.

arg1.multiply(BigDecimal multiplicand)

Returns a BigDecimal whose value is (arg1 × multiplicand).

arg1.subtract(BigDecimal arg2)

Returns a BigDecimal whose value is (arg1 - arg2).

arg1.toString()

Returns the string representation of arg1 BigDecimal.
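The short program below exercises several of these methods together. It is a sketch with values chosen for illustration; the class name BigDecimalMethods and the helper divideOrRound are hypothetical, and the fallback to 10 decimal places with RoundingMode.HALF_EVEN is an assumption of this example rather than something BigDecimal does by default.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class BigDecimalMethods {
    // Divides exactly when possible; otherwise rounds the quotient to 10 decimal places.
    static BigDecimal divideOrRound(BigDecimal a, BigDecimal b) {
        try {
            return a.divide(b);
        } catch (ArithmeticException nonTerminating) {
            return a.divide(b, 10, RoundingMode.HALF_EVEN);
        }
    }

    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("20.1");
        BigDecimal b = new BigDecimal("3");
        System.out.println(a.add(b));       // 23.1
        System.out.println(a.subtract(b));  // 17.1
        System.out.println(a.multiply(b));  // 60.3
        System.out.println(a.remainder(b)); // 2.1
        System.out.println(a.pow(2));       // 404.01
        System.out.println(a.max(b));       // 20.1
        System.out.println(a.min(b));       // 3
        System.out.println(a.intValue());   // 20
        System.out.println(divideOrRound(a, b));              // 6.7
        System.out.println(divideOrRound(BigDecimal.ONE, b)); // 0.3333333333
        // equals also compares scale, compareTo does not.
        System.out.println(new BigDecimal("2.0").equals(new BigDecimal("2.00")));    // false
        System.out.println(new BigDecimal("2.0").compareTo(new BigDecimal("2.00"))); // 0
    }
}
```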

Conclusion

To summarize the points discussed above:

The standard IEEE-754 representation is used to represent floating-point numbers.
Because of rounding, the float and double values stored in Java may differ from the exact values they are meant to represent.
The BigDecimal class can be used to avoid this rounding error.

Written by Hasini Sandunika Silva