I am constantly amazed at how unfamiliar otherwise competent developers can be with numeric types. Without understanding the representation of numbers used by computers, developers cannot anticipate the issues that may arise and avoid them. Working on the JVM brings additional quirks to be careful of too.
Floating point numbers cause the most confusion, but integer types are often not understood either. In a series of two blog posts I will explain what you need to know to understand the JVM’s numeric types. I’ll point out those JVM specific quirks too.
This first installment covers the JVMs integer types, the next will cover floating point.
Integer types
The integer types are types that represent “whole” numbers e.g. 0
, 1
or 1023
, and not e.g. 1.23
. On the JVM all integer types are signed, i.e. they can represent positive and negative numbers e.g. 4
and -15
.
Number systems
There have been many concepts of “number” throughout time and cultures. Things probably started with people making notches in sticks to represent numbers, but more advanced systems evolved. Some more unusual number systems you may have encountered:
System | Description |
---|---|
Roman numerals | Often seen on monuments and films to specify dates |
Sexagesimal numbers | Initially used by the Sumerians, this system based on symbols for the numbers 1 to 59, i.e. base 60. It still has a legacy on our clocks and the measurement of angles. |
Decimal numbers
When you are taught to do arithmetic at school, you use the Arabic-Hindu number system. This system is based on symbols for the numbers 0 to 9
, i.e. base 10. It has thrived because we have ten fingers, and because it includes the concept of zero.
You are taught to think in terms of columns of units, tens, hundreds, thousands etc.. This is a positional number system. Units can be any number from 0 to 9
. Tens are worth 10 times the value of the column. Hundreds 10 x 10 times the value. Thousands 10 x 10 x 10 times the value. Each column is worth 10 times the previous one.
Thousands | Hundreds | Tens | Units | Total in Decimal |
---|---|---|---|---|
1 | 2 | 0 | 1 | (1 x 1000) + (2 x 100) + (0 x 10) + (1 x 1) = 1201 |
When performing addition, the process can be defined as a series of operations on the columns, from the least significant up to the most significant. First add the units, then add the tens along with any carry over from the units, and repeat the process for the hundreds, thousands etc.
Binary numbers
Computers are digital electronic systems, based around the bit. A bit can be a 0
or a 1
, so computers use a binary number system, i.e. base 2.
It is no more complex than the base 10 system and is simple enough to be covered in GCSE computing.
Binary numbers are no different from decimals both being positional systems, one using a base of 2 and the other 10. In binary the value in each column can range from 0 to 1
, and each column is worth two times the previous one e.g. units, twos, fours, eights.
Eights | Fours | Twos | Ones | Total in Decimal |
---|---|---|---|---|
1 | 0 | 1 | 1 | (1 x 8) + (0 x 4) + (1 x 2) + (1 x 1) = 11 |
Confusion One - Integers don’t have a sign bit
The binary number system described so far doesn’t include any way to specify a negative number. In other languages, e.g. C/C++, there are separate types for signed and unsigned integers. In Java we only have signed types.
How would you represent a signed number?
Most people think the obvious is to use 1 bit to represent the sign e.g.:
Minus | Fours | Twos | Ones | Total in Decimal |
---|---|---|---|---|
- | 0 | 1 | 1 | -1 x ((0 x 4) + (1 x 2) + (1 x 1)) = -3 |
+ | 0 | 1 | 1 | 1 x ((0 x 4) + (1 x 2) + (1 x 1)) = 3 |
The presence of a sign bit is a common belief, but it is not the case. The Java Language Specification JLS - Primitive Types and Values states that a system called two’s complement is used for integer types.
Two’s complement can cause a few quirky corner cases e.g. -Integer.MAX_VALUE
is Integer.MAX_VALUE
, see JLS - Unary Minus Operator.
Why do we use two’s complement?
The algorithms/hardware needed to do arithmetic, like addition, subtraction and multiplication are identical for signed numbers in this format and for unsigned numbers. This means half the circuitry is required in the cpu’s ALU.
Treating signed integers as unsigned ones
Since Java 8 unsigned arithmetic support has been added to the standard library to help you use the Java types as unsigned values.
The following code prints the binary representation of a series of numbers and how they can be interpreted as both signed and unsigned numbers. Note the differing behaviour as the integers roll over the 0
boundary, and the Integer.MAX_VALUE
boundary:
Outputs:
A quirk - Java has two shift operators
The right shift operator is often used as a quick way to divide by 2. Moving the digits of a signed number to the right reduces the value each contributes by a factor of 2.
In languages like C the way that the shift works is determined by whether the value shifted if signed or unsigned.
For unsigned numbers the left hand side of the number shifted is filled with 0
values. 32 >>> 1
is 2147483632
(01111111111111111111111111110000
).
For signed numbers the left hand side of the number is filled with the left most column’s value before the shift. I.e. if you start with the integer 11111111111111111111111111100000
i.e. -32
and you do -32 >> 1
, you end up with 11111111111111111111111111110000
or -16
. Whereas 32
(00000000000000000000000000100000
) shifted once to the right will yield 16
(00000000000000000000000000010000
).
Instead of Java having separate signed and unsigned types. It has two separate right shift operators.
The unsigned shift is the >>>
operator. It can be useful in applications where you are using the the integer to represent a bitmask).
The signed shift is the >>
operator and is useful when using the shift to do cheap division by powers of 2.
Confusion 2 - The result of the %
operator can be negative
Those who have some familiarity with modular arithmetic assume that the %
operator on integers will always return a positive value. In modular arithmetic e.g. 1
would be considered congruent to -2 mod 3
, but the integer remainder of -2
w.r.t. 3
is -2
.
The important equality is given in the JLS - Remainder Operator %:
The remainder operation for operands that are integers … produces a result value such that (a/b)*b+(a%b) is equal to a.
The correct code should have been:
Hexadecimal & Octal Numbers
Often in computing hexadecimal and octal representations of numbers are used. They use a number bases that are a power of 2, this means they can more compactly represent binary numbers, and can be more natural to use than a decimal number in some applications.
Java literals
The JLS - Integer Literals allows you to write code using binary, octal, hexadecimal or decimal forms. Since Java 8 decimals can use an _
separator to make longer numbers more readable. Some examples are given here.
Whatever representation you use in your source code, the JVM processes all numbers as binary numbers.
Sizes of integer types on the JVM
There are an infinite number of integers, you can always add 1 to the biggest you know about. In a computer space to store numbers is typically limited to a fixed size. There are different sizes of “integer” types, each of which uses a fixed number of bits to represent the number.
The JLS - Integral Types and Values states that we have the following integer primitive types. The more bits used to represent the number, the more columns we have in our binary representation and a wider range of numbers can be represented, a byte
covering the numbers -128
to 127
and a long
covering -9223372036854775808
to 9223372036854775807
.
type | size in bits |
---|---|
byte | 8 |
short | 16 |
int | 32 |
long | 64 |
JVM Quirk - bytes and shorts actually use 32 bits
Look at the list of JVM instructions, you will see that there are no instructions for adding, subtracting etc. bytes or shorts. Bytes and shorts are automatically widened to the int type when they are pushed onto the stack. Operations like addition are performed as 32 bit int additions.
In the following code, an explicit cast is required to assign the result of adding 2 bytes together to a byte variable.
The cast can be seen as an explicit i2b instruction in the decompiled code.
BigInteger
java.math.BigInteger is an arbitary precision integer type for the JVM. In theory it can represent any integer number. It abstracts away the internal representation, but allows you to specify and retreive the number in terms of byte[]
s in big-endian order, in two’s complement.
This code shows creating a number that requires more accuracy than that available in a long
:
It will show you that instead of rolling a bigger number requires more space to store.:
Auto boxing and unboxing
One final JVM quirk comes from the autoboxing of integers. Look at the test below:
This test passes for the numbers 127
and 0
. It fails for the numbers 128
and 12345
. Why is this?
The JLS - Boxing Conversion states:
If the value p being boxed is true, false, a byte, or a char in the range \u0000 to \u007f, or an int or short number between -128 and 127 (inclusive), then let r1 and r2 be the results of any two boxing conversions of p. It is always the case that r1 == r2.
I.e. there is a pool of number objects that Integer.valueOf
will return for e.g. 127
, but for numbers outside that pool a fresh object is created for each boxing.