This post looks at how to construct Java's built-in hash-based Map implementations to ensure they have sufficient, but not excessive, capacity.

Hash-based maps are one of the most commonly used data structures in programming, and Java provides two implementations of them: the commonly used HashMap and the older, synchronized Hashtable.
A hash map is essentially a wrapper around an array containing linked lists of the items stored in it. Each item is assigned an index in the array based on its hash code (the value returned by invoking its hashCode() method) and the length of the array. This means that, so long as the number of items stored in the map is small in comparison to the length of the backing array, and the hash codes allow the items to be evenly distributed throughout it, the map can perform many operations (e.g. finding/adding/removing elements) in constant time. However, as more and more items are added, the map needs to maintain an acceptable ratio between the number of items it is storing and the length of the underlying array. To do this, it periodically creates an entirely new, larger backing array and moves all the existing items over to it.
The reallocation of elements to a new backing array is an expensive operation (at best linear in the number of items in the map), so it's best to avoid it if at all possible. To help with this, you can pass the initial size of the underlying array as an argument when constructing your hash map. This is only really useful if you know (or can at least make a good guess at) the number of items your new map is going to store, which is often the case. For example, one common use of a hash map is to store the elements of an array against some id of those elements; I've seen something like the following code many times over:
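Something along these lines, with the Thing record standing in for whatever domain object is being indexed:

```java
import java.util.HashMap;
import java.util.Map;

public class IndexById {
    // Illustrative stand-in for the objects being indexed.
    record Thing(int id, String name) {}

    public static void main(String[] args) {
        Thing[] things = {
            new Thing(1, "a"), new Thing(2, "b"), new Thing(3, "c")
        };
        // Default constructor: the backing array starts at length 16 and is
        // doubled each time the load factor threshold is crossed.
        Map<Integer, Thing> thingsById = new HashMap<>();
        for (Thing thing : things) {
            thingsById.put(thing.id(), thing);
        }
        System.out.println(thingsById.size()); // prints 3
    }
}
```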
The use of the default constructor in the above code means that it's possible the map will have to unnecessarily reallocate the elements to ever-larger backing arrays several times before the loop has finished. However, the snag is that it's not entirely obvious what you should set the initial size of the backing array to in order to prevent the map from needing to do this, while not using an excessive amount of memory. After trawling through the source code of the two hash map implementations and writing some unit tests, it looks like a good initial value is defined by:
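Expressed as code, the rule works out to rounding the required capacity divided by the load factor up to the nearest integer (a distillation of my reading of the sources, so treat it as a sketch):

```java
public class MapSizing {
    // Smallest initial capacity that lets the map hold requiredCapacity
    // items without its backing array being reallocated.
    static int initialCapacity(int requiredCapacity, float loadFactor) {
        return (int) Math.ceil(requiredCapacity / (double) loadFactor);
    }

    public static void main(String[] args) {
        System.out.println(initialCapacity(100, 0.75f)); // prints 134
    }
}
```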
Here, loadFactor is the threshold ratio between the number of items in the map (n) and the size of the backing array (s): if n > loadFactor * s, a new, larger backing array is created and the elements are added to it. The default load factor for both implementations is 0.75. The above code could therefore be written more efficiently like so:
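Applied to the earlier id-indexing pattern, the sizing looks like this (Thing is again an illustrative stand-in):

```java
import java.util.HashMap;
import java.util.Map;

public class IndexByIdSized {
    // Illustrative stand-in for the objects being indexed.
    record Thing(int id, String name) {}

    public static void main(String[] args) {
        Thing[] things = {
            new Thing(1, "a"), new Thing(2, "b"), new Thing(3, "c")
        };
        // Size the backing array up front: ceil(n / loadFactor) slots are
        // enough for n items at the default load factor of 0.75.
        Map<Integer, Thing> thingsById =
            new HashMap<>((int) Math.ceil(things.length / 0.75));
        for (Thing thing : things) {
            thingsById.put(thing.id(), thing);
        }
        System.out.println(thingsById.size()); // prints 3
    }
}
```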
The same trick also works for instances of Java's commonly used HashSet class, as a HashSet is effectively just a wrapper around a HashMap instance. For example, the following code instantiates a HashSet which can hold at least 100 Strings without requiring a reallocation of elements to a new backing array:
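The capacity of 134 below is ceil(100 / 0.75), per the formula above:

```java
import java.util.HashSet;
import java.util.Set;

public class SizedSetExample {
    public static void main(String[] args) {
        // ceil(100 / 0.75) = 134 slots: room for 100 elements at the
        // default load factor without a resize.
        Set<String> strings = new HashSet<>(134);
        for (int i = 0; i < 100; i++) {
            strings.add("string-" + i);
        }
        System.out.println(strings.size()); // prints 100
    }
}
```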
Testing it Works

In order to verify that initializing hash maps in this way works as expected, I wrote some JUnit tests, the source for which is available here: JUnit test source. Since we are interested in the state of the underlying backing arrays, which are private in both the HashMap and Hashtable classes, these tests need to use reflection to see what's going on. To reduce the amount of duplicate code, I made use of the abstract factory pattern, so that the bulk of the testing code doesn't need to know the specifics of how the map instances it's testing are constructed.
The interface for the factory which provides instances to verify is as follows:
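Based on the method names used below, the factory would look something like this (the exact generic types are my assumption):

```java
import java.util.Map;

// Abstract factory for the map implementation under test.
public interface TestInstanceFactory {
    // A map sized so that requiredCapacity items fit without a resize.
    Map<Integer, Object> createTestInstance(int requiredCapacity, float loadFactor);

    // A map whose backing array starts smaller, for comparison.
    Map<Integer, Object> createSmallerMapInstance(int requiredCapacity, float loadFactor);
}
```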
The idea is that the test code gets two map instances of the implementation to be tested, one from each of these methods, passing in the same values. It then adds the given number of elements to each and verifies that the backing array of the instance provided by the createTestInstance method did not resize and remained no larger than that of the instance provided by the createSmallerMapInstance method.

The implementation of the TestInstanceFactory for verifying the initialization of HashMaps is an anonymous inner class that looks like so:
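A plausible sketch of that factory, assuming the TestInstanceFactory interface described above; passing the item count straight through as the capacity is one natural choice for the undersized comparison map:

```java
TestInstanceFactory hashMapFactory = new TestInstanceFactory() {
    @Override
    public Map<Integer, Object> createTestInstance(int requiredCapacity, float loadFactor) {
        // Sized so that requiredCapacity items fit without a resize.
        return new HashMap<>((int) Math.ceil(requiredCapacity / (double) loadFactor), loadFactor);
    }

    @Override
    public Map<Integer, Object> createSmallerMapInstance(int requiredCapacity, float loadFactor) {
        // Naively pass the item count as the capacity; at the default load
        // factor this can force a resize once the threshold is crossed.
        return new HashMap<>(requiredCapacity, loadFactor);
    }
};
```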
The actual test method, which takes a TestInstanceFactory as a parameter and verifies the behaviour, is as follows:
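A sketch of such a test, using plain assertions in place of JUnit's. It assumes the backing array lives in a private field named table, which is the case for the OpenJDK HashMap and Hashtable implementations; note that recent JDKs allocate the HashMap table lazily, hence the length is sampled after the first put:

```java
void assertNoResize(TestInstanceFactory factory, int requiredCapacity, float loadFactor)
        throws Exception {
    Map<Integer, Object> testInstance = factory.createTestInstance(requiredCapacity, loadFactor);
    Map<Integer, Object> smallerInstance = factory.createSmallerMapInstance(requiredCapacity, loadFactor);

    int initialLength = -1;
    for (int i = 0; i < requiredCapacity; i++) {
        testInstance.put(i, "value-" + i);
        smallerInstance.put(i, "value-" + i);
        if (i == 0) {
            initialLength = backingArrayLength(testInstance);
        }
    }

    // The properly sized map never reallocated its backing array...
    assert backingArrayLength(testInstance) == initialLength;
    // ...and isn't using a larger array than the naively sized map.
    assert backingArrayLength(testInstance) <= backingArrayLength(smallerInstance);
}

// Reads the length of the private backing array via reflection.
static int backingArrayLength(Map<?, ?> map) throws Exception {
    java.lang.reflect.Field table = map.getClass().getDeclaredField("table");
    table.setAccessible(true);
    return ((Object[]) table.get(map)).length;
}
```

On Java 9 and later, the reflective access needs `--add-opens java.base/java.util=ALL-UNNAMED` on the command line.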
The test is by no means perfect, as it requires a sensible implementation of the TestInstanceFactory, and it's never going to be that generic due to the use of reflection, but it gives a fair level of confidence that things are working as expected. It's worth noting that if you were to alter the implementation of the createTestInstance method of the factory for HashMaps to be:
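That alternative adds one rather than rounding up, mirroring the `(int) (n / loadFactor) + 1` sizing found in older OpenJDK HashMap.putAll implementations:

```java
@Override
public Map<Integer, Object> createTestInstance(int requiredCapacity, float loadFactor) {
    // + 1 instead of rounding up: overshoots when the division is exact.
    return new HashMap<>((int) (requiredCapacity / loadFactor) + 1, loadFactor);
}
```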
which is effectively what is used within the HashMap source, the test fails in the case that requiredCapacity / loadFactor is an exact power of two: the extra one pushes the capacity past that power of two, so twice the required memory is allocated for the backing array.