This post outlines a general-purpose alternative to ArrayLists which provides lazy insertions and deletions to speed up those operations. The Java source code, unit tests, javadoc and jar file for all the classes mentioned in this blog are available here: scottlogic-utils-mr.zip.
ArrayLists are perfect when you want to get elements by their indices, but not so brilliant when you want to insert or remove elements. In a previous post, I showed how a balanced binary search tree can be designed to work as if it were an extendable array; this approach offers faster removals and insertions at the cost of more time to get elements. In this post, I'll go through an implementation of a new list (at least, I don't know what the name for it is, if it has one already!) which attempts to marry up these two approaches, offering improved speed on all operations over the binary tree based list by being lazy.
I call this new list a "PatchWorkArray" because it stores its data in patches throughout a regular ArrayList. The idea behind it is this: suppose you've got a list with a large number of elements in it and want to remove the element at index i, where i < n and n is the number of elements in the list. Instead of moving elements i+1 to n up one place (which is what happens in an ArrayList), you could just leave the list as it is; then, when someone asks for the element at index j, where j >= i, you give them the element at index j+1 instead.
The PatchWorkArray generalizes this idea; it's an ArrayList (the "backing list") and an associated set of alterations (insertions and removals) to that ArrayList. When an element is requested, the set of alterations is queried to figure out the correct location of the element in the backing list, which is then returned.
The relationship between the various classes involved is described in the following UML diagram:
Dealing with Alterations
The PatchWorkArray doesn't do anything special about removals or insertions made at the end of the list - they are simply performed on the backing list itself. However, alterations made at any other position are handled differently. When an element is removed, instead of removing the corresponding element from the backing list, it is replaced with a dummy value. When an element is inserted, the element currently at the corresponding position in the backing list is replaced with a new sub-list containing the newly inserted element followed by the replaced one. This sub-list is an instance of a binary tree based list, which allows log-time insertions and removals by index. Further deletions and insertions at indices covered by this sub-list are then performed on the sub-list itself, rather than affecting the backing list.
For example, suppose you have a PatchWorkArray storing the numbers 0 to 9; in the simplest case the backing list would look like so:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
If you then removed the element at index 2, then inserted -1 at index 6, the backing list would appear like so:
[0, 1, R, 3, 4, 5, 6, [-1, 7], 8, 9]
where the 'R' represents the dummy value. Note that indices start from 0, and that when we inserted the -1 at index 6, this actually caused a sub-list to be created at index 7 in the backing list, due to the removal we made before. To make rapid searching of the list possible, the PatchWorkArray needs to store an alternative representation of the alterations that have been made.
In this representation, alterations are represented by simple Java objects (instances of the Alteration class) with two key integer properties: index and indexDiff. Index is the index in the backing list at which the alteration occurs and is unique. IndexDiff is the difference the alteration makes to the index when you're searching for an element; if that sounds a bit abstract now, not to worry, as hopefully it'll become clearer later on. Removal and Addition are the two concrete subclasses of Alteration. A Removal object represents a block of consecutive removals in the backing list starting at its index position; its indexDiff is set to the number of removed elements. Additions have negative indexDiffs, equal to one minus the size of the sub-list at their index in the backing list.
For example, you can have a PatchWorkArray, once again storing the numbers 0 to 9, with a backing list that looks like this:
[R, R, R, 0, 1, [2, 3, 4], 5, R, 6, [7, 8], 9]
If this were the case, the following Alterations would need to be stored:
| Type | Index | IndexDiff |
|---|---|---|
| Removal | 0 | 3 |
| Addition | 5 | -2 |
| Removal | 7 | 1 |
| Addition | 9 | -1 |
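Expressed as code, the hierarchy might be modelled roughly as follows - a simplified sketch based on the description above, not necessarily how the actual classes in the download are written:

```java
// Simplified sketch of the Alteration hierarchy: an alteration is identified by a
// unique index into the backing list, plus an indexDiff following the convention above.
abstract class Alteration {
    final int index;  // index in the backing list at which the alteration occurs

    Alteration(int index) {
        this.index = index;
    }

    abstract int indexDiff();
}

// A block of consecutive removals: indexDiff is the number of removed elements.
class Removal extends Alteration {
    int removedCount;  // number of consecutive dummy values starting at index

    Removal(int index, int removedCount) {
        super(index);
        this.removedCount = removedCount;
    }

    @Override
    int indexDiff() {
        return removedCount;
    }
}

// An insertion point: the backing list holds a sub-list at this index, and indexDiff
// is one minus that sub-list's size (negative for any sub-list of two or more elements).
class Addition extends Alteration {
    int subListSize;  // size of the sub-list stored at index in the backing list

    Addition(int index, int subListSize) {
        super(index);
        this.subListSize = subListSize;
    }

    @Override
    int indexDiff() {
        return 1 - subListSize;  // e.g. a sub-list of size 3 gives an indexDiff of -2
    }
}
```

With these conventions the table above falls out directly; for instance, the sub-list [2, 3, 4] at backing index 5 corresponds to an Addition with an indexDiff of 1 - 3 = -2.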
The data structure that stores these Alterations is an instance of AlterationList, a subclass of the SortedList class I wrote for a previous post, and is relatively complicated in itself. It's a balanced binary search tree (actually an AVL tree) that stores Alterations in order of their indices, where each node in the tree caches the sum total of all the indexDiffs of its descendants.
The AlterationList class has only one key method, with the signature ArrayLocation getLocationOf(int index), which takes an index of an element in the PatchWorkArray and returns the position at which it would expect to find that element in the backing list. This position is represented as a simple object (an instance of ArrayLocation) which is effectively a wrapper for two integers - the requested element's index in the backing list and, if appropriate, its index in the sub-list at that position. The getLocationOf method is similar to a binary tree search, except that each time it encounters a node while traversing, it needs to update the index of the element it's looking for, based on the Alteration at that node and the Alterations in its left sub-tree (those with smaller indices). The commented source code for this method is included in the download.
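In outline, the search looks something like the following sketch, in which AltNode, leftSubTreeDiffTotal and the ArrayLocation field names are simplified stand-ins for the real AlterationList internals, and the tree re-balancing is omitted entirely:

```java
// The location of an element within the backing list: its index in the backing list
// and, where the element sits inside a sub-list, its index within that sub-list.
final class ArrayLocation {
    static final int NO_SUB_LIST = -1;
    final int backingIndex;
    final int subListIndex;

    ArrayLocation(int backingIndex, int subListIndex) {
        this.backingIndex = backingIndex;
        this.subListIndex = subListIndex;
    }
}

// One node of the (AVL) tree of alterations, ordered by backing-list index.
final class AltNode {
    final int index;          // backing-list index of this alteration (unique)
    final int indexDiff;      // +k for a block of k removals, 1 - s for a sub-list of size s
    AltNode left, right;
    int leftSubTreeDiffTotal; // sum of the indexDiffs of every alteration in the left sub-tree

    AltNode(int index, int indexDiff) {
        this.index = index;
        this.indexDiff = indexDiff;
    }

    boolean isRemoval() {
        return indexDiff > 0;
    }

    int subListSize() {
        return 1 - indexDiff;  // only meaningful for additions
    }
}

final class SimplifiedAlterationList {
    AltNode root;  // root of the tree of alterations (rebalancing not shown)

    // Translates an index in the PatchWorkArray into a location in the backing list.
    ArrayLocation getLocationOf(int index) {
        AltNode node = root;
        int accDiff = 0;  // total indexDiff of alterations known to lie before the target

        while (node != null) {
            // Sum of the indexDiffs of every alteration at a smaller backing-list index.
            int diffBefore = accDiff + node.leftSubTreeDiffTotal;
            // The PatchWorkArray index at which this alteration starts to have an effect.
            int startIndex = node.index - diffBefore;

            if (index < startIndex) {
                // The target lies before this alteration - ignore it and everything to its right.
                node = node.left;
            } else if (!node.isRemoval() && index < startIndex + node.subListSize()) {
                // The target falls inside this addition's sub-list.
                return new ArrayLocation(node.index, index - startIndex);
            } else {
                // The target lies after this alteration - account for it and its left sub-tree.
                accDiff += node.leftSubTreeDiffTotal + node.indexDiff;
                node = node.right;
            }
        }
        // No alteration contains the target: just shift the index by everything before it.
        return new ArrayLocation(index + accDiff, ArrayLocation.NO_SUB_LIST);
    }
}
```

Running this against the example above, getLocationOf(3) returns backing index 5 with sub-list index 1 (the element 3 inside [2, 3, 4]), while getLocationOf(6) returns backing index 8 with no sub-list index.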
With the getLocationOf method in place, the PatchWorkArray's code is greatly simplified. For example, the get method of the PatchWorkArray is fairly straightforward.
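Sketched roughly (assuming fields named backingList and alterations, the simplified ArrayLocation above, and leaving out any range checking), it looks something like this:

```java
@SuppressWarnings("unchecked")
public T get(int index) {
    // Ask the alterations where the requested element actually lives in the backing list.
    ArrayLocation loc = alterations.getLocationOf(index);
    Object stored = backingList.get(loc.backingIndex);
    if (loc.subListIndex != ArrayLocation.NO_SUB_LIST) {
        // The backing list holds a sub-list at this position; the element is inside it.
        return ((List<T>) stored).get(loc.subListIndex);
    }
    return (T) stored;  // a plain element stored directly in the backing list
}
```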
Note that in this case, the PatchWorkArray is parameterised by the type T, but the backing list is actually an instance of List<Object>. This is because the backing list has to store sub-lists and dummy values as well as objects of type T, hence the use of casting.
The Advantage of Laziness
Although it's clear that lazily removing and inserting elements can make these operations faster, being lazy has another advantage - it gives you the opportunity to optimise the data structure over a series of operations. In a simple case, imagine you remove the element at index zero from a list and then, afterwards, insert a new element at that index. If the list is an ArrayList, this would involve copying the whole array twice. However, if it's a PatchWorkArray, making these changes requires only a constant number of operations and, what's more, thanks to some optimisations, you are left with a backing list which is an exact match of the equivalent ArrayList (i.e. its AlterationList is empty).
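As a quick illustration of this from the caller's side (assuming the no-argument constructor from the download; the class name and generics are as described above):

```java
import java.util.List;

public class LazinessDemo {
    public static void main(String[] args) {
        // Build a large list, then remove and re-insert at the front. On an ArrayList
        // both calls would shift every element; on a PatchWorkArray each is a cheap,
        // lazy update, and the second one cancels out the first.
        List<Integer> list = new PatchWorkArray<Integer>();
        for (int i = 0; i < 1000000; i++) {
            list.add(i);
        }
        list.remove(0);   // marks index 0 as removed rather than shifting elements
        list.add(0, -1);  // re-uses the dummy slot, leaving no outstanding alterations
        System.out.println(list.get(0) + ", size: " + list.size());  // "-1, size: 1000000"
    }
}
```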
The optimisations carried out in the PatchWorkArray are relatively simple and are by no means pushing the boundaries of what could be achieved with more time and attention. These optimisations can be considered "local", since they work only by considering what's happening at the indices directly next to where a change is being made. They are implemented in the add and remove methods of the PatchWorkArray class and are an attempt to reduce the number of alterations to the backing list when making a change. They ensure that consecutive blocks of dummy values in the backing list are represented by a single Removal in the AlterationList, and that indices holding dummy values are used to store newly inserted elements whenever possible. The following simple examples show these local optimisations at work and how insertions and removals affect the state of a PatchWorkArray:
| # | Backing List (before) | Alterations (before) | Operation | Backing List (after) | Alterations (after) |
|---|---|---|---|---|---|
| 1 | [R, 1, R, 2] | Removal (index 0, indexDiff 1); Removal (index 2, indexDiff 1) | remove(0) | [R, R, R, 2] | Removal (index 0, indexDiff 3) |
| 2 | [9, 8, R, 7] | Removal (index 2, indexDiff 1) | add(2, -1) | [9, 8, -1, 7] | (none) |
In example one, an element is removed which has dummy values to the left and right, causing the entire block of dummy values to be represented by a single Removal in the AlterationList. In the second example, an element is added to a position where it can replace a dummy value, causing the AlterationList to be emptied.
Performance
The asymptotic runtimes of the PatchWorkArray's methods are slightly harder to quantify than those of regular data structures, as they depend on its alterations rather than on the number of elements in it. The important metric, which I call the "performance hit", is the number of Removals plus the sum of the absolute indexDiffs of all the Additions. Note that this metric can never be greater than one plus the number of elements in the backing list, since consecutive removals are represented using a single Removal object. Although it's not strictly true, I like to think of this metric as a measure of where the PatchWorkArray lies between an ArrayList and a binary tree based list; if the performance hit is zero, its backing list is pretty close to the equivalent ArrayList, and if it's close to the number of elements in the list, then it's likely that it contains at least one large binary tree based sub-list.
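As a worked example, the backing list [R, R, R, 0, 1, [2, 3, 4], 5, R, 6, [7, 8], 9] shown earlier has two Removals and two Additions with indexDiffs of -2 and -1, giving a performance hit of 2 + 2 + 1 = 5.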
The asymptotic worst-case runtimes of various operations on a PatchWorkArray and other list implementations are given below, where n is the number of elements in the list and p is the performance hit of the PatchWorkArray:
| | Add (front) | Add (middle) | Add (end) | Remove (front) | Remove (middle) | Remove (end) | Get by Index |
|---|---|---|---|---|---|---|---|
| LinkedList | O(1) | O(n) | O(1) | O(1) | O(n) | O(1) | O(n) |
| ArrayList/Vector | O(n) | O(n) | O(1)* | O(n) | O(n) | O(1)* | O(1) |
| UnsortedList (binary tree based list) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) |
| PatchWorkArray | O(log(p)) | O(log(p)) | O(1)* | O(log(p)) | O(log(p)) | O(log(p))** | O(log(p)) |

(* amortized constant time - running the method k times takes O(k) time; ** if the removal is from the end of the backing list the time taken is O(1)*, if however the end of the PatchWorkArray is not the end of the backing list, the time required is O(log(p)).)
I suppose you could call it cheating to say that theoretically the performance of a PatchWorkArray is superior to that of a binary tree based list, since it's possible that p = n + 1. However, the point is that p can be controlled separately from n, and in practice it's likely that p is significantly smaller than n anyway.
Other "Non-Trivial" MethodsThe PatchWorkArray
contains a fair amount of code and took a quite lot more time than I anticipated, especially owing to the fact that I needed a binary-tree based list implementation first! Apart from the add
and remove
methods which perform the optimisations, the other technically involved code is in the internal implementation of the Iterator
interface returned by the iterator
method, and in the fix
method, which is unique to this type of list.
I won't go through it here (as the code is pretty long and rather ugly!), but the Iterator implementation allows the PatchWorkArray to be iterated over in time linear in the number of elements in the list. This is more difficult than you might think, owing to the fact that this number is potentially far smaller than the number of dummy values in the list. The fix method, however, is worth considering; its purpose is to resolve the alterations in the backing list so that it becomes effectively just like the equivalent ArrayList. The runtime of this method is also linear in the number of elements in the list. The idea is that this method is called periodically, or when required, to allow the PatchWorkArray to operate with maximum efficiency. The commented source code for this method is included in the download.
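The gist of it, expressed in terms of the simplified Alteration type sketched earlier, and assuming fields named backingList and alterations (the latter iterable in index order and clearable), is something like this:

```java
// Rebuilds the backing list so that it directly matches the list's contents,
// leaving no outstanding alterations. Runs in time linear in the number of elements.
public void fix() {
    List<Object> fixedList = new ArrayList<Object>(size());
    int nextToCopy = 0;  // next backing-list index not yet copied across

    // Walk the alterations in backing-list order, so that the clean runs between
    // them can be copied across in bulk and dummy blocks skipped outright.
    for (Alteration alt : alterations) {
        // Copy the untouched run of elements before this alteration straight across.
        fixedList.addAll(backingList.subList(nextToCopy, alt.index));
        if (alt.indexDiff() > 0) {
            // A block of removals: skip over the dummy values entirely.
            nextToCopy = alt.index + alt.indexDiff();
        } else {
            // An addition: flatten the sub-list stored at this position into the new list.
            fixedList.addAll((List<?>) backingList.get(alt.index));
            nextToCopy = alt.index + 1;
        }
    }
    // Copy whatever remains after the last alteration.
    fixedList.addAll(backingList.subList(nextToCopy, backingList.size()));

    backingList = fixedList;   // switch to the freshly built backing list
    alterations.clear();       // no alterations remain once the list has been fixed
    modCount++;                // invalidates existing iterators on a best-effort basis
}
```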
This method works by creating a new ArrayList, which will form the new "fixed" backing list, then filling in the elements by iterating over the AlterationList, rather than the backing list itself. This is crucial in order to keep the process linear. The method ends by simply switching the backing list for the new list and resetting the alterations. The call to increment modCount enables iterators going through this PatchWorkArray to fail on a best-effort basis; for more information see AbstractList, which is an ancestor of the PatchWorkArray class.
In the next post, I'll go through how to effectively subclass and use the PatchWorkArray class, and show how it compares in empirical tests to other List implementations. In the meantime, feel free to try it out and let me know what you think. I've written a fair number of unit tests, which are included in the download, so hopefully it's all working as expected (especially as it really isn't the easiest thing to debug!), but if you come across an issue please let me know.
Mark Rhodes