Programming Problems in Java: A Primer for The Technical Interview (2014)
Chapter 6. Hash Tables
Hash tables are important data structures whose efficiency and flexibility allow them to be used in almost any kind of search problem.
A large part of being a computer programming is being able to evaluate a problem and design the proper data structure to address it. As the popularity of higher-level languages grows, more and more candidates default to the hash table. Because of their flexibility, hash tables can lead to a correct, but inefficient solution. They can be equally effective in finding duplicates in a linked-list, counting the elements in a stream, and searching a pre-generated list. But hash tables have drawbacks. They cannot be easily enumerated by key or value. It takes work to guarantee a proper hash function has been chosen for a specific data set, as collisions in the search space can degrade lookup performance tremendously. Also there is much memory overhead to using a hash table as a key-value map, and dynamic resizing can occur at any insert.
In this section we cover the material that every candidate should know regarding hash tables. We discuss the three important features of hash table implementation; the hash function, collision resolution, and dynamic resizing. And finally conclude with a number of applications of hash tables that have been asked as interview questions.
6.1 The hash function
A hash function maps a large domain to a smaller range of values. For instance, a simple hash function can map the set of all strings to the integers between 0 and 31. The hash of a data structure is computed by calling a hash function on a specified element of the data structure called a key. The hash value of a variable is the result of evaluating the hash function on that variable’s key.
Hash functions require a number of properties to guarantee correctness and efficiency of a hash table.
A hash function must be deterministic in that every data structure with the same key must have the same hash value. Otherwise retrieval from a hash table is impossible.
A hash function should guarantee that the domain is nearly uniformly distributed across the range. That means the probability that two elements, chosen at random, have identical hash values is n-1 when the range is of size n. In this way, that hash function minimizes the chance of false positives. A collision is when two distinct elements have equal hash values.
Finally, a hash function must be able to be efficiently computed across the domain. It is common for a hash function to take constant time on elements of the domain.
Hash functions may have other properties, but they find use in only special purposes. For instance a hash function can be tolerant of unimportant features of a key, such as capitalization, to provide normalized hashing. A hash function can be continuous, in that small changes to a key result in small changes to its hash value. This guarantees locality of similar objects within a hash table. Consistent hashing provides a balanced load across a set of ranges, and is used in load balancing client requests.
Let’s examine the problem of creating a hash function for lower case English words. Below is a first attempt. Here we hash any lower case English string to theASCIIvalue of its first character.
Listing 6.1: A Simple Hash Function
int hash_simple(String str) {
if (str.isEmpty()) {
return 0;
}
return str.charAt(0) - 'a';
}
This function is deterministic and extremely efficient, its range is limited to only 26 values and hence many strings will hash to the same value. Further, the corpus of English words is not uniformly distributed across these 26 values. An examination of the words in the Gutenberg library shows that 17% of the words start with t, nearly 12% of words start with the letter a but only 0.05% of words start with z.
To overcome these two problems, we add complexity to the calculation of the hash value.
Listing 6.2: Another Hash Function
int hash(String str) {
int hash_value = 0;
for (int i = 0; i < str.length(); i++) {
hash_value += str.charAt(i) - 'a';
}
return hash_value;
}
While this function has a larger range, however it is still small and rather sparse. Also it no longer guarantees constant time operation. The hash value of an extremely long word takes much more time to compute than the hash value of a short word. To overcome these drawbacks, let’s consider taking each four character set of characters from the word cast them as an integer. This significantly increases the range. We can sum these sets to distribute the words over the range. And to guarantee constant time operation we bound the number of characters used.
Listing 6.3: Complex Hash Table
int hash_complex(String str) {
int hash_value = 0;
for (int i = 0, count = 15;
i < str.length() && count >= 0;
i++, count--) {
hash_value += (str.charAt(i) - 'a')
<< (count % 4);
}
return hash_value;
}
This is better because it maintains a constant time evaluation and we are increasing the range.
It is common that complexity increases as we aim to smooth the distribution or increase the range. With greater complexity comes longer run time, and an unfortunate choice of hash functions can decrease the promise of constant time lookup.
6.2 Definition
A hash table stores elements in an array indexed by their hash value. Each entry in the array is called a bucket. Each element is associated with a key that is used to uniquely identify the bucket that stores the element. The hash function maps the element’s key to the bucket in which to store the element. The goal of hash functions is to provide constant time retrieval of stored elements while minimizing insertion time.
In the sequel we will use theItemobject for elements of the hash table. The key is a string member of anItem. Of course in practice, such a definition would be more complex.
Listing 6.4: Definition of Item object
class Item {
String key;
int value;
Item(String key, int value) {
this.key = key;
this.value = value;
}
}
With that defined, a hash table includes a collection ofItemobjects and exposes functions forhash_function,get,set, andsize. We illustrate this in the following abstract class.
Listing 6.5: Definition of Hashtable
abstract class HashTable {
protected static final int
DEFAULT_HASH_TABLE_SIZE = ...;
int hash_function(String str) {
return ...;
}
abstract void insert(Item entry);
abstract Item find(String key);
abstract int size();
abstract void resize(int size);
}
A simple hash table implementation can use aListto store the collection ofItemobjects.
Listing 6.6: Simple Hashtable
class SimpleHashTable extends HashTable {
private List<Item> items;
SimpleHashTable() {
this(DEFAULT_HASH_TABLE_SIZE);
}
SimpleHashTable(int size) {
this.items = new ArrayList<Item>(
Collections.nCopies(size, (Item)null));
}
Item get(int index) {
return (index < size()) ? items.get(index)
: null;
}
void set(int index, Item item) {
items.set(index, item);
}
int size() {
return items.size();
}
...
}
Notice that the hash function is defined to take a key type instead of anItem. This is because we want to use the same hash function for both insertion and retrieval. Also, we take the liberty of storing object references instead of values in the hash table. While it is true that Item is plain old data and simple to copy, hash tables often contain complex data structures for which it is better to store references instead of copies. We take this approach in this chapter.
6.3 Operations
The two basic operations of a hash table are insertion and retrieval. A third operation often required is dynamic resizing. It is used to increase the capacity and decrease the load of a hash table. By growing the table, it is the goal that the chance of collisions is reduced.
Hash table implementations may also provide access to enumerating the elements of a hash table. This can be done by allowing enumeration of the buckets or by providing direct access to the underlying array. However there is rarely a guarantee that the elements are provided in any sorted order.
Below we provide implementations ofinsert,find, andresize for our simple hash table.
6.3.1 Insertion
Theinsertoperation for a hash table requires calculating hash values and addressing collisions if they occur. Collision resolution is discussed in another section below. This listing provides an incomplete hash table implementation that does not guarantee retrieval of stored elements. Instead we evict older entries when collisions are detected. Such a scheme is useful in cache implementations, but not as useful for using hash tables for histograms.
Note that in order to convert the hash value to a valid index, we mod out the hash value by the size of theList.
Listing 6.7: Insert into a Hashtable
void insert(Item entry) {
int index = hash_function(entry.key) % size();
set(index, entry);
}
6.3.2 Retrieval
Implementing thefindoperation in a hash table depends on the hash value of a key and the insertion scheme. As multiple keys can hash to the same value, it is important that retrieval determine if an entry agrees with the key used in retrieval. For that reason there is an equality check after retrieval by index.
Listing 6.8: Find in a Hashtable
Item find(String key) {
int index = hash_function(key) % size();
Item item = get(index);
return item != null && item.key.equals(key)
? item
: null;
}
6.3.3 Dynamic resizing
Commonly, a hash table exposes a third method for dynamic resizing. Dynamic resizing refers to increasing the size of the underlying data structure while maintaining the key to hash value mapping of the table. There are two steps. The first is allocation of a new buffer to store the elements. And the second is a rehashing of the elements in the original table to the second.
Listing 6.9: Dynamic Resizing of a Hashtable
void resize(int size) {
List<Item> new_table = new ArrayList<Item>(
Collections.nCopies(size, (Item)null));
for (Item entry : this.items) {
if (null != entry) {
new_table.set(
hash_function(entry.key) % size,
entry);
}
}
this.items = new_table;
}
During dynamic resizing, a rehash is necessary. Because of this it is sometimes beneficial to update the hash function at the same time. Such an update may be as simple as changing an offset for the hash value calculation.
While this implementation provides a possibly lossy implementation of a hash table, any of the collision resolution schemes below guarantee retrieval of all elements inserted into the hash table.
6.4 Collision resolution
Collision resolution refers to a scheme for dealing with collisions in hash table insertion. In the simple example above, the hash table was not correct in the sense that subsequent inserts in the event of a hash collision could evict objects stored in the hash table. Below we will discuss and implement three popular and efficient schemes for collision resolution.
6.4.1 Separate chaining
Separate chaining is the most commonly taught method of collision resolution. With separate chaining, the buckets of the hash table contain linked-lists that hold the keys. This provides unlimited capacity for the hash table. However, it comes at the cost of increased lookup time as the hash table becomes filled to capacity. In the worst case, lookup time can become linear.
To implement separate chaining we modify the definition of a hash table so that the underlying data structure contains lists ofItem references.
Listing 6.10: Definition of Hashtable with Separate Chaining
class SeparateChainingHashTable
extends HashTable {
private List<List<Item>> items;
SeparateChainingHashTable() {
this(DEFAULT_HASH_TABLE_SIZE);
}
SeparateChainingHashTable(int size) {
this.items = new ArrayList<List<Item>>(
Collections.nCopies(size,
new LinkedList<Item>()));
}
List<Item> get(int index) {
return (index < items.size())
? items.get(index)
: null;
}
void set(int index, Item item) {
items.get(index).add(item);
}
...
}
When an element is inserted into a hash table with separate chaining, it is simply pushed onto the list stored at the bucket. This provides constant time insertion. It also guarantees that an item can be retrieved once it has been inserted as there is no eviction.
Listing 6.11: Insert with Separate Chaining
void insert(Item entry) {
int index = hash_function(entry.key) % size();
set(index, entry);
}
Retrieval from separate chaining requires enumerating the lists to complete the search for an entry. As above, we must ensure the element’s key matches the parameter. That this requirement is clear, as with separate chaining it is common for the table to have multiple entries with the same hash value.
Listing 6.12: Find with Separate Chaining
Item find(String key) {
int index = hash_function(key) % size();
List<Item> itemsList = get(index);
for (Item i : itemsList) {
if (i.key.equals(key)) {
return i;
}
}
return null;
}
Dynamic resizing with separate chaining takes only a little more effort, as each list in each bucket must be enumerated. Aside from this caveat, implementation of dynamic resizing is straightforward.
6.4.2 Open addressing
Open addressing is a scheme by which collisions are resolved by inserting an element at the next available bucket in the hash table given some iteration scheme. Commonly, open addressing involves simply iterating through the table. This is called open addressing with linear probing. However, open addressing can be implemented by other probing schemes such as successively doubling the index value. If no available bucket is found, the table is dynamically resized and the insert resumes. Note that after resizing, we are guaranteed to find an available bucket. Dynamically resizing and a possible linear time enumeration are the cost the hash table pays for guaranteeing retrieval.
We implement a hash table supporting open addressing with linear probing. Insertion is only a matter of finding an open bucket.
Listing 6.13: Insert with Open Addressing
void insert(Item entry) {
while (true) {
int index = hash_function(entry.key) % size();
for (int offset = 0;
offset < size();
offset++) {
int bucket_index = (index + offset) % size();
if (null == get(bucket_index)) {
set(bucket_index, entry);
return;
}
}
resize(size() ⋆ 2 + 1);
}
}
To ensure that a false negative is not returned on lookup, an open addressing scheme must iterate through all successive buckets to ensure that an entry is not present at successive location. Caution must be exercised to not loop continuously.
Listing 6.14: Find with Open Addressing
Item find(String key) {
int index = hash_function(key) % size();
for (int offset = 0;
offset < size();
offset++) {
int bucket_index = (index + offset) % size();
Item item = get(bucket_index);
if (null == item) {
return null;
}
if (key.equals(item.key)) {
return item;
}
}
return null;
}
6.4.3 Cuckoo hashing
Cuckoo hashing is a modern approach to collision resolution that provides expected constant time lookup and insertion when the load of the table is low. Instead of linear probing, cuckoo hashing uses a combination of two hash functions and two tables to find the next available bucket for a value. The hash functions and tables are labeled primary and secondary, corresponding to the order they are probed.
To realize expected constant time operations, cuckoo hashing requires a more complex definition for the hash table.
Listing 6.15: Definition of Hashtable with Cuckoo Hashing
class CuckooHashTable extends HashTable {
interface HashFunction {
int hash_function(String key);
}
private List<List<Item>> items;
private List<HashFunction> hash_functions;
CuckooHashTable() {
this(DEFAULT_HASH_TABLE_SIZE);
}
CuckooHashTable(int size) {
this.items = new ArrayList<List<Item>>(2);
this.items.add(new ArrayList<Item>(
Collections.nCopies(size, (Item)null)));
this.items.add(new ArrayList<Item>(
Collections.nCopies(size, (Item)null)));
}
int size() {
return items.get(0).size();
}
...
}
Insertions begin by checking the first table. If the hash value bucket is occupied, its contents are replaced with the new entry and the old entry is promoted to the second table. This process repeats until all entries are added or else an infinite loop is detected. In the latter case, the table is resized and the process continues. In the code below, we usetable_indexto track if we are operating on the primary or secondary table. The set is used to detect if a loop is encountered. Also we use references to make the code easier to read.
Listing 6.16: Insert with Cuckoo Hashing
void insert(Item entry) {
List<Set<Integer>> visited =
new ArrayList<Set<Integer>>(
Collections.nCopies(2,
new HashSet<Integer>()));
int table_index = 0;
int index = 0;
do {
List<Item> table = items.get(table_index);
HashFunction hashfn = hash_functions.get(
table_index);
index = hashfn.hash_function(entry.key)
% table.size();
Item previous_entry = table.get(index);
table.set(index, entry);
if (null == previous_entry) {
return;
}
entry = previous_entry;
visited.get(table_index).add(index);
table_index = (0 == table_index) ? 1 : 0;
} while (
!visited.get(table_index).contains(index));
resize(items.get(0).size() ⋆ 2 + 1);
insert(entry);
}
The resize is necessary if an infinite loop is detected. With a poor choice of hash function, it is possible for the insert-resize process to also loop. To insure against this, it is advisable to update the hash function during resize. Other than that, the resize member is straightforward.
Lookup in a hash table with cuckoo hashing must determine if the key is in the first or second table. The latter check is only needed when the hash table bucket in the first table is occupied by an element not equal to the value seen.
Listing 6.17: Find with Cuckoo Hashing
Item find(String key) {
for (int table_index = 0;
table_index < 2;
table_index++) {
List<Item> table = items.get(table_index);
HashFunction hashfn = hash_functions.get(
table_index);
int index = hashfn.hash_function(key)
% table.size();
Item item = table.get(index);
if (null == item) {
return null;
}
if (key.equals(item.key)) {
return item;
}
}
return null;
}
This concludes our overview of the fundamentals of hash tables.Javaprovides a hash table implementation in itsHashtableclass, which we will use when appropriate in the sequel.
6.5 Find the most common elements in a stream
Consider the problem of finding the most common element in large list with many distinct elements.
To solve this problem, we want to build a histogram of the count of each distinct element. To construct this histogram we will require a container to store an entry that is a tuple of the element and a counter. The element is the key of this entry, and the counter is the entry’s value.
When elements are first encountered, we add an entry with count 1. Otherwise we increment the counter associated with that element. So clearly, for each element in the list we need to support both insertion and lookup. In a binary tree, each lookup can require order log n comparisons, and each insert can require order log n comparisons. An efficient hash table we can reduce each lookup and insertion to expected constant time and solve the problem with a linear scan.
After scanning the input, the hash table has a count of the frequency of each element in the list. We can keep track of the current most common element during iteration. If the k most common elements are desired, we can use a bounded size min heap as in the previous section.
The following listing provides an implementation of finding the most common element in a list of strings using a hash table.
Listing 6.18: Find the Most Common Element
String find_most_common_element(
Scanner in,
Map<String, Integer> hash_table) {
String str;
while (in.hasNext()) {
str = in.next();
int previous_count =
(hash_table.get(str) == null)
? 0
: hash_table.get(str);
hash_table.put(str, previous_count + 1);
}
String max_str = "";
int max_count = 0;
for (Map.Entry<String, Integer> entry :
hash_table.entrySet()) {
if (entry.getValue() > max_count) {
max_str = entry.getKey();
max_count = entry.getValue();
}
}
return max_str;
}
6.6 Implement a most recently used cache
A cache is a data structure used to reduce retrieval time of data from a remote store. A subset of elements is stored locally, and retrieval first looks locally before looking remotely. Caches are bounded and store only a subset of the data available from the remote store. For this reason they must implement an eviction policy. A most recently used cache is one that evicts the least recently used element. In this subsection, we solve the problem of implementing a MRU cache.
The MRU cache must efficiently support both retrieval and priority updating. To implement this data type we use a combination of two containers we have already visited. We use the built-inHashMap implementation for lookup, and a doubly linked-list to maintain priority.
To facilitate retrieval in the linked-list, the hash table used for retrieval stores references to the nodes of the linked-list. The linked-list stores the priority of each entry in the cache by the order of the nodes. The head of the linked-list always contains the most recently accessed item, and the tail will be the least recently accessed item. In order to maintain correctness, the cache must be updated whenever an element is used.
The structures used in our cache are present in the definition below.
Listing 6.19: Most Recently Used Cache
class MruCache {
private LinkedList<Item> list;
private Map<String, Item> table;
private int bound;
MruCache(int bound) {
list = new LinkedList<Item>();
table = new HashMap<String, Item>();
this.bound = bound;
}
...
}
The MRU cache also implements two methods,findandstore. Thefindmethod uses the hash table for quick retrieval of items from the cache. Whenfindretrieves an item, that item is accessed and the priority must be updated. To update the priority without allocating new memory, we move the location of the item’s node within the linked-list to the head.
Listing 6.20: Retrieval in Most Recently Used Cache
Item find(String key) {
Item item = table.get(key);
if (null != item) {
list.remove(item);
list.offerFirst(item);
return item;
}
return null;
}
When local retrieval fails, a method must call the remote store. The item retrieved is then saved to the cache by thestorefunction. It is withinstorethat the eviction policy is implemented. The item stored is prepended to the front of the list, and the hash table is updated with the item’s key value. If the cache is above capacity after the item is added, the least recently used item is removed. This item is found at the tail of the list. When removed from the priority queue, the item must also be removed from the hash table.
Listing 6.21: Access Recording in Most Recently Used Cache
void store(Item item) {
list.push(item);
table.put(item.key, item);
if (list.size() > bound) {
table.remove(list.getLast().key);
list.removeLast();
}
}
Let’s consider the running time of this implementation. The operations of thefindmethod are a hash table lookup and moving a node to the head of the linked-list. These are constant time operations. Further, the operations of thestoremethod are insertion into the hash table, insertion into a linked-list, and a possible removal from the linked-list. These operations also take constant time. From this we see we have an efficient implementation of the MRU cache.