A `dict` in Python, a `map` in Go, an `Object` or `Map` in JavaScript, an associative array in PHP, a `Dictionary<TKey, TValue>` in C#. Hash maps are built into virtually every high-level programming language. And they are awesome! Who doesn't want to store and then access data in constant time? Whether you're working with large datasets or grinding LeetCode problems, this data structure very often comes to the rescue. But what exactly are hash maps, and how do they work under the hood? In this article, we will try to answer these questions.

## What Is a Hash Map?

At a high level, a hash map (or hash table) is a data structure that implements the associative array abstract data type: a structure that maps keys to values. The key uniquely identifies a value in the map, and the value is the data being stored. Hash maps are designed to provide fast insertion, deletion, and lookup of key-value pairs. In fact, the average time complexity for these operations is O(1), which means they can be performed in constant time! This makes hash maps probably the most used data structure in programming. There are some caveats, however: the worst-case time complexity for these operations is O(n), and the more we know about the internals, the more likely we are to avoid the scenarios that trigger it.

According to the Wikipedia article:

> A hash table is a data structure that implements an associative array, also called a dictionary or simply map; an associative array is an abstract data type that maps keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found.

So let's take a step back and look at the components of a hash map.

## What Is a Hash Function?

A hash function is a function that takes an input (or "key") and typically returns an integer that is used to index the data in the hash map. The key is transformed into an integer, which then determines the index in the underlying array where the value is stored. A good hash function has the following properties:

- **Deterministic**: The same input always produces the same output.
- **Uniform Distribution**: The hash function should distribute keys uniformly across the hash table to minimize collisions.
- **Fast Computation**: The hash function should be quick to compute, even for large inputs.
- **Minimizes Collisions**: The space of possible keys is typically much larger (often infinite) than the space of hash codes, so different keys may produce the same hash code. While such collisions are inevitable, a good hash function minimizes the chances of two different keys producing the same hash code.

A simple example of a hash function is the modulo operation, which takes a key and returns the remainder when divided by the size of the hash table. For example, if we have a hash table of size 10 and a key of 23, the hash code would be `23 % 10 = 3`, meaning that the value associated with the key 23 is stored at index 3 in the underlying array. And if the key is 33, the hash code is `33 % 10 = 3` as well, which means we have a collision: both keys map to the same index in the array.
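Here is a minimal sketch of this modulo scheme in Python. It is a toy for illustration only; real hash functions, such as the one behind Python's built-in `hash()`, are far more sophisticated:

```python
TABLE_SIZE = 10  # number of slots in the underlying array

def modulo_hash(key: int) -> int:
    """Toy hash function: map an integer key to an array index."""
    return key % TABLE_SIZE

print(modulo_hash(23))  # 3
print(modulo_hash(33))  # 3 -- same index as 23: a collision!
print(modulo_hash(42))  # 2
```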
## What Is a Bucket?

A bucket is a slot in the hash table where a key-value pair is stored. In case of a collision, where two different keys produce the same hash code, a bucket can store multiple key-value pairs. This is often done using a linked list or another data structure to handle collisions. The diagram below illustrates how this works:

*(Diagram: the hash function mapping keys to bucket indices in the underlying array.)*

Here we can see how the hash function maps keys to indices in the underlying array. The keys 23 and 33 both produce the same hash code of 3, so they are stored in the same bucket. The bucket then stores both key-value pairs, but when we retrieve a value, we need to check the keys in the bucket to find the correct one. This is where the time complexity can degrade to O(n) in the worst case, if many (or even all) keys collide and end up in the same bucket.

## Load Factor

The load factor is a measure of how full the hash table is. It is calculated as the number of elements in the hash table divided by the number of buckets (or slots) in the underlying array. A higher load factor means that there are more elements in the hash table relative to the number of buckets, which can lead to more collisions and slower performance.
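As a quick illustration in Python (the 0.75 threshold below is a common choice, used for example by Java's `HashMap`, but every implementation picks its own):

```python
def load_factor(num_elements: int, num_buckets: int) -> float:
    """Load factor = stored elements / available buckets."""
    return num_elements / num_buckets

MAX_LOAD_FACTOR = 0.75  # a common threshold; implementations vary

# 9 elements spread over 10 buckets: time to grow the table
if load_factor(9, 10) > MAX_LOAD_FACTOR:
    print("Load factor exceeded -- resize and rehash")
```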
## Collision Resolution

When two keys produce the same hash code, we have a collision. There are several strategies to handle collisions in hash maps:

- **Chaining**: Each bucket contains a linked list (or another data structure) of all key-value pairs that hash to the same index. When a collision occurs, the new key-value pair is simply added to the list in the appropriate bucket. This is the most common method for handling collisions (see the chaining sketch after this list).
  - **Complexity**: Average O(1) for all operations; worst-case O(n) if all keys hash to the same bucket
  - **Pros**: Simple to implement, handles high load factors well, deletion is straightforward
  - **Cons**: Extra memory overhead for pointers, poor cache performance due to scattered memory access
- **Open Addressing**: When a collision occurs, the hash table searches for the next available slot in the array to store the new key-value pair (see the probing sketch after the summary table below). There are several techniques for finding the next available slot:
  - **Linear Probing**: If a collision occurs, the algorithm checks the next slot in the array until it finds an empty one.
    - **Complexity**: Average O(1); worst-case O(n) due to primary clustering
    - **Pros**: Simple implementation, good cache performance for nearby accesses
    - **Cons**: Primary clustering (runs of consecutive occupied slots); performance degrades as clusters grow
  - **Quadratic Probing**: Instead of checking the next slot, it checks slots at increasing distances (1, 4, 9, etc.) from the original index.
    - **Complexity**: Average O(1); better than linear probing due to reduced clustering
    - **Pros**: Reduces primary clustering compared to linear probing, still reasonably cache-friendly
    - **Cons**: Secondary clustering (keys with the same hash follow the same probe sequence); may not visit all slots
  - **Double Hashing**: Uses a second hash function to determine the step size for probing. Unlike linear probing, which always moves to the next slot, or quadratic probing, which uses a fixed sequence, double hashing calculates a unique step size for each key. The formula is typically `index = (hash1(key) + i * hash2(key)) % table_size`, where `i` is the probe attempt number. The second hash function must return a value that is relatively prime to the table size to ensure all slots can be visited. For example, if `hash1(key) = 7` and `hash2(key) = 3`, we'd probe indices 7, 10, 13, 16, and so on (modulo the table size).
    - **Complexity**: Average O(1); the best performance among open addressing methods
    - **Pros**: Minimizes clustering, uniform distribution of probe sequences, visits all slots when properly implemented
    - **Cons**: More complex implementation, requires computing two hash functions, slightly more overhead per operation
- **Rehashing**: If the load factor becomes too high, the hash table is resized and all existing key-value pairs are rehashed into the new table. This helps maintain efficient performance as the number of elements grows.
  - **Complexity**: O(n) for the rehashing operation itself, but amortized O(1) per insertion over time
  - **Pros**: Maintains optimal performance by keeping the load factor low, prevents performance degradation
  - **Cons**: Temporary performance spike during rehashing; requires additional memory during the resize operation

Each of these methods has its own trade-offs in terms of complexity, performance, and memory usage. The choice of collision resolution strategy can have a significant impact on the overall performance of the hash map.
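To ground the chaining strategy, here is a minimal sketch in Python (a toy with a fixed table size and no rehashing, not a production structure):

```python
class ChainedHashMap:
    """Toy hash map using chaining: each bucket is a list of (key, value) pairs."""

    def __init__(self, num_buckets: int = 10):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)

    def put(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key (possibly a collision): append

    def get(self, key):
        bucket = self.buckets[self._index(key)]
        for k, v in bucket:              # scan the chain -- O(n) worst case
            if k == key:
                return v
        raise KeyError(key)

m = ChainedHashMap()
m.put(23, "first")
m.put(33, "second")  # integer keys 23 and 33 collide in a 10-bucket table
print(m.get(33))     # "second"
```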
Here is a quick summary of the pros and cons of each collision resolution method:

| Feature | Chaining | Linear Probing | Quadratic Probing | Double Hashing |
|---|---|---|---|---|
| Average Time Complexity | O(1) | O(1) | O(1) | O(1) |
| Worst-case Time Complexity | O(n) | O(n) | O(n) | O(n) |
| Memory Overhead | High (pointers) | Low | Low | Low |
| Cache Performance | Poor | Good | Good | Moderate |
| Implementation Complexity | Simple | Simple | Moderate | Complex |
| Clustering Issues | None | Primary clustering | Secondary clustering | Minimal |
| Load Factor Tolerance | High (>1.0) | Low (<0.7) | Low-Medium (<0.7) | Medium (<0.8) |
| Deletion Complexity | Simple | Complex (tombstones) | Complex (tombstones) | Complex (tombstones) |
| Space Efficiency | Lower | Higher | Higher | Higher |
| Performance Degradation | Gradual | Rapid at high load | Moderate at high load | Slow at high load |
| Hash Function Requirements | One | One | One | Two |
| Best Use Cases | Unknown load factors, frequent deletions | Cache-friendly applications, low load | Better than linear, moderate load | High performance, predictable load |
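And to contrast with the chaining sketch above, here is a minimal open-addressing sketch using linear probing. Again a toy: fixed size, no deletion (so no tombstones), and it simply errors out when the table fills up, where a real implementation would rehash well before that point:

```python
class LinearProbingHashMap:
    """Toy open-addressing hash map with linear probing (no deletion, no resize)."""

    def __init__(self, num_slots: int = 10):
        self.slots = [None] * num_slots  # each slot holds (key, value) or None

    def _probe(self, key):
        """Yield slot indices in linear-probe order, starting at the home slot."""
        start = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def put(self, key, value) -> None:
        for idx in self._probe(key):
            if self.slots[idx] is None or self.slots[idx][0] == key:
                self.slots[idx] = (key, value)
                return
        raise RuntimeError("table full -- a real implementation would rehash")

    def get(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is None:
                break                    # hit an empty slot: key is absent
            if self.slots[idx][0] == key:
                return self.slots[idx][1]
        raise KeyError(key)

m = LinearProbingHashMap()
m.put(23, "first")
m.put(33, "second")   # collides with 23 at index 3, lands at index 4
print(m.get(33))      # "second"
```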
## Some Real-World Examples

### Programming Language Implementations

Many programming languages have built-in hash maps. These implementations often combine the techniques described above to provide efficient performance and handle collisions effectively.

- **Python**'s `dict` uses open addressing with randomized probing, rehashing when the load factor exceeds about 0.66.
- **Java**'s `HashMap` uses chaining with linked lists (converting large buckets to balanced trees since Java 8), rehashing at a 0.75 load factor.
- **C++**'s `unordered_map` typically uses chaining, but implementations may vary.
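You can observe CPython's rehashing indirectly: `sys.getsizeof` reports a dict's memory footprint, which jumps each time the underlying table is resized. The exact sizes and thresholds are implementation details that vary across versions, so treat this as an experiment rather than a guarantee:

```python
import sys

d = {}
last_size = sys.getsizeof(d)
for i in range(50):
    d[i] = i
    size = sys.getsizeof(d)
    if size != last_size:  # the table just grew and rehashed
        print(f"resized after {len(d)} items: {last_size} -> {size} bytes")
        last_size = size
```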
### Database Systems

Hash maps are also widely used in database indexing. Many database systems use hash indexes to speed up data retrieval. These indexes allow for fast lookups by hashing the indexed columns and storing the resulting key-value pairs in a hash table. When a query is executed, the database can quickly find the relevant rows by computing the hash of the query key and looking it up in the hash index. Some popular database systems that use hash indexing include:

- **PostgreSQL**: Supports hash indexes, although they are less commonly used than B-tree indexes.
- **MongoDB**: Uses hashed indexes for sharding and to support equality queries on hashed fields.
- **Redis**: Implements hash maps as a core data structure, allowing efficient storage and retrieval of key-value pairs.

These implementations often leverage the same underlying principles of hashing and collision resolution discussed earlier, but they may also incorporate additional optimizations specific to the database context.
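For instance, Redis exposes its hash structure directly. A minimal sketch using the `redis-py` client, assuming a Redis server on localhost (the key name `user:1` and its fields are just examples):

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store fields in a Redis hash, then read one back -- O(1) on average
r.hset("user:1", mapping={"name": "Ada", "language": "Python"})
print(r.hget("user:1", "name"))  # "Ada"
```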
### Version Control

Version control systems like Git use hashing to manage file changes and track versions efficiently. Each commit in Git is identified by a SHA-1 hash of its contents, which serves as a unique key for the commit object. This allows Git to quickly look up commits, branches, and other objects in the repository. Git doesn't use traditional hash table collision resolution; it is designed around the assumption that cryptographic hash collisions won't occur in practice.

## Putting It All Together: How Implementation Knowledge Matters

And it's not just about the theory! Understanding how hash maps are implemented in your programming language of choice can lead to significant performance improvements in your code. For example, Python's `dict` uses open addressing with optimized string handling, and knowing this helps you write faster code. Here's how inefficient and efficient versions of the same task compare:

### Bad Implementation (Fights Against Python's Dict)

```python
def count_words_bad(text):
    word_counts = {}
    words = text.split()
    for word in words:
        # This pattern performs redundant hash lookups!
        if word in word_counts:       # first lookup
            word_counts[word] += 1    # second lookup + assignment
        else:
            word_counts[word] = 1     # third lookup + assignment
    return word_counts
```

**Problems:**

- Multiple hash lookups per word (up to three!)
- Open addressing makes the key-missing checks expensive
- Doesn't leverage Python's dict optimizations

### Good Implementation (Works With Python's Dict)

```python
from collections import defaultdict, Counter

def count_words_good_v1(text):
    # defaultdict eliminates key existence checks
    word_counts = defaultdict(int)
    words = text.split()
    for word in words:
        word_counts[word] += 1  # single operation!
    return dict(word_counts)

def count_words_good_v2(text):
    # Counter is optimized specifically for this counting workload
    words = text.split()
    return Counter(words)

def count_words_good_v3(text):
    # dict.get() with a default avoids the separate membership test
    word_counts = {}
    words = text.split()
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1  # single lookup
    return word_counts
```
**Why These Are Better:**

- **Single hash operation** per word instead of multiple
- **Leverages Python's string optimization**: string keys are handled very efficiently
- **Works with open addressing**: fewer probing operations needed
- **Uses built-in optimizations** like `Counter`, which is tuned for Python's implementation

### Performance Difference

**Typical results:** the good implementations are often 2-3x faster, simply from understanding and working with Python's dict implementation rather than against it!
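You can check this on your own machine with a quick benchmark. This is a sketch assuming the functions above are in scope; actual numbers depend on your Python version and input:

```python
import timeit

text = "the quick brown fox jumps over the lazy dog " * 10_000

for fn in (count_words_bad, count_words_good_v1,
           count_words_good_v2, count_words_good_v3):
    seconds = timeit.timeit(lambda: fn(text), number=20)
    print(f"{fn.__name__}: {seconds:.3f}s")
```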
## Conclusion

Hash maps are among the most fundamental and powerful data structures in computer science, providing near-constant-time access to data that makes them indispensable in modern programming. Throughout this deep dive, we've explored how they achieve their remarkable O(1) average performance through clever use of hash functions, strategic collision resolution, and careful load factor management.

The key insight is that the "magic" of hash maps isn't really magic at all; it's the result of well-designed algorithms and data structures working together. Understanding these internals helps us avoid the O(n) worst-case scenarios and write more efficient code.

**Key Takeaways:**

- **Hash functions** are the foundation: they determine how evenly data is distributed and directly impact collision rates
- **Collision resolution strategies** each have distinct trade-offs: chaining for simplicity and robustness, open addressing for memory efficiency and cache performance
- **Load factor management** through rehashing prevents performance degradation as hash maps grow
- **Implementation knowledge** translates to real performance gains: understanding whether your language uses chaining or open addressing can make your code 2-3x faster

Whether you're optimizing a Python script, debugging performance issues in Java, or making architectural decisions for a database system, this understanding of hash map internals gives you the tools to make informed choices.

The next time you use a `dict`, a `HashMap`, or an `unordered_map`, you'll know exactly what's happening under the hood and how to make the most of these incredible data structures.

Hash maps truly are awesome, and now you know why!