136 reads

Learn How to Pick the Correct Dictionary in C++ With Ordered Containers

by Denis TDecember 5th, 2024

Too Long; Didn't Read

When it comes to picking an associative container (like a hash map), modern C++ has a lot to offer. Sometimes, it is hard to select the most efficient dictionary data structure even for a professional engineer. In this first article, I review ordered containers the C++23 standard library and the most prominent third-party libraries provide, and how to pick the most suitable ordered associative container.

featured image - Learn How to Pick the Correct Dictionary in C++ With Ordered Containers

When it comes to picking an associative container (like a hash map), modern C++ has a lot to offer. Sometimes, it is hard to select the most efficient dictionary data structure even for a professional engineer. Let's see what the C++23 standard library and the most prominent third-party libraries (such as boost, folly, abseil) provide, and how to pick the most suitable associative container. At the end of this series, I'll share a diagram to help you choose the best container for various purposes.

Associative container types and characteristics

There are two major types of associative containers in C++: maps and sets. Maps allow you to store some value along with the key, and search for the value by its key. Sets only store keys but not values, and are useful when it is necessary to check if the key is present, but not retrieve the associated value. In general, all maps and sets allow adding a key to the container, erasing it, checking if some key is present, and changing the associated value for the existing key (maps only). Once the key is added to the map or set, it becomes constant and can no longer be changed.

Another way to categorize associative containers is key ordering. There are ordered and unordered containers. In the first category, keys are sorted according to some comparison criteria and maintain their order at all times. In unordered containers, there is no defined order to the keys, and it can change at any time.

Looking at the internal structure, there are node-based containers, and ones not based on separate key/key-value nodes (often called flat). Node-based ones provide stricter reference and iterator stability guarantees. Such containers often allow manipulating separate nodes (for example, moving a node from one container to another). However, node-based containers are usually less efficient than their flat counterparts.

Finally, containers can provide different thread safety guarantees. Most containers only allow concurrent reads when no modification happens, but some special containers provide wider capabilities.

Let's look at the above container categories in more detail. In this first part, I will only cover ordered containers.

Ordered containers

Some containers are ordered by keys according to the comparison criteria you specify. By default, it's std::less (link), which means, keys are compared with the operator <, and are ordered from smaller to greater. This enables several extra operations, which are not available in unordered containers.

For example, you can search for the key which is not less than the one you provide. The following code will find the exact key if it's present in the map, or the key which would be positioned next to it in the key order.

#include <iostream>
#include <map>

int main() {
    std::map<int, std::string> map{
        {1, "string 1"},
        {5, "string 5"},
        {0, "string 0"},
        {3, "string 3"}
    };

    if (auto it = map.lower_bound(2); it != map.end()) {
        std::cout << it->first << " -> " << it->second << "\n";
    }
}

This program will print 3 -> string 3, because the key 2 is absent in the map, and the next existing key in order after 2 is 3.

When you iterate over an ordered container, you will visit keys in order, too:

#include <iostream>
#include <set>

int main() {
    std::set<int> set{0, 1, 4, -10, 5};
    set.insert(3);

    for (int value : set) {
        std::cout << value << ' ';
    }
}

This program prints: -10 0 1 3 4 5 (ordered from the smallest to the largest key).

Node-based containers

There are several ordered associative containers in the C++ STL: std::map (link), std::multimap (link), and their value-less variants std::set (link) and std::multiset (link). Multi containers allow to store more than one key with equivalent values. These containers are almost always implemented using some form of a binary search tree, such as a red-black or an AVL tree. They are node-based, which means, they allocate a separate node for each new key (or a new key-value pair), which is somewhat inefficient and is not CPU cache-friendly. On the other hand, this makes map and set iterators and references stable. Even if some element is added or removed, iterators and references to the existing elements do not change. This makes the following code valid:

#include <iostream>
#include <map>

int main() {
    std::map<std::string, int> map{
        {"test", 123}
    };

    auto& value = map["test"];
    auto valueIt = map.find("test");

    // Map is modified:
    map.emplace("test2", 456);

    // References and iterators to the existing elements
    // remain valid. This line prints "123, 123":
    std::cout << value << ", " << valueIt->second << "\n";
}

Insertion, deletion, key search, and upper/lower bound operations time complexity is always O(logN).

Here is how the map from the first example may look like in memory:

Each rectangle is a separate node, which contains both the key and the value. Nodes are connected using left and right pointers, and the tree always maintains the properties of the binary search tree. The library keeps the tree balanced, so that all branches are approximately the same depth, which guarantees the O(logN) time complexity for all search/insertion/deletion operations.

Flat ordered containers

A "flat" (not based on nodes) alternatives to node-based containers are C++23 std::flat_map (link), std::flat_set (link), std::flat_multimap (link) and std::flat_multiset (link). If C++23 is not available in your compiler yet, you can use boost flat containers, which provide compatible interfaces. Flat ordered containers are useful in cases when there are not too many elements, or when insertions and deletions are rare. These containers store everything in a contiguous block of memory (or a few contiguous blocks). This makes insertion and deletion time complexity O(N), which is much worse than in binary search tree node-based containers. Key search time complexity is still O(logN). These containers do not provide reference and iterator stability, but use less memory, do fewer allocations, and are more CPU cache-friendly. There are third-party flat ordered containers such as folly::sorted_vector_map and folly::sorted_vector_set (link). They do not allow duplicate keys (just like std::flat_map and std::flat_set).

Here is how std::flat_map may look like in memory:

There are two contiguous blocks of memory, one containing keys in sorted order, and another containing the corresponding values.

Order statistic trees

None of the above containers provide all possible operations you might need. For example, it is impossible to get the number of keys that precede the given key (this operation is called rank). It is also not possible to get the i-th key (or the corresponding value) in order (this is the select operation). However, both operations can be supported by a binary search tree with O(logN) time complexity. A tree that allows such operations is called an order statistic tree.

The only container I am aware of, which allows such operations is boost::multi_index_container. Boost Multi Index library provides much more than that, but let's focus on rank and select for now. Here is an example showing one way to create and use an order statistic tree:

#include <iostream>
#include <string>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ranked_index.hpp>

struct KeyValue {
    std::string key;
    int value{};
};

using OrderStatisticTree = boost::multi_index_container<
    KeyValue,
    boost::multi_index::indexed_by<
        boost::multi_index::ranked_unique<
                boost::multi_index::member<
                    KeyValue, std::string, &KeyValue::key>
            >
        >
    >;

int main() {
    OrderStatisticTree tree;
    tree.insert(KeyValue{
        .key = "abc",
        .value = 5
    });
    tree.insert(KeyValue{
        .key = "zkey",
        .value = 10
    });
    tree.insert(KeyValue{
        .key = "def",
        .value = 20
    });
    tree.insert(KeyValue{
        .key = "aaaa",
        .value = 50
    });

    auto defIt = tree.find("def");
    std::cout << "def rank: " << tree.rank(defIt)  << "\n";

    auto it2 = tree.nth(2);
    std::cout << "Key with index 2: " << it2->key  << "\n";
}

First, we define the KeyValue structure, which has both the key and the value for our container.
Then, we define the OrderStatisticTree type, and we ask the Multi Index library to index the tree by KeyValue::key using the ranked_unique index type.
Then, we create the tree, and add a few elements to it by calling insert.
The rank call returns the def key rank. The next line will output def rank: 2, because there are two keys that are positioned before the def in order (aaaa, abc, def).
The nth call is the "select" operation, and it returns an iterator to the key with index 2 in order (which will be def again: the next line will output Key with index 2: def).

The ranked_unique index type corresponds to std::map. There is also the ranked_non_unique index type, which corresponds to std::multimap. Of course, classic ordered container operations are also available: you can call insert, emplace, find, upper_bound and lower_bound, erase, contains, etc. Time complexity of all the above operations is always O(logN). Multi Index guarantees iterator and reference stability, just as C++ standard maps and sets. Containers with ranked indexes will use more memory than the plain std::map, because they need to store additional data related to key ranks.

There are more rank/select methods provided by the ranked index types, which may be useful in your use case.

Boost Multi Index container memory layout is more complex than the classic node-based or flat containers, as it allows multiple indexes into the same set of values:

Which one do I use?

If you need the key order or upper/lower bound operations, pick one of the above containers.
If you need to associate values with the keys, pick map containers, otherwise, choose set ones.
If the number of elements in the container is very small, or if container modifications are rare, take one of the flat options. If you never modify a container after inserting some initial elements, use std::vector (link) or std::array (link), sort them once, and use algorithms like std::binary_search (link) or std::equal_range (link) to find the key. Otherwise, go with the std::map / std::set ones.
If you need to store duplicate keys, pick "multi" containers (they also have an additional equal_range operation which returns iterators to the range of keys equivalent to the one you are looking for).
If you need statistics operations like rank and select, there is no other choice but boost::multi_index_container.

Remember that all of the above containers are only thread-safe when no modification is happening. If in your case one thread may modify the container while other threads access it, you will need to implement manual synchronization (for example, using mutexes). I will cover the thread safety aspect in more detail in the next article.