Generating the Nth Cartesian Product by@tyler-burdsall

# Generating the Nth Cartesian Product

### @tyler-burdsallTyler Burdsall

Software Development Engineer at AWS Elemental

Recently at my job we ran into a problem: we needed to generate some of the possible combinations of a set of questions and answers, but we only needed a small subset of these combinations. Plus, we needed this subset to be distributed randomly across the range of possibilities (while maintaining an even spread). This way, we could generate a strong model of our data after running some statistics based on our combinations.

Sure, we could just generate all possible combinations and use a built-in library to randomly select some of the possible combinations. This is a good solution when your range of possible combinations is relatively small (less than 1,000,000; give or take). But what about when you have 197,074,944 possible combinations? Yes, this is how many possible combinations existed for our small set of questions and answers.

If you would like to skip to the actual implementation, feel free to scroll down to the section: Solution With Python

### How NOT to Approach the Problem

One of the first things I tried doing was using Python’s `itertools` library to generate all of the combinations at once and then select a random subset from that list. The nice thing about `itertools` is that it will generate these combinations really fast. The not-so-nice thing is that it will generate all 197,074,944 combinations really fast, which means that in a blink of an eye I watched my RAM consumption go from ~2GB to my machine’s maximum in about 5 seconds. After that, the script kept trying to use paged memory to keep building these combinations and I had to do a hard reset to bring my machine back to life.

Clearly, this approach isn’t going to work for such a large set of combinations.

We needed to find a method that would avoid filling up memory space while maintaining reasonable performance.

### Enter: the Lazy Cartesian Product

After spending hours researching the topic, I finally came across this excellent article: http://phrogz.net/lazy-cartesian-product . The author does an amazing job explaining some of the theory as well as an implementation in JavaScript.

TL;DR: You can find any combination of a Cartesian Product at index n without needing to calculate every single combination.

With this algorithm, you can now virtually eliminate the issue of memory space, maintain roughly O(1) performance when calculating an entry, and only worry about disk space if storing combinations.

Plus, we can now find our subset of entries without needing to calculate all possible combinations.

### Solution with Python

Using the psuedocode from the link above, the resulting Python code is below:

Here is an example showing off how the code functions:

If we run this in a terminal, we’ll get these results:

``\$ python example.py[1, 'foo', 'z', 'three'][2, 'bar', 'y', 'three'][4, 'bar', 'y', 'two']\$``

### End Result

At my work, we wrapped this in a script so that we could generate a random subset of about 1,000,000 records (out of the 197,074,944 possible combinations) and save the combinations to a `.csv` file. When it was all said and done, the script only took about 10 seconds to complete on my machine (most likely due to file I/O).

Ultimately, this algorithm can be implemented in any other language as necessary. For my work, I also wrote the algorithm with C# for a desktop application, and I can imagine something written in C/Rust would yield even higher-performing results.