Software Development Engineer at AWS Elemental
Note: this article has been updated as of 2018-Nov-12. See the section titled “Updates” at the end of this article for more.
Recently at my job we ran into a problem: we needed to generate some of the possible combinations of a set of questions and answers, but we only needed a small subset of these combinations. Plus, we needed this subset to be distributed randomly across the range of possibilities (while maintaining an even spread). This way, we could generate a strong model of our data after running some statistics based on our combinations.
Sure, we could just generate all possible combinations and use a built-in library to randomly select some of them. This is a fine solution when the range of possible combinations is relatively small (less than 1,000,000, give or take). But what about when you have 197,074,944 possible combinations? Yes, that is how many existed for our small set of questions and answers.
If you would like to skip to the actual implementation, feel free to scroll down to the section: Solution With Python
One of the first things I tried was using Python's itertools library to generate all of the combinations at once and then select a random subset from that list. The nice thing about itertools is that it generates these combinations really fast. The not-so-nice thing is that it generates all 197,074,944 combinations really fast, which means that in the blink of an eye I watched my RAM consumption climb from ~2 GB to my machine's maximum in about 5 seconds. After that, the script kept trying to use paged memory to keep building combinations, and I had to do a hard reset to bring my machine back to life.
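The naive version looked something like the sketch below. The placeholder lists here stand in for the real question data, which had far more options; the point is that list(itertools.product(...)) materializes every combination in memory at once.

```python
import itertools
import random

# Placeholder answer lists; the real data set had enough options to
# produce 197,074,944 combinations.
questions = [
    [1, 2, 3, 4],
    ['foo', 'bar'],
    ['x', 'y', 'z'],
    ['one', 'two', 'three'],
]

# This builds every combination in memory at once -- fine for the
# 72 rows here, catastrophic for ~200 million.
all_combos = list(itertools.product(*questions))
sample = random.sample(all_combos, 5)
print(sample)
```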
Clearly, this approach isn’t going to work for such a large set of combinations.
We needed to find a method that would avoid filling up memory space while maintaining reasonable performance.
Solution With Python
TL;DR: You can find any combination of a Cartesian Product at index n without needing to calculate every single combination.
With this algorithm, you can virtually eliminate the memory problem, keep the cost of calculating an entry roughly constant (it does not grow with the number of possible combinations), and only worry about disk space if you store the combinations.
Plus, we can now find our subset of entries without needing to calculate all possible combinations.
Using the pseudocode from the link above, the resulting Python code is below:
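The embedded snippet does not render here, so what follows is a minimal reconstruction of the idea rather than the article's exact code: treat the index as a mixed-radix number whose digits each select one element from one set.

```python
import itertools  # only used for the sanity check at the bottom

def get_combination(index, sets):
    """Return the combination at position `index` in the Cartesian
    product of `sets`, without generating any other combination."""
    combo = []
    # Peel off mixed-radix digits from least significant (the last set,
    # which varies fastest, matching itertools.product order) to most.
    for options in reversed(sets):
        index, digit = divmod(index, len(options))
        combo.append(options[digit])
    combo.reverse()
    return combo

sets = [[1, 2, 3, 4], ['foo', 'bar'], ['x', 'y', 'z'], ['one', 'two', 'three']]
print(get_combination(8, sets))  # -> [1, 'foo', 'z', 'three']
# Agrees with full enumeration:
assert tuple(get_combination(8, sets)) == list(itertools.product(*sets))[8]
```

Because each lookup is just one divmod per set, the cost of an entry is proportional to the number of sets and independent of how many total combinations exist.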
Here is an example showing off how the code functions:
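The example embed is also missing; here is a reconstruction of an example.py whose sets and indices are assumptions chosen so that its output matches the terminal session shown afterward:

```python
# example.py -- the original embed is missing; the sets and indices
# below are assumptions chosen so the printed rows match the output
# shown in the article.

def get_combination(index, sets):
    """Return the combination at position `index` in the Cartesian
    product of `sets`, without generating any other combination."""
    combo = []
    for options in reversed(sets):
        index, digit = divmod(index, len(options))
        combo.append(options[digit])
    combo.reverse()
    return combo

sets = [
    [1, 2, 3, 4],
    ['foo', 'bar'],
    ['x', 'y', 'z'],
    ['one', 'two', 'three'],
]

for index in (8, 32, 67):
    print(get_combination(index, sets))
```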
If we run this in a terminal, we’ll get these results:
$ python example.py
[1, 'foo', 'z', 'three']
[2, 'bar', 'y', 'three']
[4, 'bar', 'y', 'two']
At my work, we wrapped this in a script so that we could generate a random subset of about 1,000,000 records (out of the 197,074,944 possible combinations) and save the combinations to a .csv file. When all was said and done, the script took only about 10 seconds to complete on my machine, most of that likely spent on file I/O.
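That wrapper script is not shown, but the idea can be sketched as below; the function names and CSV layout are my own assumptions. Note that random.sample on a range draws distinct indices without ever materializing the full product.

```python
import csv
import random
from math import prod

def get_combination(index, sets):
    """Return the combination at position `index` in the Cartesian
    product of `sets`, without generating any other combination."""
    combo = []
    for options in reversed(sets):
        index, digit = divmod(index, len(options))
        combo.append(options[digit])
    combo.reverse()
    return combo

def sample_to_csv(sets, sample_size, path):
    """Write `sample_size` distinct random combinations to a CSV file."""
    total = prod(len(options) for options in sets)
    # range(total) is lazy, so this never builds the full product.
    indices = random.sample(range(total), sample_size)
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        for index in indices:
            writer.writerow(get_combination(index, sets))
```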
Ultimately, this algorithm can be implemented in any other language as needed. For my work, I also wrote it in C# for a desktop application, and I imagine an implementation in C or Rust would perform even better.
Updates
Upon further inspection, the code above begins to fail as the number of possible combinations grows. The short version: the way Python performs floating-point division causes the algorithm to miscalculate the index of the next element to select. Thanks to woudsma, a solution has been found using the package bigfloat:
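To illustrate the failure mode (this is a general property of Python floats, not a quote of the original code): true division with / converts large integers to 64-bit floats, which cannot represent every integer above 2**53, so a computed index can silently drift. Arbitrary-precision arithmetic, whether a package like bigfloat or plain integer floor division and divmod, avoids the rounding:

```python
big = 2 ** 53 + 1  # the first integer a 64-bit float cannot represent

# True division rounds through a float and lands on the wrong integer:
assert int(big / 1) != big

# Floor division and divmod stay in exact integer arithmetic:
assert big // 1 == big
assert divmod(big, 1) == (big, 0)
```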
If you find that your results aren’t correct, I highly recommend using this excellent fork of the code.
For further discussion see the responses below which detail these discoveries in full.