Array And Set Operations: What You Need to Know

Written by anywhichway | Published 2023/02/28
Tech Story Tags: javascript | programming | javascript-arrays | optimization | data-pipeline | intersection | javascript-sets | javascript-tutorial | web-monetization

TLDR: I recently took on the task of re-writing several open-source JavaScript repositories to take advantage of recent optimizations in the JavaScript v8 engine. Some of these optimizations involved using built-in classes like `Map` and `Set` instead of arrays. When working with set operations on arrays, I uncovered a fundamental paradigm shift that provides performance boosts of 1x, 2x, or even orders of magnitude.

I recently took on the task of re-writing several open-source JavaScript repositories I maintain to take advantage of recent optimizations in the JavaScript v8 engine.

Some of these optimizations involved using built-in classes like Map and Set instead of arrays. When I first implemented the libraries, Map and Set were either not available or provided no real performance benefit over Array.

Other optimizations have included changing the order of array iteration or the loop type. At a micro-optimization level, looping backward is sometimes faster than looping forward, and a while loop can be faster than a for loop.

When working with set operations on arrays, I uncovered a fundamental paradigm shift that provides performance boosts of 1x, 2x, or even orders of magnitude.

Paradigm Shift

For many set operations, the ultimate members of a set can be known prior to the complete set being assembled. For instance, when asking for a union, every item in every collection (Sets or Arrays) will appear in the final result exactly once.

So, why not return all the items in the first collection in rapid succession and then the items in the subsequent collections one after the other so long as they have not already been returned?

This type of code can typically be implemented using a generator function, but surprisingly, none of the other open-source libraries against which I was comparing my code's performance were using generators effectively, with the exception of the Cartesian product.

Many libraries were simply focusing on shorter code by using recursive map and reduce calls with the spread operator. Other libraries were using generators in such a naïve way that there was little or no performance benefit.

Failing to rapidly return an initial value from a set operation in a data analytics pipeline can slow or eliminate the pipeline's ability to produce any final results, either through direct timeouts or through indirect timeouts caused by the inability to use parallelism downstream of the set operation.

Alternatively, the use of recursive functions or the failure to use on-demand generation can result in crashes due to memory consumption as intermediate values are computed.

Well-written generators will solve some of the memory and performance issues; however, sharing their algorithms with non-generator code is not straightforward.

This causes code bloat when non-generative approaches are warranted by the calling program because two sets of code must be maintained. It also means more unit testing to get appropriate coverage and more chance for error.

However, there is another alternative: the customizable iteration protocols of JavaScript (and Python) make it possible to implement a generative approach without generator functions.

Using a common pattern, these can be written to share logic between generative and non-generative approaches.

Writing A Shared Custom Iterable

Below is a union function in JavaScript:

function union(...args) {
    const nOthers = args.length - 1,
        memory = new Set();
    let j = 0;
    for(let i=0;i<=nOthers;i++) {
        const array = args[i],
            len = array.length;
        while(j<len) {
            const elem = array[j++];
            if(!memory.has(elem)) {
                memory.add(elem);
            }
        }
        j=0;
    }
    return [...memory];
}
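
Calling it with several arrays collects each distinct element in first-seen order:

union([1, 2, 3], [3, 4], [4, 5]); // [1, 2, 3, 4, 5]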

To prepare for creating an Iterable, we need to capture in a closure any variables that might change each time we ask for the next item, and add conditional logic for when we are iterating. So, we create a function unionizer that returns a function union.

const unionizer = (iterating) => {
    let i, j, nOthers, args, memory;
    return function union() {
        if(!args) { // initialize variables the first time union is called
            args = [...arguments];
            nOthers = args.length - 1;
            memory = new Set();
            i = 0;
            j = 0;
        }
        for(;i<=nOthers;i++) {
            const array = args[i],
                len = array.length;
            while(j<len) {
                const elem = array[j++];
                if(!memory.has(elem)) {
                    memory.add(elem);
                    if(iterating) { // conditionally yield result if iterating
                        return {value:elem};
                    }
                }
            }
            j=0;
        }
        args = null; // set args back to null so iterating can restart
        return iterating ? {done:true} : [...memory];
    }
}
const union = unionizer();

When called without an argument, unionizer will return a function that behaves just like our first implementation. It has only two additional boolean tests for iterating, so it will be almost indistinguishable from a performance perspective and only adds a few bytes in size.

The function will be EXTREMELY fast for the first array in the arguments because nothing will be in the memory variable. Since a Set is used for memory, it will be VERY fast for subsequent arrays, although it will slow down slightly when the accumulated result, i.e., memory, gets really large.

Next, the actual Iterable must be implemented:

function iterableUnion(...args) {
    const union = unionizer(true);
    return {
        next() {
            return union(...args);
        },
        [Symbol.iterator]() {
            return this;
        }
    }
}

By calling unionizer(true) from within iterableUnion, a function is created that returns value objects or {done:true} rather than just collecting all the results.

So, now there are two functions, union and iterableUnion, based on the same fundamental code. There is less to maintain, and they are almost identical in performance to the original code and a generator, respectively!
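
For example, with a couple of small illustrative arrays:

const evens = [2, 4, 6], threes = [3, 6, 9];

union(evens, threes); // [2, 4, 6, 3, 9]

// The iterable version yields each element as soon as it is known:
for (const elem of iterableUnion(evens, threes)) {
    console.log(elem); // 2, 4, 6, 3, 9, one at a time
}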

For those interested, here is a straight generator implementation of the union:

function* union(...args) {
    const nOthers = args.length - 1,
        memory = new Set();
    let j = 0;
    for(let i=0;i<=nOthers;i++) {
        let array = args[i];
        if(!Array.isArray(array)) args[i] = array = [...array];
        const len = array.length;
        while(j<len) {
            const elem = array[j++];
            if(!memory.has(elem)) {
                memory.add(elem);
                yield elem;
            }
        }
        j=0;
    }
}
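
As with the Iterable, the generator makes the first element available before the later, possibly much larger, arrays are even touched; hugeArray below is just placeholder data:

const hugeArray = Array.from({length: 1000000}, (_, i) => i); // placeholder data
const gen = union([1, 2, 3], hugeArray);
gen.next().value; // 1, returned without scanning hugeArray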

The same principles can be applied to difference, intersection, symmetricDifference, and CartesianProduct (a sketch of an intersection written this way appears after the createIterable helper below).

There are even more opportunities for optimizing a CartesianProduct that are beyond the scope of this article; see http://phrogz.net/lazy-cartesian-product.

In order to apply these principles in other cases, we can keep the code base small by implementing a function that creates an Iterable:

function createIterable(setFunction) {
    return function(...args) {
        const f = setFunction(true);
        return {
            next() {
                return f(...args);
            },
            [Symbol.iterator]() {
                return this;
            }
        }
    }
}

So now the definition of iterableUnion would look like this:

const iterableUnion = createIterable(unionizer);
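
To illustrate how the pattern generalizes, here is a rough sketch of an intersection written the same way. The name intersectionizer and its internals are my own illustration of the approach, not code from the library; it pre-indexes the later collections as Sets so membership tests on elements of the first collection are fast:

const intersectionizer = (iterating) => {
    let i, args, firstArray, otherSets, memory;
    return function intersection() {
        if(!args) { // initialize variables the first time intersection is called
            args = [...arguments];
            firstArray = args[0];
            otherSets = args.slice(1).map((collection) => new Set(collection));
            memory = new Set(); // prevents duplicates in the first array from being returned twice
            i = 0;
        }
        while(i<firstArray.length) {
            const elem = firstArray[i++];
            if(!memory.has(elem) && otherSets.every((set) => set.has(elem))) {
                memory.add(elem);
                if(iterating) { // conditionally yield result if iterating
                    return {value:elem};
                }
            }
        }
        args = null; // reset so iteration can restart
        return iterating ? {done:true} : [...memory];
    }
}

const intersection = intersectionizer();
const iterableIntersection = createIterable(intersectionizer);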

Using multiple arrays in excess of 10,000 items each as test data, these are the performance benefits over naïve approaches that I am seeing for access to the first item in a result using JavaScript:

  • difference, 1.5 to 2x
  • intersection, 1.5 to 2x
  • symmetric difference, 2.5 to 3x
  • union, 2.5 to 3 orders of magnitude
  • Cartesian Product, 3+ orders of magnitude
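
These numbers will vary by engine and data shape. As a rough way to measure time-to-first-item yourself, here is a minimal sketch, assuming an environment where performance.now() is available (Node or a browser); the array sizes are arbitrary:

const size = 100000;
const a = Array.from({length: size}, (_, i) => i);
const b = Array.from({length: size}, (_, i) => i + size / 2);

const collect = unionizer();             // the non-iterating union from above

let start = performance.now();
const firstEager = collect(a, b)[0];     // all of the work happens before anything is returned
console.log('eager first item:', firstEager, performance.now() - start, 'ms');

start = performance.now();
const firstLazy = iterableUnion(a, b).next().value; // only one element is produced
console.log('lazy first item:', firstLazy, performance.now() - start, 'ms');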

The performance of generators and Iterables is typically within each other's margin of error, but Iterables give us less code to maintain.

I hope you found this article useful for optimizing your data pipeline. The library array-set-ops uses the principles outlined above plus more nuanced optimizations and provides a consistent API for set manipulations on Array and Set, while also adding all the standard functions like map, reduce, slice, etc. to Set and CartesianProduct.

Image Credit: Dall-E High Speed Programmer




Written by anywhichway | Working in the clouds around Seattle on open source projects. Sailing when it's clear.
Published by HackerNoon on 2023/02/28