I recently took on the task of re-writing several open-source JavaScript repositories I maintain to take advantage of recent optimizations in the JavaScript v8 engine. Some of these optimizations involved using built-in classes like and instead of arrays. When I first implemented the libraries, either and were not available or they provided no real performance benefit over . Map Set Map Set Array Other optimizations have included changing the order of array iteration or loop types. At a micro-optimization level, sometimes looping backward is faster than forwards and can be faster than . while for When working with set operations on arrays, I uncovered a fundamental paradigm shift that provides performance boosts of 1x, 2x, or even orders of magnitude. Paradigm Shift For many set operations, the ultimate members of a set can be known prior to the complete set being assembled. For instance, when asking for a union, every item in every collection (Sets or Arrays) will be in the final result at least and at most once. So, why not return all the items in the first collection in rapid succession and then the items in the subsequent collections one after the other so long as they have not already been returned? This type of code can typically be implemented using a generator function, but surprisingly, none of the other open-source libraries to which I was comparing the performance of my code, with the exception of the cartesian product, were using them effectively. Many libraries were simply focusing on shorter code by using a recursive map and reducing functions with the spread operator. Other libraries were using generators in such a naïve way that there was little or no performance benefit. The failure to rapidly return an initial value from a set operation while running a data analytics pipeline can slow or eliminate the ability of the pipeline to produce any final results because of direct timeouts or indirect timeouts caused by the inability to use parallelism downstream of the set operation. Alternatively, the use of recursive functions or the failure to use on-demand generation can result in crashes due to memory consumption as intermediate values are computed. Well-written generators will solve some of the memory and performance issues, however, sharing their algorithms with non-generator code is not straightforward. This causes code bloat when non-generative approaches are warranted by the calling program because two sets of code must be maintained. It also means more unit testing to get appropriate coverage and more chance for error. However, there is another alternative, customizable iteration protocols for and that implement a generative approach without generic generators. JavaScript Python Using a common pattern, these can be written to share logic between generative and non-generative approaches. Writing A Shared Custom Iterable Below is a union function in JavaScript: function union(...args) { const nOthers = args.length - 1, memory = new Set(); let j = 0; for(let i=0;i<=nOthers;i++) { const array = args[i], len = array.length; while(j<len) { const elem = array[j++]; if(!memory.has(elem)) { memory.add(elem); } } j=0; } return [...memory]; } To prepare for creating an , we need to capture in a closure any variables that might change each time we ask for the next item and conditionalize for when we are iterating. So, we create a function that returns a function . Iterable unionizer union const unionizor = (iterating) => { let i, j, nOthers, args, memory; return function union() { if(!args) { // initilaize variables the first time union is called args = [...arguments]; nOthers = args.length - 1; memory = new Set(); i = 0; j = 0; } for(;i<=nOthers;i++) { const array = args[i], len = array.length; while(j<len) { const elem = array[j++]; if(!memory.has(elem)) { memory.add(elem); if(iterating) { // conditionaly yield result if iterating return {value:elem} } } } j=0; } args = null; // set args back to null so iterating can restart return iterating ? {done:true} : [...memory]; } } const union = unionizer(); When called without an argument, will return a function that behaves just like our first implementation. It has only two additional boolean tests for , so it will be almost indistinguishable from a performance perspective and only adds a few bytes in size. unionizer iterating The function will be EXTREMELY fast for the first array in the because nothing will be in the variable. Since a is used for memory, it will be VERY fast for subsequent arrays, although it will slow down slightly when the intersection, i.e., gets really large. arguments memory Set memory Next, the actual must be implemented: Iterable function iterableUnion(...args) { const union = unionizer(true); let started; return { next() { return union(...args); }, [Symbol.iterator]() { return this; } } } By calling from within , a function that returns value objects or rather than just collecting all the results is created. unionizer(true) iterableUnion {done:true} So, now there are two functions, and , based on the same fundamental code. Less to maintain and almost identical in performance to the original code or a generator respectively! union iterableUnion For those interested, here is a straight generator implementation of the union: function* union(...args) { const nOthers = args.length - 1, memory = new Set(); let j = 0; for(let i=0;i<nOthers;i++) { let array = args[i]; if(!Array.isArray(array)) args[i] = array = [...array]; const len = array.length; while(j<len) { const elem = array[j++]; if(!memory.has(elem)) { memory.add(elem); yield elem; } } j=0; } } The same principles can be applied to , , , and . difference intersection symmetricIntersection CartesianProduct Although, there are even more opportunities for optimizing a beyond the scope of this article, see . CartesianProduct http://phrogz.net/lazy-cartesian-product In order to apply these principles in other cases, we can keep the code base small by implementing a function that creates an : Iterable function createIterable(setFunction) { return function(...args) { const f= setFunction(true); let started; return { next() { return f(...args); }, [Symbol.iterator]() { return this; } } } } So now the definition of would look like this: iterableUnion const iterableUnion = createIterable(unionizor); Using multiple arrays in excess of 10,000 items each as test data, these are the performance benefits vs naïve approaches I am seeing for access to the first item in a result using JavaScript: difference, 1.5 to 2x intersection, 1.5 to 2x symmetric difference, 2.5 to 3x union 2.5 to 3 orders of magnitude Cartesian Product, 3+ orders of magnitude The performance of generators and are typically within each other’s margin of error, but give us less code to maintain. Iterables Iterables I hope you found this article useful for optimizing your data pipeline. The library uses the principles outlined above plus more nuanced optimizations and provides a consistent API for set manipulations on and while also adding all the standard functions like , , , etc. to and . array-set-ops Array Set, map reduce slice Set CartesianProduct Image Credit: Dall-E High Speed Programmer Also published here