I recently took on the task of rewriting several open-source JavaScript repositories I maintain to take advantage of recent optimizations in the V8 JavaScript engine.
Some of these optimizations involved using built-in classes like Map and Set instead of arrays. When I first implemented the libraries, Map and Set either were not available or provided no real performance benefit over Array.
Other optimizations have included changing the order of array iteration or the loop types. At a micro-optimization level, looping backward is sometimes faster than looping forward, and while can be faster than for.
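As a sketch of that micro-optimization, here are the two loop styles applied to a simple summation (a hypothetical example; actual performance differences vary by engine version and workload):

```javascript
// Forward for-loop: the conventional approach.
function sumForward(arr) {
  let total = 0;
  for (let i = 0; i < arr.length; i++) total += arr[i];
  return total;
}

// Backward while-loop: i-- doubles as both the decrement and the
// loop test against zero, so there is one less comparison per pass.
function sumBackward(arr) {
  let total = 0,
    i = arr.length;
  while (i--) total += arr[i];
  return total;
}

console.log(sumForward([1, 2, 3]), sumBackward([1, 2, 3])); // 6 6
```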
When working with set operations on arrays, I uncovered a fundamental paradigm shift that provides performance boosts ranging from a few times faster to orders of magnitude faster.
For many set operations, the ultimate members of a set can be known before the complete set has been assembled. For instance, when asking for a union, every item in every collection (Sets or Arrays) will appear in the final result exactly once.
So, why not return all the items in the first collection in rapid succession and then the items in the subsequent collections one after the other so long as they have not already been returned?
This type of code can typically be implemented using a generator function, but surprisingly, none of the other open-source libraries against which I was comparing my code's performance were using generators effectively, with the exception of the Cartesian product.
Many libraries simply focused on shorter code, using recursive map and reduce calls with the spread operator. Other libraries used generators in such a naïve way that there was little or no performance benefit.
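The compact-but-eager style looks something like this (a hypothetical naiveUnion, not taken from any particular library):

```javascript
// Dedupe by building the entire concatenated array first; no element of
// the result is available until every input has been fully copied.
const naiveUnion = (...arrays) => [...new Set([].concat(...arrays))];

console.log(naiveUnion([1, 2], [2, 3], [3, 4])); // [1, 2, 3, 4]
```

Short, but the whole intermediate array must exist in memory before the first unique item can be handed to a consumer.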
The failure to rapidly return an initial value from a set operation in a data analytics pipeline can slow the pipeline, or prevent it from producing any final results at all, whether through direct timeouts or through indirect timeouts caused by the inability to exploit parallelism downstream of the set operation.
Alternatively, the use of recursive functions, or the failure to generate results on demand, can crash the process as memory is consumed by intermediate values.
Well-written generators will solve some of the memory and performance issues; however, sharing their algorithms with non-generator code is not straightforward.
This causes code bloat when the calling program warrants a non-generative approach, because two sets of code must be maintained. It also means more unit testing to achieve appropriate coverage, and more chances for error.
However, there is another alternative: custom iteration protocols.
Using a common pattern, these can be written to share logic between generative and non-generative approaches.
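Before applying the protocol to set operations, here is a minimal hand-rolled iterable (a counter, purely for illustration) showing the shape of the pattern: an object whose next method returns {value} objects until it returns {done: true}, and whose [Symbol.iterator] method returns the object itself.

```javascript
// A minimal custom iterable: counts from 1 to n with no generator involved.
function range(n) {
  let i = 0;
  return {
    next() {
      return i < n ? { value: ++i } : { done: true };
    },
    [Symbol.iterator]() {
      return this; // lets for...of and spread consume this object directly
    }
  };
}

console.log([...range(3)]); // [1, 2, 3]
```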
Below is a union function in JavaScript:
```javascript
function union(...args) {
  const nOthers = args.length - 1,
    memory = new Set();
  let j = 0;
  for (let i = 0; i <= nOthers; i++) {
    const array = args[i],
      len = array.length;
    while (j < len) {
      const elem = array[j++];
      if (!memory.has(elem)) {
        memory.add(elem);
      }
    }
    j = 0;
  }
  return [...memory];
}
```
To prepare for creating an Iterable, we need to capture in a closure any variables that might change each time we ask for the next item, and add a conditional for when we are iterating. So, we create a function unionizer that returns a function union.
```javascript
const unionizer = (iterating) => {
  let i, j, nOthers, args, memory;
  return function union() {
    if (!args) { // initialize variables the first time union is called
      args = [...arguments];
      nOthers = args.length - 1;
      memory = new Set();
      i = 0;
      j = 0;
    }
    for (; i <= nOthers; i++) {
      const array = args[i],
        len = array.length;
      while (j < len) {
        const elem = array[j++];
        if (!memory.has(elem)) {
          memory.add(elem);
          if (iterating) { // conditionally yield a result if iterating
            return { value: elem };
          }
        }
      }
      j = 0;
    }
    args = null; // set args back to null so iterating can restart
    return iterating ? { done: true } : [...memory];
  };
};

const union = unionizer();
```
When called without an argument, unionizer will return a function that behaves just like our first implementation. It has only two additional boolean tests for iterating, so it will be almost indistinguishable from a performance perspective and adds only a few bytes in size.
The function will be EXTREMELY fast for the first array in arguments because nothing will yet be in the memory variable. Since a Set is used for memory, it will be VERY fast for subsequent arrays, although it will slow down slightly when memory, i.e., the union accumulated so far, gets really large.
Next, the actual Iterable must be implemented:
```javascript
function iterableUnion(...args) {
  const union = unionizer(true);
  return {
    next() {
      return union(...args);
    },
    [Symbol.iterator]() {
      return this;
    }
  };
}
```
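To see the laziness in action, the sketch below restates unionizer and iterableUnion so it runs standalone, then stops after the first two unique items, so the later inputs are never scanned:

```javascript
// Restated from the article so this sketch is self-contained.
const unionizer = (iterating) => {
  let i, j, nOthers, args, memory;
  return function union() {
    if (!args) { // initialize state on the first call
      args = [...arguments];
      nOthers = args.length - 1;
      memory = new Set();
      i = 0;
      j = 0;
    }
    for (; i <= nOthers; i++) {
      const array = args[i],
        len = array.length;
      while (j < len) {
        const elem = array[j++];
        if (!memory.has(elem)) {
          memory.add(elem);
          if (iterating) return { value: elem };
        }
      }
      j = 0;
    }
    args = null;
    return iterating ? { done: true } : [...memory];
  };
};

function iterableUnion(...args) {
  const union = unionizer(true);
  return {
    next() {
      return union(...args);
    },
    [Symbol.iterator]() {
      return this;
    }
  };
}

// Consume lazily: stop after two unique items without ever touching
// the remainder of the inputs.
const result = [];
for (const item of iterableUnion([1, 2, 3], [3, 4])) {
  result.push(item);
  if (result.length === 2) break;
}
console.log(result); // [1, 2]
```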
By calling unionizer(true) from within iterableUnion, we create a function that returns value objects or {done:true} rather than just collecting all the results.
So, now there are two functions, union and iterableUnion, based on the same fundamental code: less to maintain, and almost identical in performance to the original code or a generator, respectively!
For those interested, here is a straight generator implementation of the union:
```javascript
function* union(...args) {
  const nOthers = args.length - 1,
    memory = new Set();
  let j = 0;
  for (let i = 0; i <= nOthers; i++) {
    let array = args[i];
    if (!Array.isArray(array)) args[i] = array = [...array];
    const len = array.length;
    while (j < len) {
      const elem = array[j++];
      if (!memory.has(elem)) {
        memory.add(elem);
        yield elem;
      }
    }
    j = 0;
  }
}
```
The same principles can be applied to difference, intersection, symmetricIntersection, and CartesianProduct.
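As one illustration, the same closure pattern applied to intersection might look like this (a hypothetical intersector, not from the article; it assumes array inputs and a duplicate-free first array):

```javascript
// Same pattern as the union closure: elements of the first collection are
// emitted as soon as they are found in every other collection.
const intersector = (iterating) => {
  let i, args, first, others;
  return function intersection() {
    if (!args) { // initialize state on the first call
      args = [...arguments];
      first = args[0];
      others = args.slice(1).map((a) => new Set(a)); // O(1) membership tests
      i = 0;
    }
    const results = iterating ? null : [];
    for (; i < first.length; i++) {
      const elem = first[i];
      if (others.every((set) => set.has(elem))) {
        if (iterating) {
          i++; // resume after this element on the next call
          return { value: elem };
        }
        results.push(elem);
      }
    }
    args = null; // allow the closure to be reused
    return iterating ? { done: true } : results;
  };
};

const intersection = intersector();
console.log(intersection([1, 2, 3, 4], [2, 4, 5], [4, 2])); // [2, 4]
```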
There are even more opportunities for optimizing a CartesianProduct, but they are beyond the scope of this article.
To apply these principles in other cases, we can keep the code base small by implementing a function that creates an Iterable:
```javascript
function createIterable(setFunction) {
  return function(...args) {
    const f = setFunction(true);
    return {
      next() {
        return f(...args);
      },
      [Symbol.iterator]() {
        return this;
      }
    };
  };
}
```
So now the definition of iterableUnion would look like this:

```javascript
const iterableUnion = createIterable(unionizer);
```
Using multiple arrays in excess of 10,000 items each as test data, these are the performance benefits vs naïve approaches I am seeing for access to the first item in a result using JavaScript:
The performance of generators and Iterables is typically within each other's margin of error, but Iterables give us less code to maintain.
I hope you found this article useful for optimizing your data pipeline. My library applies these techniques to Array and Set, while also adding all the standard functions like map, reduce, slice, etc., plus toSet and CartesianProduct.
Image Credit: Dall-E High Speed Programmer