Time, Space and Interviews

Written by terrycrowley | Published 2018/02/16
Tech Story Tags: programming | interviewing | microsoft | time | space


Like most software companies, almost every technical interview at Microsoft involves some type of coding problem. I had a number of different problems I used over my time there. They all generally shared the characteristic that you could have a good discussion of the general problem space, design solutions, and tradeoffs, but only a small amount of code needed to be written and the code was not overly tricky.

Some interviewers like to dig into arcane details of specific technologies, but I did not find those types of questions too useful. This was especially true since the vast majority of my interviewing was of college candidates, who could not be expected to have deep background in a specific technology. A special form of these interviews was the half-hour on-campus interview. These are especially tough since you only have 10–15 minutes for the coding part of the interview and you don't want to spend a lot of it with the candidate just struggling to get started or even to understand the problem. You also want it to be meaty enough that you can have a good, deeper discussion if the candidate does well quickly.

The question I liked to ask was “write a function to remove the duplicates from an unsorted array of integers”. The problem is easy to state and understand so you don’t have to spend a lot of time on set up.

Before diving into possible solutions, the first follow-up question I ask is whether the problem is really as well-specified as I first indicated. What questions might I like to ask about the characteristics of the problem and its running environment to guide implementation choices? This is where you can start to get a sense of what kind of experience and background the candidate has, even before they have tried to put any kind of algorithm together.

How big are the integers? I might use a very different algorithm for 8-bit integers versus 64-bit integers. How big is the array? An approach that uses an auxiliary data structure could get quite expensive if the array is large. Do I know anything about the distribution of integers? That is, even if the values are stored in a 64-bit integer, are the actual values clustered over a small range? Do I know anything about the expected number of unique values? Even if the original array is large, a small number of unique values could impact the choice and effectiveness of certain algorithms. Can I use additional storage? The function might be used in a context where use of additional storage (storage proportional to the input) is problematic. Do I need to maintain the original ordering of the unique values or can I use an approach that re-orders them?

How is the function going to be used? Will it be part of a library of functions, where I cannot make many assumptions about how it will be used? Is it just one of the hundreds of small algorithms in an application, where clarity and simplicity are most important? Or is it the core algorithm that determines the overall performance of my application (perhaps it is part of a real-time data pipeline)?

A given candidate might identify just a few of these factors, and candidates vary considerably in how well they can explore how these factors might impact a possible solution as I give some hints along the way.

As we get into actually exploring algorithms, you start to get a sense of how much real coding experience the candidate has. Some suggest going through and "deleting" elements from the array without recognizing that they have just described a potentially O(N²) algorithm. Others talk about placing a "sentinel" value in the array for the deleted elements (without a good sense of why this would be a useful intermediate step or whether the concept even exists in their language of choice).

Most decent candidates will eventually get to the basic framework below for a solution (in pseudo-code, which is just fine):

function RemoveDups(a: array of integer): integer
{
    integer unique_index = 0;
    integer current_index = 0;

    for (; current_index < a.length; current_index++)
        if (is_unique(a[current_index]))
        {
            remember(a[current_index]);
            a[unique_index++] = a[current_index];
        }

    return unique_index;
}

That is, iterate over the array and copy the unique elements to the front of the array, returning the number of unique elements found.
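In C++, that framework might look something like the sketch below (an illustration for this write-up, not code from an actual interview or from the repository mentioned later). The mark_seen parameter stands in for whatever uniqueness mechanism eventually gets plugged in.

    #include <cstddef>
    #include <vector>

    // Walk the array and compact first occurrences to the front. mark_seen is
    // any callable that returns true the first time it sees a value and false
    // on repeats.
    template <typename MarkSeen>
    std::size_t RemoveDups(std::vector<int>& a, MarkSeen mark_seen) {
        std::size_t unique_index = 0;
        for (std::size_t current_index = 0; current_index < a.size(); ++current_index) {
            if (mark_seen(a[current_index])) {
                a[unique_index++] = a[current_index];  // keep this first occurrence
            }
        }
        return unique_index;  // count of unique values now at the front of a
    }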

The trick then becomes how to determine whether an element is unique. The rise of rich base class libraries in virtually all languages taught in schools means that many candidates will jump to using a set class to record and test for the unique integers. That was fine by me, but I then tried to dig into whether they understood what the implications of that might be, in particular on memory usage. This is where most candidates, having written a linear algorithm, are simply convinced there is no way they can improve on it, unfortunately demonstrating little intuition about the impact of constant-factor overheads and heavy memory use.
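A minimal sketch of that approach, plugging a lambda over std::unordered_set into the framework above (illustrative only, and the function name is just for this example):

    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    // Uniqueness via std::unordered_set: insert() returns a pair whose .second
    // member is true only when the value was newly added.
    std::size_t RemoveDupsWithSet(std::vector<int>& a) {
        std::unordered_set<int> seen;
        seen.reserve(a.size());  // worst case: every value is unique
        return RemoveDups(a, [&seen](int v) { return seen.insert(v).second; });
    }

The reserve call makes the issue plain: the auxiliary structure can grow as large as the input itself, on top of the per-bucket and per-node overheads of the hash set.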

I would also ask them about approaches that did not use additional storage. Some candidates would come up with the idea of sorting the array first, reducing the problem to "remove the duplicates from a sorted array," which is simpler: essentially the framework above with just a few careful end conditions. Alternatively, they could just use the set of unique elements they have already gathered at the front of the array to test for uniqueness (worst case O(N²)).
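Sketches of those two storage-free variants (again illustrative, not the exact code I measured):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Sort first, then squeeze duplicates out of the now-sorted array.
    // This reorders the values and costs O(N log N) for the sort.
    std::size_t RemoveDupsBySorting(std::vector<int>& a) {
        std::sort(a.begin(), a.end());
        std::size_t unique_index = 0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            if (i == 0 || a[i] != a[unique_index - 1]) {
                a[unique_index++] = a[i];
            }
        }
        return unique_index;
    }

    // Use the unique prefix itself as the set: scan a[0..unique_index) for each
    // element. Worst case O(N*U) time, but no allocation and the original order
    // of first occurrences is preserved.
    std::size_t RemoveDupsInPlace(std::vector<int>& a) {
        std::size_t unique_index = 0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            bool seen = false;
            for (std::size_t j = 0; j < unique_index; ++j) {
                if (a[j] == a[i]) { seen = true; break; }
            }
            if (!seen) a[unique_index++] = a[i];
        }
        return unique_index;
    }

The first variant gives up the original ordering; the second keeps it, allocates nothing, and its inner scan stays cheap as long as the number of unique values is small.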

I came back from one campus trip having gone through about thirty of these interviews, wishing that I had had more concrete data to use in the discussion. So I sat down and decided to code up a few different approaches and then compare their running times and memory usage across a number of different input test scenarios.

The C++ code is up on GitHub. I tested a number of different implementations. The first, baseline case was just to go through and touch every element in the array to get a sense of what "O(N)" really was. The next version first sorted the array (using the standard qsort implementation) and then walked through it, compressing out duplicate elements. The next version was the nominally O(N²) algorithm that tests for uniqueness by walking through the set of unique entries that have been found so far and moved to the front of the array; really this is not O(N²) but O(N*U), where U is the number of unique values. The next version was a trie implementation I wrote that provides an efficient set when the values are clustered. The final version just used the std::unordered_set implementation that comes with Visual Studio C++.
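To give a flavor of the trie idea (a simplified, hypothetical sketch for 32-bit keys, not the implementation in the repository): split each value into four 8-bit digits, use the upper three digits to walk a small tree of nodes, and record the last digit in a 256-bit bitmap at the leaf. Clustered values share their upper digits, so the structure stays compact for clustered inputs.

    #include <array>
    #include <cstdint>
    #include <memory>

    // A simplified trie-based integer set (sketch only, 32-bit keys).
    class IntTrie {
        struct Leaf { std::array<uint64_t, 4> bits{}; };               // 256-bit bitmap for the low byte
        struct Mid  { std::array<std::unique_ptr<Leaf>, 256> child; }; // indexed by the third byte
        struct Top  { std::array<std::unique_ptr<Mid>, 256> child; };  // indexed by the second byte
        std::array<std::unique_ptr<Top>, 256> root_;                   // indexed by the high byte

    public:
        // Inserts value; returns true only if it was not already present.
        bool insert(uint32_t value) {
            auto& top = root_[(value >> 24) & 0xFF];
            if (!top) top = std::make_unique<Top>();
            auto& mid = top->child[(value >> 16) & 0xFF];
            if (!mid) mid = std::make_unique<Mid>();
            auto& leaf = mid->child[(value >> 8) & 0xFF];
            if (!leaf) leaf = std::make_unique<Leaf>();
            const unsigned low = value & 0xFF;
            const uint64_t mask = uint64_t{1} << (low & 63);
            uint64_t& word = leaf->bits[low >> 6];
            const bool is_new = (word & mask) == 0;
            word |= mask;
            return is_new;
        }
    };

Used with the framework above, the uniqueness test is just trie.insert(value). Note that in this sketch every interior node is a 256-entry pointer array, so randomly distributed values force many sparse nodes into existence while clustered values share a handful of them.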

I found the results interesting and surprising in places. Deciding how to test the code probably required more careful thinking than the actual implementation (which is not unusual). My tests look at a range of input array sizes from 2⁴ to 2²² and, for each size, a number of unique values spanning that same range. The unique values can then either be clustered together or randomly distributed over the entire range of integers.
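The input generator might look something like this (a sketch of the idea rather than the harness in the repository; the function name is just for illustration): choose a pool of candidate values, either clustered in a small contiguous range or spread at random over the full integer range, then fill the array by sampling from that pool.

    #include <cstddef>
    #include <cstdint>
    #include <random>
    #include <vector>

    // Build a test array of array_size elements drawn from num_unique candidate
    // values, either clustered near zero or scattered across the integer range.
    // (Sampling may leave a few candidates unused, so the actual unique count
    // can come out slightly lower for small arrays.)
    std::vector<int> MakeTestInput(std::size_t array_size, std::size_t num_unique,
                                   bool clustered, std::mt19937& rng) {
        std::vector<int> candidates(num_unique);
        if (clustered) {
            for (std::size_t i = 0; i < num_unique; ++i)
                candidates[i] = static_cast<int>(i);              // packed together
        } else {
            std::uniform_int_distribution<int32_t> any_int(INT32_MIN, INT32_MAX);
            for (auto& v : candidates) v = any_int(rng);          // spread over full range
        }
        std::vector<int> a(array_size);
        std::uniform_int_distribution<std::size_t> pick(0, num_unique - 1);
        for (auto& v : a) v = candidates[pick(rng)];              // draw each element
        return a;
    }

Some of what I found: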

  • The sort results sure look linear in the size of the input.
  • Sorting can change in speed by a factor of 4 depending on the number of duplicates. It is much faster to sort an array with lots of duplicates (presumably because the final phases of the sorting process can be eliminated).
  • But sorting is a factor of 25–90 slower than the baseline linear walk. That is, the sorting approach scaled linearly with the size of the input but was far slower than a simple linear walk over the data.
  • My naïve trie implementation beats the std::unordered_set implementation until the number of unique elements in the set reaches 2¹⁶ or greater. That surprised me a little. With clustered values (even with a large number of unique values), the trie is far better than the set implementation in both memory size and speed (obviously related issues).
  • The trie has much better memory behavior than std::unordered_set for clustered values and is only around 2 times worse for unclustered values (e.g. the trie maxes out at 356MB while the set maxes out at 146MB for the auxiliary data structure).
  • The O(N²) algorithm beats all of the other algorithms if there are 16 or fewer unique values (for any size of input array) and beats the sorting approach all the way up to 256 unique elements. With those unique elements in L1 cache, a linear walk over them is still very fast. It then gets much slower than any other algorithm as the number of unique elements grows large (as expected).

It is very hard to really understand the overall performance impact of the large auxiliary data structures allocated by the set and trie implementations when there are a large number of unique values. The results get slower (by a factor of 5–10) as the number of unique elements grows very large (for the trie, only when they are randomly distributed rather than clustered), but the overall time reliably stays faster than the sorting approach, which requires no additional allocations.

It is notoriously hard in any really complex piece of software made of many components to understand the aggregate impact of additional memory use by one component in the system. That is one reason why a narrow focus on a particular performance scenario often leads a developer to decide that some additional caching layer is the solution, even if in the aggregate this leads to bloated memory usage that is more harmful than helpful. This is especially typical for developers of standard libraries or middleware who have only limited visibility into full end-to-end application behavior.

I will also note that I did not even look at possible ways to parallelize these algorithms and take advantage of the multiple cores available on the machine I used to run these tests. That would further complicate the implementation, the testing, and the analysis.

The fact that the analysis does get complicated and interesting is what makes this a fun problem for an interview. It is simple enough to allow many good candidates some success in working through the problem (whatever approach they take) while providing lots of fodder for an interesting discussion that can go in a lot of different directions.

It is also a good example of how complexity can scale in a system. There are hundreds of algorithms sprinkled through a complex application. They start off simple and naïve and then over time get tuned, often as part of some performance effort, to take advantage of particular characteristics of the operating environment and expected datasets. Those assumptions are often poorly recorded and validated, so you later see "rot" as the tuning done for one scenario ends up as mis-tuning for another. The most rigorous systems have both testing frameworks and operational telemetry to keep these assumptions validated, but this is inevitably hard to maintain over time.

