Listen to this story
Passionate Software Developer.
Programs may freeze for many reasons, such as software and hardware problems, limitations of particular technologies, concurrency programming failures, resource exhaustion, software bugs, issues with networks, and among others, inefficient algorithm implementations.
The last mentioned is often the root of the problem and may persist in the code for a long time. It may not be quite noticeable when input data is small, but it becomes a defining problem with larger data.
We’ll examine a case of such freeze caused by the inefficient implementation of some logic in Python. Let’s look at the code I’ve encountered in my work that has caused a long-lasting freeze every time it was run. It could have been avoided by taking into consideration the cost of basic operations in CPython.
Assume we’ve got a list of files we’d like to process, and a list of ignored files that should be excluded from the processing.
files = [f'/tmp/temp{i}.txt' for i in range(29_000, 69_000)]
ignored_files = [f'/tmp/temp{i}.txt' for i in range(30_000)]
To illustrate the problem, I’m using list comprehension to generate the lists. In practice, they were read from a file and passed to a method as parameters.
The following code was supposed to get us files, excluding the ignored ones.
files_to_be_processed = []
for file in files:
for ignored in ignored_files:
if file == ignored:
break
else:
files_to_be_processed.append(file)
The code was iterating over the list of files. For each file in the files, it was iterating over the ignored_files to find out whether the file should be ignored. Assuming that:
The code had O(m*n) time complexity for getting files_to_be_processed.
When we ran the code, it was freezing for about 40 seconds to finish processing. So, we had to wait for the program would come to life again. Because a freeze was repeating every run, the development became tedious and slow. In the case of frequent runs, the development process slowed down radically.
The running time is negligible when m << n. But when n is comparable to m, the running time becomes noticeable due to the current implementation, which gives roughly O(n²) time complexity while it could be linear O(n). Given the fact we had dozens of thousands of files, processing them took up to 1 minute in some cases. It appeared to be the case when the overlap of files and ignored_files was small. Because most items out of the files weren’t in the ignored list, each was iterating over the whole ignored_files.
Such a simple task as excluding files from processing should not be so time-consuming, even for dozens of thousands of files. It turned out we had the freeze because of the costly implementation of excluding files in Python.
We can reduce the execution time and transform O(n²) into O(n), taking into account the time-complexity of various operations in CPython. According to the docs, checking whether an item is a member of a set takes O(1) in the average case. So, we can use the set data structure for ignored_files to improve time complexity:
files_to_be_processed = []
unique_ignored_files = set(ignored_files)
for file in files:
if file not in unique_ignored_files:
files_to_be_processed.append(file)
# or using a list comprehension
unique_ignored_files = set(ignored_files)
files_to_be_processed = [f for f in files if f not in unique_ignored_files]
Here we’re checking if a file is in the unique_ignored_files for O(1). Because we have to iterate over all files, we get overall O(n) time complexity. It’s highly advantageous over the previous O(n²) solution, especially when the input data is large. As a result, It takes less than a second to prepare several dozens of thousands of files files_to_be_processed. You may run the example yourself to see the running time changes.
Alternatively, Python has the difference operator for sets that could further simplify the code. It returns a new set with elements in the left set that are not in the right set.
files_to_be_processed = set(files) - set(ignored_files)
According to difference time-complexity, it yields the same O(n). The resulting set files_to_be_processed can be converted to any particular data structure if needed.
It makes sense to pay attention to long-lasting freezes in the code as they may originate from the sub-optimal algorithm implementation. The problem often reveals itself for large inputs when the time complexity begins to matter. Having a grasp of the cost of some operations may help to detect and fix such issues.