Photo by Margaret Weir on Unsplash

“We spend a lot of time waiting for some data preparation task to finish — the destiny of data scientists, you would say.”

So started an article I published two years ago. The article went on to show how we could speed things up by using techniques such as parallelization and memory-mapped files. All that because doing the same task with pandas would have taken hours.

Now I have implemented the same task with the new Polars library for Python. The results? Well, I am blown away.

TL;DR

Polars can crunch the input (23 GB) in less than 80 seconds. My old script needed 15 minutes, chakka!

But let’s look at it one thing at a time, shall we?

The data

As often happens with large data, we have to take care of input quality. I wanted to filter out OCR defects from the Google Books 1-Ngram dataset. The data files contain frequencies of how often a term appears in the scanned books, on a per-year basis. Summing over the years then gives us a good grip on suspect words (those that appear rarely) — “apokalypsc” appears only 41 times, “sngineering” 381 times. In contrast, the word “air” appears 93,503,520 times, “population” 71,863,291 times.

In total, there are 27 files (one per letter) comprising 1,201,784,959 records (yes, over one thousand million records to crunch through, 23 GB uncompressed).

DataFrames in Polars

Polars is a data processing and analysis library written entirely in Rust, with APIs in Python and Node.js. It is the new kid on the block competing with established top dogs such as pandas. It comes fully equipped with support for numerical calculations, string manipulation, and data frame operations like filtering, joining, intersection, and aggregations such as groupby. Polars has achieved honors in benchmarks, as shown here and here.

Back to our task. This is the script implementing the logic described above for processing one file:

def process_file(file):
    global basepath, stopwords
    not_word = r'(_|[^\w])'
    # define what we are reading (only cols 0 and 2, and name them)
    df = pl.read_csv(basepath+file, sep="\t", columns=[0,2], new_columns=['word','count'])
    # filter out terms with non-alphabetical characters ...
    df = df.filter(pl.col("word").str.contains(not_word).is_not())
    # ... and eliminate terms shorter than 3 chars
    df = df.filter(pl.col("word").str.lengths() > 2)
    # ... and also stop words
    df["word"] = df["word"].str.to_lowercase()
    df = df.filter(pl.col("word").is_in(stopwords).is_not())
    # sum the counts per term and sort by the sum, descending
    df = df.groupby('word')['count'].sum().sort(by='count_sum', reverse=True)
    # select only terms that appear more than 20,000 times in the books
    good = df.filter(pl.col("count_sum") > 20000)
    # output a csv file
    print(f"out_{file}, {len(good)} terms")
    good.to_csv(f'out_{file}.csv', sep='\t', has_header=False)

The input format for each file is:

ngram TAB year TAB count TAB volume_count NEWLINE

In a nutshell, we apply some heuristic filters and sum the count for each term over all records. Finally, we output only terms that appear more often than a given threshold (a test number).

The syntax for working with data frames in Polars bears similarity to the syntax in pandas, but only to a certain extent. Polars has a chained expression syntax that makes it very … well, expressive. I liked that a lot. I must admit, though, that without Stack Overflow I would never have come up with pl.col("colname") to address the Series data structure storing each column in the data frame 😉
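To make the expression style concrete, here is a small toy example of my own (not part of the original script), assuming the same Polars version used above: expressions built with pl.col can be combined with boolean operators such as &, so two of the filters from process_file collapse into a single filter call.

import polars as pl

# toy data, loosely modeled on the 'word' / 'count' columns read above
toy = pl.DataFrame({
    "word":  ["air", "apokalypsc", "foo_bar", "of"],
    "count": [10, 1, 2, 50],
})

not_word = r'(_|[^\w])'
clean = toy.filter(
    pl.col("word").str.contains(not_word).is_not()  # no underscores or non-word characters
    & (pl.col("word").str.lengths() > 2)            # at least three characters
)
print(clean)

With these toy rows, only “air” and “apokalypsc” survive both conditions; the whole predicate is a single declarative expression that Polars can optimize as one unit.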
What makes Polars so fast?

In this interview, Ritchie Vink, the creator of Polars, gives some insight into what happens behind the scenes. Parallelization happens in the underlying layers in Rust. A lot of thought went into optimizing for CPU caches and multi-core designs. The use of the Arrow2 framework for columnar data also helped to speed things up. But now we see something new:

“The most inspiration came from database systems, mostly how they define a query plan and optimize that. This gave way for Polars expressions, and those are the key selling point of our API. They are declarative, composable and fast.”

This quote caught my attention. See, dealing with a large data frame resembles accessing rows/cols in a database.

“Behind the scenes we have copy-on-write, so generally copies, which are expensive in RAM and speed, don’t have to happen unless you modify the data - the data itself is immutable. All of this happens in the Rust layer, using Rust threads (which you don’t see from the Python frontend), so running low on RAM is much less of an issue compared to Pandas.”

Voilà.

Parallelizing the Input Pipe

Processing 27 input files doesn’t have to happen sequentially 😃 I use the Python multiprocessing library to keep four processes running the script above at any given time (my Mac mini has four cores and 32 GB of memory). The script is available here. (A rough sketch of this driver pattern follows at the very end of the post.)

Thank you for reading, I hope you found it interesting. Comments and suggestions are always welcome!
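PS: as referenced above, here is a rough sketch of the four-process driver. It is my own illustration, not the linked script; it assumes that process_file (and its basepath and stopwords globals) is defined in the same module, and that the files list is filled with the 27 input file names.

from multiprocessing import Pool

files = []  # fill in the 27 per-letter input file names

if __name__ == "__main__":
    # a Pool keeps at most four worker processes busy at any given time
    with Pool(processes=4) as pool:
        pool.map(process_file, files)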