Counting lines 60% faster than wc, with Clojure + core.async by@atroche
2,353 reads

Counting lines 60% faster than wc, with Clojure + core.async

Read on Terminal Reader

Too Long; Didn't Read


Company Mentioned

Mention Thumbnail
featured image - Counting lines 60% faster than wc, with Clojure + core.async
Alistair Roche HackerNoon profile picture

@atroche

Alistair Roche
react to story with heart

Yesterday I noticed that my laptop’s SSD is so fast, any single core can’t keep up — even with as simple a task as counting lines in a file, with as battle-tested a tool as wc.

This evening, after taking on lots of advice from the lovely commenters on the reddit thread, I found a better way.

The key idea is to have one thread doing absolutely nothing except:

  • reading in data to a byte array, and
  • passing the array to another thread to do the “processing” (if you can glorify newline-counting with that word).

wc, the venerable unix tool, counts lines in the same thread (check out this I sketched up yesterday from the BSD wc source).

Clojure’s default way of reading in lines (line-seq, which uses BufferedReader under the hood) performs UTF decoding on the same thread, which (as I noticed yesterday) is a real bottleneck when your SSD is pumping through well over a gigabyte of data per second.

And the one in my mid-2015 MBP is nowhere near top-of-the-line anymore:

image

The parallel version I coded up tonight takes an average of 6.9 seconds to finish, which is around a 60% improvement. Not only that, but it’s not using the ugly low-level threading and concurrency primitives of the JVM, but the fairly high-level abstraction of core.async.

(I also tested it against much smaller files and found a similar speedup).

Here’s the code:

It’s getting late here, so I’m not going to dive into it, but it just implements that core idea I mentioned above of keeping one thread dedicated to doing the system call and little else. Here’s the . If you have questions or comments, jump over to the /r/clojure reddit thread.

It seems like this issue of CPU-bound IO is going to become more important as we move towards faster SSDs, more cores, but static clock speeds. I wonder how standard libraries (like Clojure’s and Java’s) will adapt to keep up with the pace?

[edit: check out the awesome unix-y ways of dealing with multi-core IO people came up with in the /r/programming thread]

[edit2: here’s , if you want to try running or modifying it. and if you want to generate a test file, try: “head -c 10G /dev/random > random.txt”]

RELATED STORIES

L O A D I N G
. . . comments & more!
Hackernoon hq - po box 2206, edwards, colorado 81632, usa