Freeing the Data Scientist's Mind from the Curse of Vectorization - Paging Julia for Rescue

Written by dmoura | Published 2019/08/20

Nowadays, most data scientists use either Python or R as their main programming language. That was also my case until I met Julia earlier this year. Julia promises performance comparable to statically typed compiled languages (like C) while keeping the rapid development features of interpreted languages (like Python, R or Matlab). This performance is achieved by just-in-time (JIT) compilation.
Instead of interpreting code, Julia compiles code at runtime. While JIT compilation has been around for some time now (e.g. Matlab introduced it in 2002), Julia was designed for performance with JIT compilation in mind. Type stability and multiple dispatch are key design concepts in Julia that set it apart from the competition. There is a very nice notebook by the Data Science Initiative at the University of California that explains these concepts if you want to learn more.
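As a minimal illustration of multiple dispatch (my own example, not from that notebook): Julia selects a method based on the runtime types of all arguments, and the JIT compiles a specialized native version for each concrete signature. Both methods below are also type-stable, since the return type is fully determined by the input types.
half(x::Integer) = div(x, 2)    #integer division for integer inputs
half(x::AbstractFloat) = x / 2  #floating-point division for floats

half(10)   #dispatches to the Integer method, returns 5
half(10.0) #dispatches to the AbstractFloat method, returns 5.0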
Somewhere in time, we started using interpreted languages for handling large datasets (I guess datasets grew bigger and bigger and we kept using the same tools). We learned that, for the sake of performance, we want to avoid loops and recursion.
Instead, we want to use vectorized operations or specialized implementations that take data structures (e.g. arrays, dataframes) as input and handle them in a single call. We do this because in interpreted languages we pay an overhead each time we execute an instruction. While I was happy coding in R, it involved keeping a set of strategies for avoiding loops and recursion, and much of the effort went into answering “how do I avoid the pitfalls of an interpreted language?”.
I got to a point where I was coding C functions to tackle bottlenecks on my R scripts and, while performance clearly improved, the advantages of using R were getting lost in the way. That was when I started looking for alternatives and I found Julia.
In this post, we will start by solving a simple problem in R where I will try to illustrate the mindset and limitations when programming in interpreted languages. Then, we will solve the same problem with Julia, showing how the mindset differs completely and how C-like performance can be achieved out of the box.

The Vectorization pathway

Sometimes it is not clear how to get the best performance using vectorization.
Let us consider the problem of finding the distances between all combinations of pairs of points, given a vector of points. For the sake of simplicity, points have one dimension and we will use the L1-distance. Given the input [5, 3, 9, 1] the expected output is [2, 4, 4, 6, 2, 8] (resulting from |5-3|, |5-9|, |5-1|, |3-9|, |3-1| and |9-1|).
This problem is solved in R using the dist function of the stats package:
> as.vector(dist(c(5, 3, 9, 1), method="manhattan"))
[1] 2 4 4 6 2 8
R’s implementation returns a distance matrix that we convert into a vector.
Let us pretend that dist is not available. How would you code it in R?
Solving this problem with a loop-based approach is straightforward: we need an outer loop for iterating over the first element of the pair and an inner loop for the second element.
Since the L1-distance is symmetric (|a-b| = |b-a|), we just need to do this for half the combinations, while avoiding calculating the distances of points to themselves (the diagonal of the matrix). A loop-based implementation in R would look like this:
distances_among_pairs_loop <- function(v) {
  nin = length(v)
  nout = (nin * nin - nin) %/% 2 #length of the output vector
  dists = vector(mode=typeof(v), length=nout) #preallocating output
  k = 1 #current position in the output vector
  for (i in seq_len(nin-1)) {
    a = v[i]
    for (b in v[seq(i+1, nin)]) {
      dists[k] = abs(a-b)
      k = k + 1
    }
  }
  dists
}
If you program in R, you would say that this is not proper R code… loops are slow and produce very verbose code… there should be a way of vectorizing.
One way to drop loops is by generating vectors with all combinations of pairs. Basically, we are finding the (row, column) coordinates of the lower triangle of the distance matrix.
Then, we can calculate the distances with a single vectorized operation. We know that we are paying a memory penalty, but we hope that the vectorization pays off.
distances_among_pairs_rep <- function(v) {
  row <- rep(1:length(v), each=length(v))  #e.g 1 1 1 2 2 2 3 3 3
  col <- rep(1:length(v), times=length(v)) #e.g 1 2 3 1 2 3 1 2 3
  lower_tri <- which(row > col) #e.g. (2,1) (3,1) (3,2)
  abs(v[row[lower_tri]] - v[col[lower_tri]])
}
Another alternative is to use R’s outer function for generating a matrix with the differences between all combinations of two points (including redundant combinations). Then, we only need to retrieve the absolute value of the lower triangular part of the matrix. We might feel reluctant because we are performing about twice as many operations (and storing them in memory), but this does result in cleaner and more readable code.
distances_among_pairs_outer <- function(v) {
  dists <- outer(v, v, '-')
  abs(dists[lower.tri(dists)])
}
As we travel along the vectorization pathway, the code is becoming more compact, but is it becoming faster? We are trading compactness for more memory and more operations… it is not trivial to predict which of these implementations is the most efficient.
We could go on and on trying to figure out the best way to avoid the pitfalls of R when we already have a simple loop-based solution (which, in a compiled language, would be difficult to beat).
So, if this function is performance-critical, it might make sense to implement it in C, C++ or Fortran. A C++ translation of the original implementation, integrated with R via the Rcpp library, would look like this:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector distances_among_pairs_cpp(NumericVector v) {
    int nin = v.size();
    int nout = (nin * nin - nin) / 2; //length of the output vector
    NumericVector dists(nout); //preallocating output vector
    for (int i=0,k=0;i<nin-1;i++) {
        double a = v[i];
        for (int j=i+1;j<nin;j++,k++) {
            double b = v[j];
            dists[k] = std::abs(a-b); //std::abs ensures the double overload (plain abs may bind to the integer version)
        }
    }
    return dists;
}
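If you want to try this from R, one convenient route (not shown in the original post) is to save the code to a .cpp file and load it with Rcpp::sourceCpp, which compiles it and exposes distances_among_pairs_cpp directly in the R session.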

Experiments

I ran the different implementations 3 times each with vectors of 10.000 random numbers (requiring 49.995.000 distance calculations when avoiding redundancy, 100.000.000 otherwise) and took the median CPU time and memory consumption. I used a 2017 MacBook Pro, 2.3 GHz Intel Core i5, 16 GB RAM running Mac OS 10.14 with R 3.6.0, Julia 1.0.3 and Xcode 10.1.
The loop-based implementation in R was the slowest, as expected (and would be much slower before version 3.4, when JIT compilation became available). By vectorizing, we decrease computation time but increase memory consumption, which can become a problem as the size of the input increases. Even after our vectorization efforts, we are still far from the performance of R’s dist function.
Rcpp allowed decreasing both computation time and memory requirements, outperforming R’s core implementation. This is not surprising, as R’s dist function is much more flexible, adding several options and input validation.
While it is great that we can inject C/C++ code into R scripts, we are now dealing with two programming languages, and we have lost the goodies of interactive programming for the C++ code.

Julia’s way

Julia’s programming mindset differs completely from R’s. The most efficient solution is attained by a loop-based approach that preallocates memory.
function distances_among_pairs_loop(v)
    nin = length(v)
    nout = div(nin * nin - nin, 2) #length of the output vector
    dists = Vector{eltype(v)}(undef,nout) #preallocating output
    k = 1 #current position in the output vector
    for i in 1:(nin-1)
        a = v[i]
        for j in (i+1):nin
            b = v[j]
            dists[k] = abs(a-b)
            k += 1
        end
    end
    dists
end
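As a quick sanity check (mine, not in the original post), running it on the example input from the beginning gives the expected result:
julia> distances_among_pairs_loop([5, 3, 9, 1])
6-element Array{Int64,1}:
 2
 4
 4
 6
 2
 8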
If you want to write less code, you can do so at the expense of computational efficiency. One way of compressing code is through comprehensions.
distances_among_pairs_comp(v) = 
    [abs(v[i]-v[j]) for i in eachindex(v), j in eachindex(v) if i<j]
The above comprehension is much more compact than the loop-based implementation while embodying the same logic. The main disadvantage of this approach is that the output vector is not preallocated. Since the compiler cannot predict the size of the output, the output grows dynamically as needed.
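Roughly speaking (this sketch is mine, not from the original post), the filtered comprehension behaves like a loop that grows an empty vector with push! whenever the condition holds:
function distances_among_pairs_push(v)
    dists = eltype(v)[] #empty output vector, no preallocation
    for i in eachindex(v), j in eachindex(v)
        i < j && push!(dists, abs(v[i]-v[j])) #output grows as needed
    end
    dists
end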
Finally, if you prefer a vectorized approach, you also have that choice in Julia. Translating R’s implementation based on the outer function would look like this:
using LinearAlgebra #tril! is defined in the LinearAlgebra standard library

function distances_among_pairs_outer(v)
    dists = v .- v' #matrix with all pairwise differences
    abs.(dists[tril!(trues(size(dists)), -1)]) #keep the strict lower triangle
end
Unlike in R, we expect this to be the least efficient approach since it requires more memory and unnecessary (redundant) operations.
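For the curious, a minimal benchmarking sketch (my own, assuming the BenchmarkTools package; not the harness behind the results reported next) could look like this:
using BenchmarkTools

v = rand(10_000) #same input size as in the experiments

@btime distances_among_pairs_loop($v)  #loop with preallocated output
@btime distances_among_pairs_comp($v)  #comprehension, dynamically grown output
@btime distances_among_pairs_outer($v) #vectorized, redundant work and memory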

Results

Julia stands out by delivering C-like performance out of the box. The tradeoff between code compactness and efficiency is very clear, with C-like code delivering C-like performance. Comprehensions are a good compromise as they are simpler to code, less prone to bugs, and equally memory-efficient for this problem.
When using R to tackle computationally intensive tasks, we hope to find a specialized function that solves our problem. If a specialized function is not available, we either need to program it in C/C++ or go down the vectorization pathway. If we choose the latter, we will probably end up far from the performance we would get from the former.

Closing thoughts

Vectorization, comprehensions, and map-filter-reduce are all great tools that can save you time and deliver more compact and readable code. However, programmers should not be forced to use them because of performance limitations of the programming language.
You do not want to spend your time experimenting with several ways of implementing a solution when there is a straightforward and otherwise efficient implementation.
Julia allows you to choose between code compactness and computational efficiency, making this tradeoff very clear. You can implement C-like loops, R-like vectorizations, or Python-like comprehensions.
It’s up to you. You can optimize the ratio between the time you need to code the solution and the time the solution takes to run. You can implement your own algorithms without relying on a second language.
Being much younger than Python and R, Julia is still fighting its way through the data science community. I loved Matlab when I started using it 13 years ago because of the way I could interact with data. I am now loving how I can write faster code faster, and how free I am to implement virtually any algorithm in Julia. Give it a go, I am sure you will love it too!
Code available at: github.com/dcmoura/blogposts
Originally published @ Towards Data Science on Medium
You can find me on Twitter, LinkedIn and Vimeo (check out my datavizes!).
