
How to Optimize Scala Code Performance With Almost No Effort

by Alex Slesarenko, March 30th, 2023

Too Long; Didn't Read

Scala provides powerful constructs and idioms to build abstractions for correctness and clarity, but they may come with runtime costs. When solving a task, there may be several alternative solutions, each with its own performance characteristics. The tips below can help you avoid accidental performance issues and write faster Scala code with almost no additional effort.

How is it even possible?

In this article, I describe Scala performance tips based on my experience of optimizing Ergo's smart contract interpreter. I found that some typical Scala code fragments have surprisingly faster alternatives. The tips below can help you avoid accidental performance issues and write faster Scala code with almost no additional effort. There is also source code with all the benchmarks.


Scala provides powerful constructs and idioms to build abstractions for correctness and clarity, but they may come with runtime performance penalties. When solving a task, there may be several alternative solutions, each with its own performance characteristics. Even similar-looking code can have radically different performance.


Teams often standardize things like formatting and documentation to keep the code consistent and of good quality. Similarly, tips on making the code run faster can be useful. The performance improvements described here require almost no special effort; they are a matter of coding style and habits rather than extensive code optimization, which shouldn't be premature anyway.


The speedups reported below are examples rather than exact numbers, as various factors can impact the timings. Nevertheless, after systematically applying the outlined recommendations, I have observed strictly positive effects on performance.

Performance tips

Tip 1: Avoid Seq()

Let's start with the simplest. It's not uncommon to see Seq() used wherever an empty collection is required. However, this call involves the following:


  1. an empty array is allocated
  2. the array is wrapped into an instance of ArraySeq
  3. the following methods are called: SeqFactory$Delegate.apply -> IterableFactory.apply -> List.from -> Nil.prependAll
  4. the Nil object is returned in the end

What to use instead

Simple Nil is 20-28% faster (see Benchmark: Seq), depending on whether the JIT is able to inline the Seq() invocation all the way down to Nil or not. So why not help it and use Nil directly?
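For example, a minimal before/after sketch (the method names are illustrative):

// Goes through the Seq.apply factory machinery described above
def tagsSlow: Seq[String] = Seq()

// Returns the Nil singleton directly
def tagsFast: Seq[String] = Nil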

Tip 2: Avoid Map()

The next one is Map(). It is often used wherever an empty Map is required. However, as with Seq(), this call involves allocations and many method calls.

What to use instead

Calling Map.empty is 25-85% faster, depending on the context (see Benchmark: Map).
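Again, a minimal before/after sketch (the method names are illustrative):

// Goes through Map.apply with an empty varargs sequence
def settingsSlow: Map[String, Int] = Map()

// Returns the shared empty Map instance directly
def settingsFast: Map[String, Int] = Map.empty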

Tip 3: Avoid for comprehensions

The looping pattern for (x <- xs) { ... } is used quite often due to its convenience. It looks very similar to the for loop in Java, but it is not the same.

In Scala, it is desugared by the compiler to xs.foreach { x => ... }, which, besides executing the block of code, involves the following overheads:


  1. a foreach method call
  2. allocation of a lambda object
  3. boxing (memory allocation) of the value x for every item of xs (if the values are not yet boxed)
  4. the hidden overhead of the concrete foreach implementation


Such for comprehensions are particularly sub-optimal (see Benchmark: for over Range) for loops over a Range, where x is an Int.
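To make the desugaring concrete, here is a rough sketch of what the compiler turns a Range loop into (illustrative code, not the exact compiler output):

def sumUpTo(n: Int): Int = {
  var sum = 0
  // `for (x <- 0 until n) { sum += x }` is compiled to roughly this:
  (0 until n).foreach { x =>  // extra method call + lambda object
    sum += x
  }
  sum
}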

What to use instead

There is the cfor macro from the spire library.

I recommend the following code as a replacement when xs provides O(1) indexed access, for example when xs is an Array or an array wrapped into a Seq.

import spire.syntax.cfor._

cfor(0)(_ < xs.length, _ + 1) { i =>
  val x = xs(i)
  // ...
}


The above xs collection can also be a Range, in which case for (x <- 0 until n) { ... } can be replaced with

cforRange(0 until n) { x => ... }


These cfor macros are compiled to efficient Java-style loops without the overheads 1)-4) listed above. Depending on xs.length, this is 10-170x faster (see Benchmark: for over Range). And since foreach already implies a side-effecting operation, cfor doesn't make the code any less readable.


Note, however, that cfor is not a silver bullet. It is NOT a replacement for foreach on non-indexed collections, where indexed access is NOT a fast O(1) operation. In many other cases, using cfor instead of foreach is a no-brainer and often increases the performance of code which loops over an indexed collection (see, for example, Benchmark: dot-product and Benchmark: check brackets).
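For instance, here is a hedged sketch of a dot product in both styles (illustrative code, not necessarily the exact benchmark implementation):

import spire.syntax.cfor._

// Collection-style version: allocates an array of tuples, then maps and sums
def dotSlow(xs: Array[Double], ys: Array[Double]): Double =
  xs.zip(ys).map { case (x, y) => x * y }.sum

// Indexed cfor version: a single pass with no intermediate allocations
def dotFast(xs: Array[Double], ys: Array[Double]): Double = {
  var acc = 0.0
  cfor(0)(_ < xs.length, _ + 1) { i => acc += xs(i) * ys(i) }
  acc
}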

Tip 4: Avoid creating sequences with Seq(...)

It is tempting to use the Seq.apply method wherever a Seq of items is required, e.g. Seq(1, 2, 3), because it is easy and concise. You can pass it as a method argument or return it as a method result.

However, the following happens under the hood:


  1. a new Array with the items 1, 2, 3 is created, and each item is boxed into its own Integer object
  2. a WrappedArray#ofInt wrapper is created and passed as the vararg argument of Seq.apply
  3. the above-mentioned methods, down to Nil.prependAll, are called
  4. a new ListBuffer is created and all items are copied from the array into the buffer


Benchmarks show that the JIT is not able to fully optimize this code.

What to use instead

The simple drop-in replacement Array(1, 2, 3) performs better. This code is 4-10x faster than Seq(1, 2, 3) (see Benchmark: Seq vs Array (from elems)). The speedup is even more significant as the number of items in the sequence grows (see Benchmark: Seq vs Array (from Range)).
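For example, a hedged sketch (firstOrZero is a hypothetical helper; it takes scala.collection.Seq so that both a Seq and an implicitly wrapped Array are accepted):

object ArrayVsSeq {
  // Read-only access is enough here, so accept the general collection.Seq
  // super-trait; then both calls below type-check.
  def firstOrZero(xs: scala.collection.Seq[Int]): Int =
    if (xs.isEmpty) 0 else xs(0)

  val viaSeq   = firstOrZero(Seq(1, 2, 3))   // boxes items and builds a List
  val viaArray = firstOrZero(Array(1, 2, 3)) // one unboxed array, wrapped implicitly
}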


What happens here is that:


  1. a native unboxed Java array is created, and then

  2. it is wrapped via an implicit conversion into a WrappedArray#ofInt, which inherits from the Seq trait and can be used directly.


Why is this faster?


  1. It avoids unnecessary allocations (only two new objects are allocated in the Array case). Note that to create the Seq, not only is each Int boxed into a java.lang.Integer (or the corresponding wrapper of another primitive type), but a scala.collection.immutable.:: instance is also created for each item of the list.

  2. Creating an Array avoids boxing (which is proportional to the size of the Seq).

  3. Accessing array items is cache-friendly, both when the array is created and when it is used later.

  4. Fewer allocations mean less garbage to collect later. This is especially important when the application is multithreaded, because in that case the garbage collector competes with application threads for CPU resources, further slowing down the application.


Notice that using arrays like this doesn't make the code any less readable. Moreover, the arrays created in this way are only accessible via the Seq interface, which is immutable. Thus, better performance is achieved without sacrificing other desirable code qualities like readability, safety, and conciseness.


Caveat: the array is not efficient if it is accessed using the list operations head and tail. See, for example, the Seq vs Array (list ops) benchmark. In this case, replacing Seq with Array may lead to performance degradation. However, using lists in performance-critical code may not be a good idea either.
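A hedged illustration of the caveat (sumHeadTail is a hypothetical helper):

// Consuming a sequence with head/tail: O(1) per step on a List, but each
// `tail` on an array-backed Seq typically builds a new collection, so the
// whole traversal degrades towards O(n^2).
@annotation.tailrec
def sumHeadTail(xs: scala.collection.Seq[Int], acc: Int = 0): Int =
  if (xs.isEmpty) acc else sumHeadTail(xs.tail, acc + xs.head)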

Tip 5: Avoid extractors with Option

An extractor is an object with an unapply method that takes an arbitrary argument and returns an Option of some (possibly different) type. Here is a simple example:

object IsEven:
  def unapply(n: Int): Option[Int] =
    if (n % 2 == 0) Some(n) else None


With extractors, pattern matching can be concise and expressive.
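For example, IsEven can be used in a match like this (describe is a hypothetical method):

def describe(n: Int): String = n match
  case IsEven(m) => s"$m is even"
  case _         => s"$n is odd"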

However, an unapply returning an Option can have a negative impact on runtime performance, because an instance of Some needs to be created for each successful extraction.

What to use instead

Since version 2.11, Scala supports name-based extractors, which no longer require unapply to return an Option. Combined with value classes, these two features allow implementing allocation-free extractors, which are 10-70% faster than the original Option-based extractors (see Benchmark: Option vs Opt).


object IsEvenOpt:
  def unapply(n: Int): Opt[Int] =
    if (n % 2 == 0) Opt(n) else Opt.empty


A suitable implementation of the Opt class can be found in the spire library.
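For intuition, here is a minimal sketch of how such an allocation-free result type can look. This is not spire's actual Opt, just an illustration that a name-based extractor only needs isEmpty and get on the returned value:

// A value class erased to a Long at runtime; Long.MinValue encodes "empty".
final class MaybeInt(val bits: Long) extends AnyVal:
  def isEmpty: Boolean = bits == Long.MinValue
  def get: Int = bits.toInt

object MaybeInt:
  def apply(n: Int): MaybeInt = new MaybeInt(n.toLong)
  def empty: MaybeInt = new MaybeInt(Long.MinValue)

// Name-based extractor: no Option, no Some allocation on success.
object IsEvenFast:
  def unapply(n: Int): MaybeInt =
    if n % 2 == 0 then MaybeInt(n) else MaybeInt.empty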

Benchmarks Summary

The following tables summarize the results of the benchmarks run on the code snippets above with the following configuration: osArch: aarch64; osName: Mac OS X; JVM: OpenJDK 64-Bit Server VM; vendor: Eclipse Adoptium; version: 17.0.5+8.


You can reproduce the results on your computer by running the Benchmarks class.

Clone the repository and run sbt "Test/runMain benchmarks.Benchmark".

Benchmark: Seq

Size   | Seq(), ms | Nil, ms   | Speedup
10     | 0.0001250 | 0.0001250 | 1.00
100    | 0.0002080 | 0.0002080 | 1.00
1000   | 0.0007500 | 0.0006250 | 1.20
10000  | 0.0065830 | 0.0051250 | 1.28
100000 | 0.0625830 | 0.0502920 | 1.24

Benchmark: Map

Size   | Map(), ms | Map.empty, ms | Speedup
10     | 0.0001250 | 0.0001250     | 1.00
100    | 0.0002080 | 0.0001660     | 1.25
1000   | 0.0010410 | 0.0006250     | 1.67
10000  | 0.0091670 | 0.0050830     | 1.80
100000 | 0.0902500 | 0.0487080     | 1.85

Benchmark: foreach vs cfor

Size   | foreach, ms | cfor, ms  | Speedup
10     | 0.0006660   | 0.0000830 | 8.02
100    | 0.0007500   | 0.0000830 | 9.04
1000   | 0.0025830   | 0.0001250 | 20.66
10000  | 0.0238330   | 0.0003330 | 71.57
100000 | 0.2244170   | 0.0042910 | 52.30

Benchmark: for over Range

Size   | foreach, ms | cforRange, ms | Speedup
10     | 0.0004160   | 0.0000830     | 5.01
100    | 0.0009160   | 0.0000830     | 11.04
1000   | 0.0061250   | 0.0001250     | 49.00
10000  | 0.0569170   | 0.0003330     | 170.92
100000 | 0.5648750   | 0.0042500     | 132.91

Benchmark: Seq vs Array (from elems)

Size   | Seq, ms   | Array, ms | Speedup
10     | 0.0007910 | 0.0002080 | 3.80
100    | 0.0066250 | 0.0008340 | 7.94
1000   | 0.0661660 | 0.0160000 | 4.14
10000  | 0.6733330 | 0.0679170 | 9.91
100000 | 6.8687920 | 0.6750420 | 10.18

Benchmark: Seq vs Array (from Range)

Size   | Seq, ms    | Array, ms | Speedup
10     | 0.0120420  | 0.0066250 | 1.82
100    | 0.0836670  | 0.0160830 | 5.20
1000   | 0.8315420  | 0.0533340 | 15.59
10000  | 8.1187080  | 0.5097500 | 15.93
100000 | 81.0221660 | 4.5335830 | 17.87

Benchmark: Seq vs Array (list ops)

Size  | Seq, ms   | Array, ms  | Speedup
1     | 0.0003330 | 0.0003750  | 0.89
10    | 0.0005840 | 0.0007500  | 0.78
100   | 0.0030410 | 0.0042500  | 0.72
1000  | 0.0295000 | 0.1258330  | 0.23
10000 | 0.2695000 | 15.5551250 | 0.02

Benchmark: Option vs Opt

Size    | Option, ms | Opt, ms   | Speedup
100     | 0.0004160  | 0.0003750 | 1.11
1000    | 0.0014160  | 0.0009160 | 1.55
10000   | 0.0117910  | 0.0069580 | 1.69
100000  | 0.1155420  | 0.0666670 | 1.73
1000000 | 1.1547090  | 0.6657910 | 1.73

Benchmark: dot-product

Size   | slow, ms  | fast, ms  | Speedup
10     | 0.0050000 | 0.0011250 | 4.44
100    | 0.0059160 | 0.0016660 | 3.55
1000   | 0.0242090 | 0.0073750 | 3.28
10000  | 0.1297920 | 0.1235000 | 1.05
100000 | 1.4926660 | 0.8446250 | 1.77

Benchmark: check brackets

Size  | slow, ms  | fast, ms  | Speedup
1     | 0.0007080 | 0.0002500 | 2.83
10    | 0.0017080 | 0.0005000 | 3.42
100   | 0.0077500 | 0.0027090 | 2.86
1000  | 0.0753340 | 0.0260000 | 2.90
10000 | 0.7195830 | 0.2587080 | 2.78

Concluding remarks

Program performance upkeep and optimization is the result of constant vigilance and work, rather than a one-time job. There is no one-size-fits-all solution, as there are many trade-offs along the way.


The tips presented in this article are so simple that they can be applied by default, regardless of whether you have already identified a bottleneck or not. In my experience, once I systematically applied these tips all over the code base, those parts of the code disappeared from the profiler's hot spots, which surfaced other bottlenecks and simplified profiling.


For further reading, I collected a list of useful references on the topic of Scala optimization. Happy optimizing!

References

  1. Scala Style Guide
  2. Scala High Performance Programming
  3. Optimizing Higher-Order Functions in Scala (somewhat outdated)
  4. Where to look first when optimizing Scala code?
  5. Scala for comprehension performance
  6. Performance characteristics of Scala collections
  7. Java Performance: The Definitive Guide: Getting the Most Out of Your Code
  8. Scala library benchmarks
  9. JITWatch
  10. Parallel Collections: Measuring Performance
  11. JVM JIT optimization techniques
  12. Name Based Extractors in Scala 2.11