paint-brush
Defeating Evil with Data Structures!by@james_32022
354 reads
354 reads

Defeating Evil with Data Structures!

by James BowenJune 18th, 2017
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

In <a href="https://www.mmhaskell.com/blog/2017/6/19/profiling-in-haskell" target="_blank">last week’s article</a>, we used benchmarks to determine how well our code performs on certain inputs. First we used the Criterion library to get some measurements for our code. Then we were able to look at those measurements in some spiffy output. We also profiled our code to try to determine what part of our code was slowing us down.

Company Mentioned

Mention Thumbnail
featured image - Defeating Evil with Data Structures!
James Bowen HackerNoon profile picture

In last week’s article, we used benchmarks to determine how well our code performs on certain inputs. First we used the Criterion library to get some measurements for our code. Then we were able to look at those measurements in some spiffy output. We also profiled our code to try to determine what part of our code was slowing us down.

The profiling output highlighted two functions that were taking an awful lot of time. When we analyzed them, we found they were very inefficient. In this article, we’ll resolve those problems and improve our code in a couple different ways. First, we’ll use an array rather than a list to make our value accesses faster. Then, we’ll add a cool data structure called a segment tree. This will help us to quickly get the smallest height value over a particular interval.

The code examples in this article series make good use of the Stack tool. If you’ve never used Stack before, you should check out our new FREE Stack mini-course. It’ll walk you through the basics of organizing your code, getting dependencies, and running commands.

What Went Wrong?

So first let’s take some time to remind ourselves why these functions were slowing us down a lot. Both our minimum height function and our value at index function ran in O(n) time. This means each of them could scan the entire list in the worst case. Next we observed that both of these functions will get called O(n) times. Thus our total algorithm will be O(n^2) time. The time benchmarks we took backed up this theory.

The data structures we mentioned above will help us get the values we need without doing a full scan. We’ll start by substituting in an array for our list, since that is a great deal easier.

Arrays

Linked lists are very common when we’re solving functional programming problems. They have some nice properties, and work very well with recursion. However, they do not allow fast access by index. For these situations, we need to use arrays. Arrays aren’t as common in Haskell as other languages, and there are a few differences.

First, Haskell arrays have two type parameters. When you make an array in Java, you say whether it’s an int array (int[]) or a string array (String[]), or whatever other type. So this is only a single parameter. Whenever we want to index into the array, we always use integers.

In Haskell, we get to choose both the type that the array stores AND the type that indexes the array. Now, the indexing type has to belong to the index (Ix) typeclass. And in this case we’ll be using Int anyways. But it’s cool to know that you have more flexibility. For instance, consider representing a matrix. In Java, we have to use an “array of arrays”. This involves a lot of awkward syntax. In Haskell, we can instead use a single array indexed by tuples of integers! We could also do something like index from 1 instead of 0 if we wanted.

So for our problem, we’ll use Array Int Int for our inner fence values instead of a normal list. We'll only need to make a few code changes though! First, we'll import a couple modules and change our type to use the array:

import Data.Arrayimport Data.Ix (range)

...

newtype FenceValues = FenceValues { unFenceValues :: Array Int Int }

Next, instead of using (!!) to access by index, we’ll use the specialized array index (!) operator to access them.

valueAtIndex :: FenceValues -> FenceIndex -> IntvalueAtIndex values index = (unFenceValues values) ! (unFenceIndex index)

Finally, let’s improve our minimumHeight function. We’ll now use the range function on our array instead of resorting to drop and take. Note we now use right - 1 since we want to exclude the right endpoint of the interval.

where    valsInInterval :: [(FenceIndex, Int)]    valsInInterval = zip       (FenceIndex <$> intervalRange)       (map ((unFenceValues values) !) intervalRange)      where        intervalRange = range (left, right - 1)

We’ll also have to change our benchmarking code to produce arrays instead of lists:

import Data.Array(listArray)

…

randomList :: Int -> IO FenceValuesrandomList n = FenceValues . mkListArray <$>   (sequence $ replicate n (randomRIO (1, 10000 :: Int)))  where    mkListArray vals = listArray (0, (length vals) - 1) vals

Both our library and our benchmark now need to use array in their build-depends section of the Cabal file. We need to make sure we add this! Once we have, we can benchmark our code again, and we’ll find it’s already sped up quite a bit!

>> stack bench --profileRunning 1 benchmarks...Benchmark fences-benchmarks: RUNNING...benchmarking fences tests/Size 1 Testtime                 49.33 ns   (48.98 ns .. 49.71 ns)                     1.000 R²   (0.999 R² .. 1.000 R²)mean                 49.46 ns   (49.16 ns .. 49.86 ns)std dev              1.105 ns   (861.0 ps .. 1.638 ns)variance introduced by outliers: 33% (moderately inflated)

benchmarking fences tests/Size 10 Testtime                 4.541 μs   (4.484 μs .. 4.594 μs)                     0.999 R²   (0.998 R² .. 1.000 R²)mean                 4.496 μs   (4.456 μs .. 4.531 μs)std dev              132.0 ns   (109.6 ns .. 164.3 ns)variance introduced by outliers: 36% (moderately inflated)

benchmarking fences tests/Size 100 Testtime                 79.81 μs   (79.21 μs .. 80.45 μs)                     0.999 R²   (0.999 R² .. 1.000 R²)mean                 79.51 μs   (78.93 μs .. 80.39 μs)std dev              2.396 μs   (1.853 μs .. 3.449 μs)variance introduced by outliers: 29% (moderately inflated)

benchmarking fences tests/Size 1000 Testtime                 1.187 ms   (1.158 ms .. 1.224 ms)                     0.995 R²   (0.992 R² .. 0.998 R²)mean                 1.170 ms   (1.155 ms .. 1.191 ms)std dev              56.61 μs   (48.02 μs .. 70.28 μs)variance introduced by outliers: 37% (moderately inflated)

benchmarking fences tests/Size 10000 Testtime                 15.03 ms   (14.71 ms .. 15.32 ms)                     0.997 R²   (0.994 R² .. 0.999 R²)mean                 15.71 ms   (15.44 ms .. 16.03 ms)std dev              729.7 μs   (569.3 μs .. 965.4 μs)variance introduced by outliers: 16% (moderately inflated)

benchmarking fences tests/Size 100000 Testtime                 191.4 ms   (189.2 ms .. 193.9 ms)                     1.000 R²   (1.000 R² .. 1.000 R²)mean                 189.3 ms   (188.2 ms .. 190.5 ms)std dev              1.471 ms   (828.0 μs .. 1.931 ms)variance introduced by outliers: 14% (moderately inflated)

Benchmark fences-benchmarks: FINISH

Here’s what the multiplicative factors are:

Size 1: 49.33 nsSize 10: 4.451 μs (increased ~90x)Size 100: 79.81 μs (increased ~18x)Size 1000: 1.187 ms (increased ~15x)Size 10000: 15.03 ms (increased ~13x)Size 100000: 191.4 ms (increased ~13x)

For the later cases, increasing size by a factor 10 seems to only increase the time by a factor of 13–15. We could be forgiven for thinking we have achieved O(n log n) time already!

Better Test Cases

But something still doesn’t sit right. We have to remember that the theory doesn’t quite justify our excitement here. In fact our old code was SO BAD that the NORMAL case was O(n^2). Now it seems like we may have gotten O(n log n) for the average case. But we want to prepare for the worst case if we can. Imagine our code is being used by our evil adversary:

He’ll find the worst possible case! In this situation, our code will not be so performant when the lists of input heights is sorted!

main :: IO ()main = do  [l1, l2, l3, l4, l5, l6] <- mapM     randomList [1, 10, 100, 1000, 10000, 100000]  let l7 = sortedList  defaultMain    [ bgroup "fences tests"         ...       , bench "Size 100000 Test" $ whnf largestRectangle l6      , bench "Size 100000 Test (sorted)" $ whnf largestRectangle l7      ]    ]

...

sortedList :: FenceValuessortedList = FenceValues $ listArray (0, 99999) [1..100000]

We’ll once again find that this last case takes a loooong time, and we’ll see a big spike in run time.

>> stack bench --profileRunning 1 benchmarks...Benchmark fences-benchmarks: RUNNING...

…

benchmarking fences tests/Size 100000 Test (sorted)time                 378.1 s    (355.0 s .. 388.3 s)                     1.000 R²   (0.999 R² .. 1.000 R²)mean                 384.5 s    (379.3 s .. 387.2 s)std dev              4.532 s    (0.0 s .. 4.670 s)variance introduced by outliers: 19% (moderately inflated)

Benchmark fences-benchmarks: FINISH

It averages more than 6 minutes per case! But this time, we’ll see the profiling output has changed. It only calls out various portions of minimumHeightIndexValue! We no longer spend a lot of time in valueAtIndex.

COST CENTRE                                          %time %alloc

minimumHeightIndexValue.valsInInterval               65.0   67.7minimumHeightIndexValue                              22.4    0.0minimumHeightIndexValue.valsInInterval.intervalRange 12.4   32.2

So now we have to solve this new problem by improving our calculation of the minimum.

Segment Trees

Our current approach still requires us to look at every element in our interval. Even though some of our intervals will be small, there will be a lot of these smaller calls, so the total time is still O(n^2). We need a way to find the smallest item and value on a given interval without resorting to a linear scan.

One idea would be to develop an exhaustive list of all the answers to this question right at the start. We could make a mapping from all possible intervals to the smallest index and value in the interval. But this won’t help us in the end. There are still n^2 possible intervals. So creating this data structure will still mean that our code takes O(n^2) time.

But we’re on the right track with the idea of doing some of the work before hand. We’ll have to use a data structure that’s not an exhaustive listing though. Enter segment trees.

A segment tree has the same structure as a binary search tree. Instead of storing a single value though, each node corresponds to an interval. Each node will store its interval, the smallest value over that interval, and the index of that value.

The top node on the tree will refer to the interval of the whole array. It’ll store the pair for the smallest value and index overall. Then it will have two children nodes. The left one will have the minimum pair over the first half of the tree, and the right one will have the second half. The next layer will break it up into quarters, and so on.

As an example, let’s consider how we would determine the minimum pair starting from the first quarter point and ending at the third quarter point. We’ll do this using recursion. First, we’ll ask the left subtree for the minimum pair on the interval from the quarter point to the half point. Then we’ll query the right tree for the smallest pair from the half point to the three-quarters point. Then we can take the smallest of those and return it. I won’t go into all the theory here, but it turns out that even in the worst case this operation takes O(log n) time.

Designing Our Segment Tree

There is a library called Data.SegmentTree on hackage. But I thought it would be interesting to implement this data structure from scratch. We'll compose our tree from SegmentTreeNodes. Each node is either empty, or it contains six fields. The first two refer to the interval the node spans. The next will be the minimum value and the index of that value over the interval. And then we’ll have fields for each of the children nodes of this node:

data SegmentTreeNode = ValueNode  { fromIndex :: FenceIndex  , toIndex :: FenceIndex  , value :: Int  , minIndex :: FenceIndex  , leftChild :: SegmentTreeNode  , rightChild :: SegmentTreeNode  }  | EmptyNode

We could make this Segment Tree type a lot more generic so that it isn’t restricted to our fence problem. I would encourage you to take this code and try that as an exercise!

Building the Segment Tree

Now we’ll add our preprocessing step where we’ll actually build the tree itself. This will use the same interval/tail pattern we saw before. In the base case, the interval’s span is only 1, so we make a node containing that value with empty sub-children. We’ll also add a catchall that returns an EmptyNode:

buildSegmentTree :: Array Int Int -> SegmentTreeNodebuildSegmentTree ints = buildSegmentTreeTail   ints   (FenceInterval ((FenceIndex 0), (FenceIndex (length (elems ints)))))

buildSegmentTreeTail :: Array Int Int -> FenceInterval -> SegmentTreeNodebuildSegmentTreeTail array  (FenceInterval (wrappedFromIndex@(FenceIndex fromIndex), wrappedToIndex@(FenceIndex toIndex)))  | fromIndex + 1 == toIndex = ValueNode       { fromIndex = wrappedFromIndex      , toIndex = wrappedToIndex      , value = array ! fromIndex      , minIndex = wrappedFromIndex      , leftChild = EmptyNode      , rightChild = EmptyNode      }  | … missing case  | otherwise = EmptyNode

Now our middle case will be the standard case. First we’ll divide our interval in half and make two recursive calls.

where   average = (fromIndex + toIndex) `quot` 2  -- Recursive Calls  leftChild = buildSegmentTreeTail     array (FenceInterval (wrappedFromIndex, (FenceIndex average)))  rightChild = buildSegmentTreeTail     array (FenceInterval ((FenceIndex average), wrappedToIndex))

Next we’ll write a function that’ll extract the minimum value and index, but handle the empty node case:

-- Get minimum val and index, but account for empty case.valFromNode :: SegmentTreeNode -> (Int, FenceIndex)valFromNode EmptyNode = (maxBound :: Int, FenceIndex (-1))valFromNode n@ValueNode{} = (value n, minIndex n)

Now we’ll compare the three cases for the this minimum. It’ll likely be the values from the left or the right. Otherwise it’s the current value.

leftCase = valFromNode leftChildrightCase = valFromNode rightChildcurrentCase = (array ! fromIndex, wrappedFromIndex)(newValue, newIndex) = min (min leftCase rightCase) currentCase

Finally we’ll complete our definition by filling in the missing variables in the middle/normal case. Here’s the full function:

buildSegmentTreeTail :: Array Int Int -> FenceInterval -> SegmentTreeNodebuildSegmentTreeTail array  (FenceInterval (wrappedFromIndex@(FenceIndex fromIndex), wrappedToIndex@(FenceIndex toIndex)))  | fromIndex + 1 == toIndex = ValueNode       { fromIndex = wrappedFromIndex      , toIndex = wrappedToIndex      , value = array ! fromIndex      , minIndex = wrappedFromIndex      , leftChild = EmptyNode      , rightChild = EmptyNode      }  | fromIndex < toIndex = ValueNode     { fromIndex = wrappedFromIndex    , toIndex = wrappedToIndex    , value = newValue    , minIndex = newIndex    , leftChild = leftChild    , rightChild = rightChild    }  | otherwise = EmptyNode    where       average = (fromIndex + toIndex) `quot` 2      -- Recursive Calls      leftChild = buildSegmentTreeTail         array (FenceInterval (wrappedFromIndex, (FenceIndex average)))      rightChild = buildSegmentTreeTail         array (FenceInterval ((FenceIndex average), wrappedToIndex))

      -- Get minimum val and index, but account for empty case.      valFromNode :: SegmentTreeNode -> (Int, FenceIndex)      valFromNode EmptyNode = (maxBound :: Int, FenceIndex (-1))      valFromNode n@ValueNode{} = (value n, minIndex n)

      leftCase = valFromNode leftChild      rightCase = valFromNode rightChild      currentCase = (array ! fromIndex, wrappedFromIndex)      (newValue, newIndex) = min (min leftCase rightCase) currentCase

Finding the Minimum

Now let’s write the critical function of finding the minimum over the given interval. We’ll add our tree as another parameter. Then we’ll handle the EmptyNode case and then unwrap our values for the full case:

minimumHeightIndexValue :: FenceValues -> SegmentTreeNode -> FenceInterval -> (FenceIndex, Int)minimumHeightIndexValue values tree   originalInterval@(FenceInterval (FenceIndex left, FenceIndex right)) =  case tree of    EmptyNode -> (maxBound :: Int, -1)    ValueNode      { fromIndex = FenceIndex nFromIndex      , toIndex = FenceIndex nToIndex      , value = nValue      , minIndex = nMinIndex      , leftChild = nLeftChild      , rightChild = nRightChild} ->

Next we’ll handle the base case that we are at exactly the correct node:

| left == nFromIndex && right == nToIndex = (nValue, nMinIndex)

Next we’ll observe two cases that will need only one recursive call. If the right index is below the midway point, we recursively call to the left sub-child. And if the left index is above the midway point, we’ll call on the right side (we’ll calculate the average later).

| otherwise = if right < average   then minimumHeightIndexValue values nLeftChild originalInterval  else if left >= average    then minimumHeightIndexValue values nRightChild originalInterval

Finally we have a tricky part. If the interval does cross the halfway mark, we’ll have to divide it into two sub-intervals. Then we’ll make two recursive calls, and get their solutions. Finally, we’ll compare the two solutions and take the smaller one.

else minTuple leftResult rightResult  where    average = (nFromIndex + nToIndex) `quot` 2    leftResult = minimumHeightIndexValue values nLeftChild      (FenceInterval (FenceIndex left, FenceIndex average))    rightResult = minimumHeightIndexValue values nRightChild      (FenceInterval (FenceIndex average, FenceIndex right))    minTuple :: (FenceIndex, Int) -> (FenceIndex, Int) -> (FenceIndex, Int)    minTuple old@(_, heightOld) new@(_, heightNew) =      if heightNew < heightOld then new else old

Here’s the full function for clarity:

minimumHeightIndexValue :: FenceValues -> SegmentTreeNode -> FenceInterval -> (FenceIndex, Int)minimumHeightIndexValue values tree   originalInterval@(FenceInterval (FenceIndex left, FenceIndex right)) =  case tree of    EmptyNode -> (maxBound :: Int, -1)    ValueNode      { fromIndex = FenceIndex nFromIndex      , toIndex = FenceIndex nToIndex      , value = nValue      , minIndex = nMinIndex      , leftChild = nLeftChild      , rightChild = nRightChild} ->        | left == nFromIndex && right == nToIndex = (nValue, nMinIndex)        | otherwise = if right < average           then minimumHeightIndexValue values nLeftChild originalInterval          else if left >= average            then minimumHeightIndexValue values nRightChild originalInterval            else minTuple leftResult rightResult          where            average = (nFromIndex + nToIndex) `quot` 2            leftResult = minimumHeightIndexValue values nLeftChild              (FenceInterval (FenceIndex left, FenceIndex average))            rightResult = minimumHeightIndexValue values nRightChild              (FenceInterval (FenceIndex average, FenceIndex right))          minTuple :: (FenceIndex, Int) -> (FenceIndex, Int) -> (FenceIndex, Int)          minTuple old@(_, heightOld) new@(_, heightNew) =            if heightNew < heightOld then new else old

Touching Up the Rest

Once we’ve accomplished this, the rest is pretty straightforward. First, we’ll build our segment tree at the beginning and pass that as a parameter to our function. Then we’ll plug in our new minimum function in place of the old one. We’ll make sure to add the tree to each recursive call as well.

largestRectangle :: FenceValues -> FenceSolutionlargestRectangle values = largestRectangleAtIndices values   (buildSegmentTree (unFenceValues values)  (FenceInterval (FenceIndex 0, FenceIndex (length (unFenceValues values))))

…-- Notice the extra parameterlargestRectangleAtIndices :: FenceValues -> SegmentTreeNode -> FenceInterval -> FenceSolutionlargestRectangleAtIndices  values  tree…      where      …      -- And down here add it to each call      (minIndex, minValue) = minimumHeightIndexValue values tree interval      leftCase = largestRectangleAtIndices values tree (FenceInterval (leftIndex, minIndex))      rightCase = if minIndex + 1 == rightIndex        then FenceSolution (maxBound :: Int)        else largestRectangleAtIndices values tree (FenceInterval (minIndex + 1, rightIndex))

And now we can run our benchmarks again. This time, we’ll see that our code runs a great deal faster on both large cases! Success!

benchmarking fences tests/Size 100000 Testtime                 179.1 ms   (173.5 ms .. 185.9 ms)                     0.999 R²   (0.998 R² .. 1.000 R²)mean                 184.1 ms   (182.7 ms .. 186.1 ms)std dev              2.218 ms   (1.197 ms .. 3.342 ms)variance introduced by outliers: 14% (moderately inflated)

benchmarking fences tests/Size 100000 Test (sorted)time                 238.4 ms   (227.2 ms .. 265.1 ms)                     0.998 R²   (0.989 R² .. 1.000 R²)mean                 243.5 ms   (237.0 ms .. 251.8 ms)std dev              8.691 ms   (2.681 ms .. 11.83 ms)variance introduced by outliers: 16% (moderately inflated)

Conclusion

So in these past two articles we’ve learned a whole lot. We first covered how to create benchmarks for our code using Cabal/Stack. When we ran those benchmarks, we found results took longer than we would like. We then used profiling to determine what the problematic functions were. Then we dove head-first into some data structures knowledge. We saw first hand how changing the underlying data structures of our program could improve our performance. We also learned about arrays, which are somewhat overlooked in Haskell. Then we built a segment tree from scratch and used its API to enable our program’s improvements.

Next week on the Monday Morning Haskell blog, we’ll continue our exploration of data structures by learning about graphs! We’ll look at the Functional Graph Library and see how it makes it easy to solve certain problems!

This problem involved many different uses of recursion. If you want to become a better functional programmer, you’d better learn recursion. If you want a better grasp of this fundamental concept, you should check out our FREE Recursion Workbook. It has two chapters of useful information as well as 10 practice problems!

If you’ve never written Haskell before but are intrigued by the possibilities you saw in this article, you should try it out! Download our Getting Started Checklist! It’ll walk you through installing the language. It’ll also point you to some cool resources for starting your Haskell education.

Finally, be sure to check out our Stack mini-course. Once you’ve mastered the Stack tool, you’ll be well on your way to making Haskell projects like a Pro!

Hacker Noon is how hackers start their afternoons. We’re a part of the @AMIfamily. We are now accepting submissions and happy to discuss advertising & sponsorship opportunities.

To learn more, read our about page, like/message us on Facebook, or simply, tweet/DM @HackerNoon.

If you enjoyed this story, we recommend reading our latest tech stories and trending tech stories. Until next time, don’t take the realities of the world for granted!