I was profiling a simple loop (years ago). It was similar to: countForever ref = forever $ modifyIORef' ref (+1) I counted four assembly instructions. Addition takes ~300 picoseconds, but there is memory access so let’s say two nanoseconds To profile, I wrote: iorefLoop = doref <- newIORef (0 :: Int)replicateM_ 100000000 $ modifyIORef' ref (+1) I expected this to take ~ 0.2 seconds, but … $ bench ./bin/ioref-loop benchmarking ./bin/ioref-looptime 332.3 ms (330.4 ms .. 333.8 ms)1.000 R² (1.000 R² .. 1.000 R²)mean 334.4 ms (333.2 ms .. 335.9 ms)std dev 2.822 ms (2.182 ms .. 3.942 ms) Was this a good time? Was I expecting something unrealistic from Haskell? As a sanity check, I wrote an analogous in C: program int main (int argc, const char** argv){int ref = 0;while (ref < 100000000){ref++;}} I compiled and ran it like: $ gcc -O2 ./src/CLoop.c -o ./bin/CLoop && bench ./bin/CLoop benchmarking ./bin/CLooptime 3.999 ms (3.794 ms .. 4.235 ms)0.980 R² (0.972 R² .. 0.992 R²)mean 3.940 ms (3.838 ms .. 4.093 ms)std dev 593.3 μs (422.8 μs .. 816.0 μs)variance introduced by outliers: 90% (severely inflated) This was much faster than I had expected. Maybe isn’t accurate when programs are fast? bench I went on IRC (different times because GHC makes faster code now): jfischoff ~ Is there anyway to make modifyIORef' faster? jfischoff ~ I am surprised that in a second I was only able to ↪ update this ref 100 million times: ↪ timeout 1000000 $ forever $ modifyIORef' x (1+)jfischoff ~ where as c++ was able to do the same in 4 millisecondsglguy ~ c++ was able to do 1 update every 0.04 nanoseconds?glguy ~ an update rate of 25 gigahertz?dv- ~ gcc probably just replaced it with a constantjfischoff ~ dv-: perhapsglguy ~ That or C++ unlocks the fast mode of an Intel processor Burn. $ gcc -O2 CLoop.c -S Here’s the assembly: ; CLoop.s; Notice, there are no jumps.; There is no looping._main: ## .cfi_startproc## BB#0:pushq %rbpLtmp0:.cfi_def_cfa_offset 16Ltmp1:.cfi_offset %rbp, -16movq %rsp, %rbpLtmp2:.cfi_def_cfa_register %rbpxorl %eax, %eaxpopq %rbpretq.cfi_endproc @main was right. The compiler got rid of the loop, I’m assuming, because I wasn’t using the result. I added a keyword to prevent : dv- volatile optimizations //CountForver.cint main (int argc, const char** argv){volatile int ref = 0;while (ref < 100000000){ref++;}} Compiling: $ gcc -O2 ./src/CLoopVolatile.c -o ./bin/CLoopVolatile$ bench ./bin/CLoopVolatile benchmarking ./bin/CLoopVolatile time 158.2 ms (156.7 ms .. 159.5 ms)1.000 R² (0.999 R² .. 1.000 R²)mean 159.1 ms (158.4 ms .. 160.2 ms)std dev 2.423 ms (1.755 ms .. 3.772 ms) The assembly of the loop is: loop:incl -4(%rbp) # Increment the value at the address stored# in rbp displaced by 4.movl -4(%rbp), %eax # move the value at the address stored in rbp# displaced by 4 to eax.cmpl $100000000, %eax # Compare to 100000000 to eax.jl loop # Jump to the start of the loop if the eax# was less than when compared above. So, incrementing C’s mutable refs is over twice as fast as ? Maybe, though, handwritten assembly could be faster? Maybe the keyword threw off optimizations too much? IORef volatile ; AsmLoop.asmEXTERN _exitGLOBAL start SECTION .dataalign 8iterationCount DD 100000000result DD 0 SECTION .textstart:; Copy the iteration count to the ecx register; which is used by the loopnz instructionmov ecx, [iterationCount] loopBlock:inc dword [result] ; Increment the value at the address of resultloopnz loopBlock ; Decrement the ecx counter and loop until ecx; is zero exitBlock:push dword 0 ; Set the exit code to zeromov eax, 1 ; Place the system call number (exit) in the eax regsub esp, 4 ; I have to add some dummy space to the stack for; some reasonint 0x80 ; Exit Compiling: $ nasm -fmacho src/AsmLoop.asm$ ld -o ./bin/AsmLoop -macosx_version_min 10.7.0 ./src/AsmLoop.o$ bench ./bin/AsmLoop benchmarking ./bin/AsmLooptime 185.2 ms (183.7 ms .. 186.9 ms)1.000 R² (0.999 R² .. 1.000 R²)mean 187.0 ms (185.9 ms .. 188.5 ms)std dev 3.132 ms (2.229 ms .. 4.269 ms) Wow, my l33t skillz increased the time by around 15% (three years ago this version was faster … just saying). So is about two times slower than naive approaches in C and assembly. What gives? To the core! IORef $ stack exec ghc-core -- --no-syntax -- -O2 \-dsuppress-all src/IORefLoop.hs Looking through the core I saw: ...case readMutVar# ipv1_aVW w2_a35mof _ { (# ipv2_aWt, ipv3_aWu #) ->case ipv3_aWu of _ { I# x_aUW ->case writeMutVar# ipv1_aVW (I# (+# x_aUW 1#)) ipv2_aWt... I find GHC core almost unreadable. One trick is to try to ignore most of the case statements. The first and third case statements are not for scrutinizing alternatives, but are to ensure proper sequencing of IO actions. Also, renaming the generated variables helps. Here is a cleaned up version of what was above: ...case readMutVar# ioRef token of{ (# token', x #) ->case x of{ I# unbox -> case writeMutVar# ioRef (I# (+# unbox 1#)) token'... The second case statement is unboxing an to a primitive unboxed , Int Int# case x of{ I# unbox and then boxing when setting. case writeMutVar# ioRef (I# (+# unbox 1#)) token' is the constructor (the means it is a compiler primitive). It wraps or boxes an unpacked, unboxed, "machine" int. Most of the time, GHC can unbox primitives automatically, but it can't in this case. Boxing/unboxing could be the root of the slowdown. I# Int # We need an unboxed mutable reference. I can think of two options: and . Ptr Int MutableByteArray First the version: Ptr main = alloca $ \ptr -> dopoke ptr (0 :: Int)replicateM_ 100000000 $ doi <- peek ptrlet !i' = i + 1poke ptr i' Running, I get: bench ./bin/ptr-loop benchmarking ./bin/ptr-looptime 175.2 ms (174.1 ms .. 176.2 ms)1.000 R² (0.999 R² .. 1.000 R²)mean 174.7 ms (174.1 ms .. 175.2 ms)std dev 1.626 ms (1.372 ms .. 2.089 ms) This is good! Almost twice as fast . IORef Now the version: MutableByteBuffer main = domba <- newAlignedPinnedByteArray 4 8replicateM_ 100000000 $ doi <- readByteArray mba 0 :: IO Intlet !i' = i + 1writeByteArray mba 0 i' Running: bench ./bin/mutable-byte-buffer-loop benchmarking ./bin/mutable-byte-buffer-looptime 175.2 ms (173.5 ms .. 177.3 ms)0.999 R² (0.999 R² .. 1.000 R²)mean 177.4 ms (176.5 ms .. 178.4 ms)std dev 2.423 ms (1.987 ms .. 2.919 ms) Basically identical. I’ve made progress, but I don’t want to use either or . They are not very safe ( is arguably safe enough, but I prefer not to manually allocate memory directly if I can help it). Ptr MutableByteBuffer Ptr If only there were a safe wrapper around one of the types? There is! is a safe wrapper around . Here is the version: Vector MutableByteBuffer Vector main = domba <- M.new 1replicateM_ 100000000 $ doi <- M.read mba 0 :: IO Intlet !i' = i + 1M.write mba 0 i' Running: bench ./bin/vector-loop benchmarking ./bin/vector-looptime 178.6 ms (177.3 ms .. 179.9 ms)1.000 R² (0.999 R² .. 1.000 R²)mean 178.2 ms (177.1 ms .. 179.3 ms)std dev 3.021 ms (2.082 ms .. 4.288 ms) Nice! With only a small increase in time, we get a safe interface for fast, unboxed mutable references. It is a little odd to use a collection type for a single element, but it’s not terrible, either. What’s on the Table The and the are not the fastest way one could write a loop that increments a number. That would involve putting the reference in a register for the duration of the loop, and other tricks like loop unrolling. Just to give a sense of how much faster things could be here is a some simple assembly example that does that. CLoopVolatile AsmLoop.asm ; FastAsmLoop.asmEXTERN _exitGLOBAL start SECTION .textstart:mov rcx, 0mov rax, 100000000 loopBlock:inc qword rcx ;cmp rcx, rax ;jl loopBlock ; exitBlock:mov rax, 0x2000001 ; exitmov rdi, 0syscall Compiling $ nasm -fmacho64 src/FastAsmLoop.asm$ ld -o ./bin/FastAsmLoop -macosx_version_min 10.7.0 \./src/FastAsmLoop.o$ bench ./bin/FastAsmLoop; benchmarking ./bin/FastAsmLooptime 38.38 ms (37.96 ms .. 39.15 ms)0.999 R² (0.995 R² .. 1.000 R²)mean 38.22 ms (37.98 ms .. 38.87 ms)std dev 715.8 μs (282.6 μs .. 1.261 ms) I barely believe these numbers, but I think it shows that GHC still is leaving some performance on the table. Something else worth noting, this isn’t how a tight loop would be written in Haskell anyway. One would not need a mutable reference. In a future post I would like to do more direct comparison between Haskell and assembly similar to above. The C code was really just a baseline that lead me to look into what GHC was doing. Update: Michael Snoyman alerted me to a package of his that provides an unboxed mutable reference, among other nice things: https://www.stackage.org/haddock/lts-8.16/mutable-containers-0.3.3/Data-Mutable.html#t:URef The a project with this code that can also produce these timings can be found here: https://github.com/jfischoff/are-mutable-references-in-haskell-fast is how hackers start their afternoons. We’re a part of the family. We are now and happy to opportunities. Hacker Noon @AMI accepting submissions discuss advertising & sponsorship To learn more, , , or simply, read our about page like/message us on Facebook tweet/DM @HackerNoon. If you enjoyed this story, we recommend reading our and . Until next time, don’t take the realities of the world for granted! latest tech stories trending tech stories