**TL;DR:** Caching bit masks looks smart, but in these benchmarks it made code 36-70% slower - and a "simpler" running-mask variant was almost 6× slower. Modern CPUs and compilers make direct computation faster than memory lookups. Measure, don't assume.

## The Interview Question

You're given an array and asked to generate its powerset - all possible subsets. For `[1, 2, 3]`, that's:

```
[], [1], [2], [1,2], [3], [1,3], [2,3], [1,2,3]
```

One common solution uses recursion; it's well documented elsewhere, so I won't dive into it. There's another elegant solution that uses binary representation: for an array of length n, count from 0 to 2^n - 1, where each bit pattern represents which elements to include:

```
000 → []
001 → [1]
010 → [2]
011 → [1,2]
100 → [3]
101 → [1,3]
110 → [2,3]
111 → [1,2,3]
```

The code is beautiful:

```javascript
function generatePowerset(arr) {
  const result = [];
  const n = arr.length;

  for (let mask = 0; mask < (1 << n); mask++) {
    const subset = [];
    for (let i = 0; i < n; i++) {
      if (mask & (1 << i)) { // Check if bit i is set
        subset.push(arr[i]);
      }
    }
    result.push(subset);
  }

  return result;
}
```

## The "Optimization" That Wasn't

While implementing this in JavaScript, I noticed `(1 << i)` being computed millions of times. My programmer instinct screamed: "Cache it!"

```javascript
// "Optimized" version
const BIT_MASKS = new Array(32);
for (let i = 0; i < 32; i++) {
  BIT_MASKS[i] = 1 << i; // Compute once
}

// Then use:
if (mask & BIT_MASKS[i]) { // Reuse millions of times
  subset.push(arr[i]);
}
```

This seems like textbook optimization: compute once, use many. But when I benchmarked it, something shocking happened.

**The "optimized" version was 36% slower.**

This sent me down a rabbit hole across languages, compilers, and CPU architecture.

## The Investigation

I created three test implementations:

- **Variable Shift**: Compute `(1 << i)` every time (variable shift amount)
- **Local Precalculation**: Create the mask array on every function call
- **Global Precalculation**: Create the mask array once at startup

Each approach counts set bits in 32-bit integers - a perfect isolated test of bit operations.

### Testing in JavaScript (V8 JIT)

Let's see how these three strategies compare in JavaScript with V8's JIT compiler:

```
// Dataset: 100,000 random 32-bit integers
// 50 iterations for statistical validity

Results:
No Precalc:      8.999ms (baseline)
Global Precalc: 12.250ms (+36% slower)
Local Precalc:  15.303ms (+70% slower)
```

Direct computation won decisively. But was this JavaScript-specific?
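The C tests below compare the same strategies. As a reference point, here is a minimal sketch of the kind of kernels involved - the names `count_bits_shift`, `count_bits_precalc`, and `GLOBAL_BIT_MASKS` are placeholders of mine, not the exact code from the benchmark repository:

```c
#include <stdint.h>

/* Global Precalculation: the mask table is filled once at startup. */
static uint32_t GLOBAL_BIT_MASKS[32];

void init_masks(void) {
    for (int i = 0; i < 32; i++) {
        GLOBAL_BIT_MASKS[i] = 1u << i; /* compute once */
    }
}

/* Variable Shift: compute (1u << i) on every iteration. */
int count_bits_shift(uint32_t num) {
    int count = 0;
    for (int i = 0; i < 32; i++) {
        if (num & (1u << i)) count++; /* unsigned literal avoids UB at bit 31 */
    }
    return count;
}

/* Global Precalculation: look the mask up instead of computing it. */
int count_bits_precalc(uint32_t num) {
    int count = 0;
    for (int i = 0; i < 32; i++) {
        if (num & GLOBAL_BIT_MASKS[i]) count++;
    }
    return count;
}
```

The Local Precalculation variant is the same as `count_bits_precalc`, except the table is rebuilt inside the function on every call.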
### Testing in C (Unoptimized: -O0)

Now testing in C without compiler optimization to see the raw hardware behavior:

```
// Dataset: 1,000,000 numbers
// Pure C without compiler optimization

Results:
No Precalc:     126.762ms (baseline)
Global Precalc: 175.930ms (+39% slower)
Local Precalc:  217.756ms (+72% slower)
```

Nearly identical percentages to the JavaScript results! This suggested a fundamental hardware truth, not a language quirk.

### Testing in C (Optimized: -O2)

Here's where it got interesting. With aggressive optimization, the compiler can recognize that the global array is constant and effectively eliminate the lookup at compile time:

```
// Dataset: 1,000,000 numbers (3.8 MB)

Results:
Global Precalc: 1.925ms (FASTEST!)
Local Precalc:  2.006ms (+4%)
No Precalc:     2.231ms (+16%)
```

The compiler **reversed the results**! By proving the global array is constant, GCC turned memory lookups into optimized code that beats runtime computation.

But when I increased the dataset to 10 million numbers (38 MB) - exceeding the CPU cache:

```
// Dataset: 10,000,000 numbers (38 MB)

Results:
No Precalc:     16.617ms (FASTEST)
Global Precalc: 17.575ms (+6%)
Local Precalc:  17.927ms (+8%)
```

Direct computation won again. The compiler's optimization hit a wall when data exceeded cache capacity.

## The Fourth Approach: Running Mask

One more optimization seemed promising - instead of computing `(1 << i)` with a variable shift, maintain a running mask and shift by a constant:

```c
// Instead of a variable shift:
for (int i = 0; i < 32; i++) {
    if (num & (1u << i)) { count++; } // Variable shift amount
}

// Use a constant shift:
uint32_t mask = 1;
for (int i = 0; i < 32; i++) {
    if (num & mask) { count++; }
    mask <<= 1; // Constant shift by 1
}
```

This appears even better:

- Shift by a constant 1 (simpler than a variable shift)
- One operation per iteration
- Cleaner code

**Result: 5.85× slower!** (97ms vs 17ms)

Here's how all four approaches compare with production-scale data:

```
// Dataset: 10,000,000 numbers, 1000 iterations

Results:
Variable Shift: 16.614ms (FASTEST)
Global Precalc: 17.622ms (+6%)
Local Precalc:  17.803ms (+7%)
Running Mask:   97.159ms (+485% - catastrophically slow!)
```
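For context, the C timings above are simple wall-clock measurements over many passes. The sketch below shows the kind of harness such numbers typically come from; it is my illustration (the kernel is repeated here so the example is self-contained), not the exact benchmark code, which lives in the repository linked at the end:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int count_bits_shift(uint32_t num) {
    int count = 0;
    for (int i = 0; i < 32; i++)
        if (num & (1u << i)) count++;
    return count;
}

int main(void) {
    enum { N = 10000000, ITERATIONS = 1000 };
    uint32_t *data = malloc(N * sizeof *data);
    if (!data) return 1;
    for (size_t i = 0; i < N; i++) data[i] = (uint32_t)rand();

    struct timespec start, end;
    volatile long sink = 0; /* keep the result live so the loop isn't removed */

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int it = 0; it < ITERATIONS; it++)
        for (size_t i = 0; i < N; i++)
            sink += count_bits_shift(data[i]);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ms = (end.tv_sec - start.tv_sec) * 1e3 +
                (end.tv_nsec - start.tv_nsec) / 1e6;
    printf("total: %.3f ms (avg %.3f ms per pass)\n", ms, ms / ITERATIONS);

    free(data);
    return (int)(sink & 1); /* use sink so it cannot be optimized away */
}
```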
So what makes the running mask so catastrophically slow? **Data dependencies destroy CPU parallelism.**

## Why Running Mask is So Slow

The running mask creates a tight dependency chain that prevents the CPU from executing iterations in parallel.

**Variable shift (fast):** each iteration is independent.

```c
for (int i = 0; i < 32; i++) {
    if (num & (1u << i)) { ... }
    // No dependency between iterations
}
```

- The CPU can execute multiple iterations in parallel
- Instruction-level parallelism works
- Out-of-order execution optimizes throughput

**Running mask (slow):** each iteration waits for the previous one.

```c
uint32_t mask = 1;
for (int i = 0; i < 32; i++) {
    if (num & mask) { ... }
    mask <<= 1; // Next iteration MUST WAIT for this!
}
```

- Serialized execution (no parallelism possible)
- Pipeline stalls on the read-after-write hazard
- The CPU sits idle waiting for `mask` to update
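To see how much the dependency chain matters, here is an illustrative variant of mine (not code from the original benchmark) that keeps the constant-shift idea but splits the work into four short, independent chains, giving the CPU parallelism to exploit. Whether it actually beats the plain variable shift on a given machine is exactly the kind of thing to measure rather than assume:

```c
#include <stdint.h>

/* Four independent masks cover bits 0-7, 8-15, 16-23, and 24-31.
 * Each chain still uses a constant shift, but the chains don't depend
 * on each other, so the CPU can overlap them. */
int count_bits_four_chains(uint32_t num) {
    uint32_t m0 = 1u, m1 = 1u << 8, m2 = 1u << 16, m3 = 1u << 24;
    int c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (int i = 0; i < 8; i++) {
        if (num & m0) c0++;
        if (num & m1) c1++;
        if (num & m2) c2++;
        if (num & m3) c3++;
        /* four short dependency chains instead of one long one */
        m0 <<= 1; m1 <<= 1; m2 <<= 1; m3 <<= 1;
    }
    return c0 + c1 + c2 + c3;
}
```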
## The Answer: Cache Architecture

The mystery unraveled when I examined my test machine's cache.

**Apple Silicon M-series cache:**

- L1 Data: 128 KB
- L2: 16 MB (per P-core cluster)
- L3: none (uses a System Level Cache instead)

**Dataset size vs cache:**

| Dataset | Size   | Fits in L2?        | Winner (-O2)   |
|---------|--------|--------------------|----------------|
| 1M      | 3.8 MB | ✓ Yes (24% of L2)  | Global Precalc |
| 10M     | 38 MB  | ✗ No (238% of L2)  | No Precalc     |
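Those cache figures are specific to my machine. To run the comparison on yours, macOS exposes the relevant sizes through sysctl; here is a small sketch (an assumption of mine: the keys `hw.l1dcachesize` and `hw.l2cachesize` are the commonly documented ones, and on Apple Silicon the per-cluster values are also reported under `hw.perflevel0.*`, so check what your system actually exposes):

```c
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/sysctl.h> /* macOS / BSD */

/* Read a numeric sysctl value; returns -1 if the key is unavailable. */
static int64_t read_sysctl_i64(const char *name) {
    int64_t value = 0;
    size_t len = sizeof value;
    if (sysctlbyname(name, &value, &len, NULL, 0) != 0) return -1;
    return value;
}

int main(void) {
    printf("L1 data cache: %lld bytes\n",
           (long long)read_sysctl_i64("hw.l1dcachesize"));
    printf("L2 cache:      %lld bytes\n",
           (long long)read_sysctl_i64("hw.l2cachesize"));
    return 0;
}
```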
## Why This Happens: The Performance Model

### When Data Fits in Cache (≤ 4 MB)

Memory access latency:

- L2 cache: ~4 nanoseconds
- Bit shift: ~0.5 nanoseconds

The array lookup is 8× slower, but still fast enough that aggressive compiler optimization can make it competitive or even better:

```c
// What GCC -O2 can do:
// 1. Recognize that GLOBAL_BIT_MASKS is constant
// 2. Keep it in L1 cache or registers
// 3. Optimize memory access patterns
// 4. Reorder instructions for pipelining
//
// Result: the array lookup becomes nearly free
```

### When Data Exceeds Cache (> 16 MB)

Memory access latency:

- Main RAM: ~100 nanoseconds
- Bit shift: ~0.5 nanoseconds

The array lookup is now **200× slower**! The compiler's optimization can't overcome the physics of memory latency.

```c
// Even with optimization:
// - The working set must be fetched from RAM
// - Data is constantly evicted from cache
// - Cache thrashing occurs
// - Memory latency dominates
//
// Result: direct computation wins
```

## The Fundamental Truth

At the CPU level:

```
Bit Shift: Single ALU instruction (0.3-0.5 ns)
    ↓
[CPU Register]
    ↓
Result

Array Lookup: Memory fetch + dereference
    ↓
[Check L1 Cache?] → Miss
    ↓
[Check L2 Cache?] → Miss
    ↓
[Fetch from RAM] → 100 ns!
    ↓
Result
```

**Bit shifts execute in the CPU's Arithmetic Logic Unit (ALU) with no memory access.** Even the fastest memory is slower than no memory access at all.

## Why Compiler Optimization Changes Things

Modern compilers like GCC can perform aggressive optimizations.

### For Global Arrays (-O2)

```c
// Source code: mask & GLOBAL_BIT_MASKS[i]

// The compiler sees:
// - GLOBAL_BIT_MASKS is constant
// - The values are powers of 2
// - They can be precomputed at compile time
// - They can be kept in registers
// - Runtime lookups can be eliminated
// Result: sometimes faster than dynamic shifts
```

### For Bit Shifts

```c
// Source code: mask & (1 << i)

// The compiler sees:
// - Must be computed at runtime (i is a variable)
// - A simple instruction, but it can't be eliminated
// - Can be pipelined with other operations
// - Very fast, but not zero cost
// Result: consistently fast, never optimized away
```

**The paradox:** when data fits in cache, the compiler can optimize the "slow" approach to be faster than the "fast" approach!
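One practical note on the global-array variant above: the easiest way to let the compiler treat the table as a true constant is to declare it `static const` and initialize it at compile time, rather than filling it in at startup as in the earlier sketch. This is my sketch, not the benchmark's exact declaration:

```c
#include <stdint.h>

/* Fully initialized, read-only table: the optimizer can fold these
 * values directly into the code that uses them. */
static const uint32_t GLOBAL_BIT_MASKS[32] = {
    1u << 0,  1u << 1,  1u << 2,  1u << 3,  1u << 4,  1u << 5,  1u << 6,  1u << 7,
    1u << 8,  1u << 9,  1u << 10, 1u << 11, 1u << 12, 1u << 13, 1u << 14, 1u << 15,
    1u << 16, 1u << 17, 1u << 18, 1u << 19, 1u << 20, 1u << 21, 1u << 22, 1u << 23,
    1u << 24, 1u << 25, 1u << 26, 1u << 27, 1u << 28, 1u << 29, 1u << 30, 1u << 31
};

int count_bits_precalc_const(uint32_t num) {
    int count = 0;
    for (int i = 0; i < 32; i++) {
        if (num & GLOBAL_BIT_MASKS[i]) count++;
    }
    return count;
}
```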
## JavaScript JIT Limitations

V8's Just-In-Time compiler is powerful but more conservative than static compilers.

**Why V8 can't optimize as aggressively:**

- It must handle dynamic types
- It must maintain JavaScript's language semantics
- It compiles at runtime (less time for optimization)
- It has less whole-program knowledge
- It can't prove that global arrays are truly constant
- Arrays aren't really arrays

## JavaScript Arrays Are Not Native Arrays

A critical difference that is often overlooked:

```javascript
// In JavaScript, this looks like a C array:
const BIT_MASKS = new Array(32);

// But it's actually a complex data structure, more like Java's ArrayList:
// - Dynamic resizing
// - Sparse storage (holes are allowed)
// - Can hold mixed types
// - Properties can be added/deleted
// - Bounds checking at runtime
// - Potential hash-table backing for sparse arrays
```

In C:

```c
uint32_t BIT_MASKS[32];
// Contiguous memory block
// Fixed size, single type
// Direct pointer arithmetic
// No runtime overhead
```

In JavaScript:

```javascript
const BIT_MASKS = new Array(32);
// Actually a managed object with:
// - Type checks on access
// - Bounds validation
// - Potential deoptimization if used incorrectly
// - Much more overhead than raw memory access
```

Even when V8 optimizes JavaScript arrays to use contiguous storage (packed arrays), they still have more overhead than native C arrays. The JIT must guard against type changes, resizing, and other dynamic behaviors that don't exist in C.

**Result:** in JavaScript, direct bit shifts always win because:

- The JIT can't apply the same aggressive optimizations as GCC
- Array access has inherent overhead that C doesn't have
- Even "optimized" JavaScript arrays aren't as fast as C arrays

## Practical Recommendations

### For JavaScript/TypeScript Developers

✅ **Always use direct bit shifts:**

```javascript
mask & (1 << i) // Use this
```

❌ **Don't precalculate:**

```javascript
mask & BIT_MASKS[i] // Avoid this
```

**Why:** V8 can't optimize array lookups as aggressively as static compilers. Direct computation is 30-70% faster.

### For C/C++ Developers

The answer is more nuanced.

**Production builds (-O2/-O3) with small data (fits in cache):**

- This was the one case where global precalculation won in my tests (~15% faster than the variable shift)
- The compiler folds the constant table into optimized code
- The margin is small and disappears at scale, so variable shift remains a reasonable default

**Production builds with large data:**

- Variable shift wins decisively
- Memory latency dominates at scale
- ~6% faster than precalculation, ~485% faster than the running mask

**Debug builds (-O0):**

- Always use the variable shift `(1 << i)`
- 40-70% faster than precalculation
- Makes debug-build runs noticeably faster

**Never use a running mask:**

- The `mask <<= 1` approach is 5.8× slower
- Data dependencies prevent parallelization
- It appears simpler but kills the CPU pipeline

**Unknown dataset size:**

- Default to the variable shift `(1 << i)`
- The safest and fastest choice across all scenarios

## The General Rule

**Only "cache" computations if:**

- The computation is expensive (>10 ns)
- Your working set fits in cache
- You're using aggressive optimization
- You've measured and confirmed the benefit

**Don't cache if:**

- The computation is trivial (bit ops, arithmetic)
- The working set exceeds cache
- The "optimization" adds complexity
- You haven't profiled

**Never create data dependencies if:**

- You can compute values independently
- "Simpler" operations would serialize execution
- You're trading parallelism for "efficiency"

## The Deeper Lesson

This investigation reveals why performance optimization is so challenging.

### 1. Intuition Fails

"Compute once, use many" seems obviously better. But:

- Memory is not infinitely fast
- CPU computation can be cheaper than memory access
- Scale changes everything

### 2. Context Matters

The "right" answer depends on:

- Language/runtime (JavaScript vs C)
- Compiler optimization level (-O0 vs -O2)
- Dataset size (does it fit in cache?)
- Hardware architecture (cache sizes)

### 3. Measure, Don't Assume

My intuition said "cache it." The benchmark said "don't."

In the 1990s, when CPU cycles were precious and memory was relatively fast, caching bit masks might have been faster. In 2025, with multi-GHz CPUs and memory that's relatively slower, the opposite is true.

**Hardware evolution flips performance assumptions.**
## Conclusion

The "optimizations" of bit mask manipulation reveal surprising truths.

**Precalculating bit masks makes code:**

- 36-70% slower in JavaScript
- 40-70% slower in unoptimized C
- 6-7% slower in optimized C at scale

**Using a running mask makes code:**

- 485% slower (5.85×), even with "simpler" constant shifts
- It destroys CPU parallelism through data dependencies
- It shows that "simpler" operations can be catastrophically slow

**The fundamental reasons:**

- CPU bit shift instructions are faster than memory access
- Independent operations enable parallelism
- Data dependencies serialize execution and stall pipelines
- Modern CPUs optimize for instruction-level parallelism

Modern compilers can sometimes overcome memory access costs with aggressive optimization when data fits in cache, but:

- At production scale, hardware realities prevail
- Data dependencies cannot be optimized away
- CPU parallelism trumps "simple" operations

## The Meta-Lesson

This investigation isn't really about bit shifts. It's about:

- Questioning assumptions
- Measuring instead of guessing
- Understanding your hardware
- Recognizing that "optimization" can make things slower
- Appreciating CPU instruction-level parallelism

The next time you're tempted to "optimize" something, ask:

- Is the original operation really expensive?
- Have I measured the actual cost?
- Am I trading fast computation for slow memory access?
- Does my "optimization" exceed cache capacity?
- Am I creating data dependencies that prevent parallelism?
- Does my "simpler" approach actually serialize execution?

Sometimes the best optimization is no optimization at all. And sometimes the "obviously better" approach is 6× slower.

**In short:** Compute, don't cache. Measure, don't assume. The CPU is a master of parallel math - let it work, and keep memory out of the way.

## Reproduce the Results

All source code and benchmarks are available in this repository.

Hardware: Apple Silicon M-series, macOS 24.6.0
Software: Node.js v22.14.0 (V8), GCC/Clang with C99