Bringing C++ to Query Execution: Why the Future of Data Engines Is Native

Written by hitarth | Published 2025/12/11
Tech Story Tags: c++ | open-source-presto | query-engine | database-internals | big-data | distributed-systems | performance-engineering | vectorization

TL;DR

Analytical engines like Presto, Hive and Spark were originally built on the JVM because it made development easier and offered a large ecosystem. For years that approach worked well, but as data volumes grew and queries became more complex, the JVM began to show limitations in the one place where efficiency matters most: the execution layer. The JVM introduces overhead in ways that directly affect query performance. Garbage collection can pause execution at unpredictable times. Objects are spread across the heap in ways that break CPU prefetching. SIMD instructions, which modern CPUs rely on to process multiple values at once, are difficult for the JIT to generate reliably. These issues compound when decoding Parquet and ORC files, where bit packing, dictionary lookups and nested structures need tight, low-level loops.

Native C++ engines avoid these costs. They can structure data contiguously, call SIMD instructions directly and build operators that run inside simple, predictable loops. Systems like DuckDB, ClickHouse and Databricks Photon have shown what happens when execution is aligned with hardware: CPU usage drops, latency steadies and overall throughput increases. Those gains are not incremental; they are structural.

Velox was created to make native execution usable by existing engines. It is not a full database. It is a C++ execution library that provides vectors, operators, memory management and file readers. Engines like Presto can adopt Velox without rewriting planners, connectors or the broader ecosystem. Unsupported queries fall back to the Java engine while supported queries run natively.

At Uber we migrated major Presto clusters to Velox-backed workers and saw the expected improvements. Shadow execution validated correctness across thousands of queries. Routing logic ensured only supported queries ran on the native path. Debugging shifted from JVM exceptions to native stack traces and memory patterns. Once the system settled, CPU usage decreased, tail latency improved and query behavior became more predictable.

Native execution is not a trend. It is the natural outcome of aligning execution engines with modern hardware. Velox provides a shared, reusable way for systems to make that transition without throwing away the foundations they already rely on. The result is faster processing, steadier performance and an execution model that matches the scale of the data it serves.

Introduction

For years, most large-scale data engines ran on a simple assumption: Java is good enough for query execution. Presto, Hive and Spark were all built on the JVM because it made engineering easier. It provided stability. It offered portability. The tooling was familiar. For a long time, that was enough.

But analytical workloads have grown in size and complexity. The JVM stayed the same. Modern systems scan massive columnar files, decode nested data and run billions of simple operations that depend on tight coordination with the CPU. These engines work best when memory layout, branching patterns and data movement match what the hardware expects. The JVM does not naturally operate under those constraints.

This gap led to a question that once felt almost out of place: what if query execution did not need the JVM at all?

That question opened the door to native execution. And among native engines, Velox has gained attention because it does not replace entire systems. It provides a focused C++ execution library that existing engines can adopt at their own pace.

The rest of this article explores why this shift is happening and why it matters.

Why Native Execution Took Over

When native execution engines arrived, the difference in performance was difficult to ignore.

C++ engines control memory directly. They place data in contiguous buffers. They avoid unnecessary branching. They use SIMD instructions directly. This approach aligns naturally with hardware.
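To make the layout point concrete, here is a minimal C++ sketch contrasting the two approaches. The types and field names are invented for illustration; the point is that a scan over a contiguous column touches consecutive cache lines, which prefetchers and auto-vectorizers handle well:

```cpp
#include <cstdint>
#include <vector>

// Row-oriented layout: each row is one object, so the values of a single
// column are scattered across memory, which defeats prefetching.
struct Row {
  int64_t orderId;
  double price;
  int32_t quantity;
};

// Column-oriented layout: each column is a contiguous buffer.
struct Columns {
  std::vector<int64_t> orderId;
  std::vector<double> price;
  std::vector<int32_t> quantity;
};

// Scanning one column is a tight loop over a flat buffer, which the
// compiler can auto-vectorize and the CPU can prefetch.
double sumPrices(const Columns& c) {
  double total = 0.0;
  for (double p : c.price) {
    total += p;
  }
  return total;
}
```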

DuckDB and ClickHouse demonstrated how much faster a columnar engine can run when operators execute inside simple, well-structured loops as described in this paper.

Databricks showed the same effect with Photon, which consistently outperforms Spark’s Java path. Their paper outlines these improvements in detail.

The pattern is clear. Native code avoids overhead that JVM engines cannot remove. Once this became widely known, the conversation shifted from whether native execution is better to how existing engines can adopt it without starting over.

This is where Velox enters the picture.

Velox: A Reusable Native Execution Engine

Velox is a C++ execution library designed to act as the compute core for data systems. It does not handle planning or scheduling. It focuses only on the part of a query engine where performance matters most.

Velox provides several essential components:

  • A unified columnar vector model

  • Expression evaluation

  • Join and aggregation operators

  • Parquet and ORC readers

  • Native memory management and spill logic

The full design is documented in the VLDB publication from Meta’s team.

What sets Velox apart is not just speed but reusability. Engines do not need to rebuild a C++ core. They can adopt Velox incrementally. Planners stay where they are. Connectors stay where they are. Only execution changes. If an engine only supports filters and projections in Velox, everything else stays on the Java path. Over time more operators migrate without a full rewrite.

Unsupported queries fall back to the original Java path. Supported queries run natively. The system evolves step by step.

This gives teams a practical migration path instead of a complete rewrite.
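As a rough sketch of how such routing might work, an engine can walk the plan tree and fall back to Java the moment it sees an unsupported operator. The plan representation and operator names below are hypothetical:

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical plan node: a query plan is a tree of named operators.
struct PlanNode {
  std::string name;  // e.g. "Filter", "Project", "HashJoin"
  std::vector<PlanNode> children;
};

// Operators the native path currently supports. Anything outside this set
// sends the whole query back to the original Java engine.
const std::set<std::string> kNativeSupported = {
    "TableScan", "Filter", "Project", "Aggregation"};

// Route natively only if every operator in the plan is supported.
bool canRunNatively(const PlanNode& node) {
  if (kNativeSupported.count(node.name) == 0) {
    return false;
  }
  for (const auto& child : node.children) {
    if (!canRunNatively(child)) {
      return false;
    }
  }
  return true;
}
```

As more operators are implemented natively, the supported set grows and more queries shift over, without any single flag-day cutover.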

And thanks to its modular architecture, Velox doesn’t take over your engine; it fits into it.

  • Presto teams can embed Velox as a replacement for Java workers.
  • Spark can push execution through Velox via native operator pipelines.
  • Trino, DuckDB extensions, ETL frameworks, ML feature pipelines: any system that needs fast columnar processing can reuse the same execution layer.

The diagram below highlights the only real change in a native Presto cluster. The coordinator still runs on Java, and the connectors stay untouched. What shifts is the executor inside each worker, where Velox replaces the Java engine and takes over the heavy computation.

What Makes C++ Faster Under the Hood

Native engines perform well because they follow the rules the CPU expects.

Data lives in flat vectors. Operators execute inside tight loops. Branching is limited. Memory access is predictable. Prefetching works as intended. CPUs stay busy instead of waiting on pointer chains or virtual dispatch.

C++ engines use:

  • AVX2 and AVX-512 instructions
  • Branchless arithmetic
  • Cache-aware data layout
  • Manual memory control
  • Bit-packed decoders and fast null handling

Much of this is reflected in research such as the SIMD Scan work presented at SIGMOD.
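As a small taste of what this looks like in practice, here is a minimal sketch that sums a column of 32-bit integers eight lanes at a time with AVX2 intrinsics. It assumes an x86-64 CPU with AVX2 and a compiler flag such as -mavx2; this is exactly the kind of code a JIT can emit only unreliably:

```cpp
#include <immintrin.h>  // AVX2 intrinsics
#include <cstddef>
#include <cstdint>
#include <vector>

int32_t simdSum(const std::vector<int32_t>& col) {
  __m256i acc = _mm256_setzero_si256();
  size_t i = 0;
  // Main loop: eight 32-bit additions per instruction.
  for (; i + 8 <= col.size(); i += 8) {
    __m256i v = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(col.data() + i));
    acc = _mm256_add_epi32(acc, v);
  }
  // Horizontal reduction of the eight accumulator lanes.
  __m128i lo = _mm256_castsi256_si128(acc);
  __m128i hi = _mm256_extracti128_si256(acc, 1);
  __m128i s = _mm_add_epi32(lo, hi);
  s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
  s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
  int32_t total = _mm_cvtsi128_si32(s);
  // Scalar tail for the last few rows.
  for (; i < col.size(); ++i) {
    total += col[i];
  }
  return total;
}
```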

These techniques allow native engines to move through data with far less friction. Parquet decoding becomes faster. Aggregations become faster. Hash table probes become faster.
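The decoding claim is easiest to see in code. Below is a simplified bit-unpacking loop in the spirit of Parquet's bit-packed encoding, where each value occupies a fixed number of bits in a contiguous buffer. Real decoders specialize per bit width and vectorize heavily; this generic version only sketches why the work wants a tight, predictable loop:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Unpack `count` values of `bitWidth` bits (1..32) from a little-endian
// bit-packed buffer. Assumes the buffer is padded with at least 7 extra
// bytes past the packed data, a common trick in real decoders.
std::vector<uint32_t> unpack(const uint8_t* data, size_t count, int bitWidth) {
  std::vector<uint32_t> out(count);
  const uint32_t mask =
      (bitWidth == 32) ? 0xFFFFFFFFu : ((1u << bitWidth) - 1);
  size_t bitPos = 0;
  for (size_t i = 0; i < count; ++i) {
    size_t byte = bitPos >> 3;
    int shift = static_cast<int>(bitPos & 7);
    // Assemble a 64-bit window so any value of up to 32 bits is covered.
    uint64_t window = 0;
    for (int b = 0; b < 8; ++b) {
      window |= static_cast<uint64_t>(data[byte + b]) << (8 * b);
    }
    out[i] = static_cast<uint32_t>(window >> shift) & mask;
    bitPos += bitWidth;
  }
  return out;
}
```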

Another important point is predictability. There are no GC pauses. No JIT warmup effects. A query that runs well once will likely run well again.

In practice, this means a native engine can tear through a column of integers in a tight loop with almost no wasted cycles, something the JVM struggles to guarantee.

Native Execution in Production: What It Looks Like and What We Saw at Uber

Once real workloads start running on native pipelines a few patterns appear quickly. CPU usage drops because native engines use fewer instructions per row. Latency becomes steadier and the odd slow query becomes less frequent. Tail spikes shrink. Columnar file decoding speeds up, especially for Parquet- and ORC-heavy workloads. The result is a system that feels calmer and easier to operate.

We saw the same pattern at Uber when we moved major Presto clusters from Java execution to Velox. The migration required care. Shadow execution helped validate correctness across thousands of queries. Checksums caught mismatches early and made it possible to compare results at scale.
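As an illustration of the checksum idea (not the production implementation), a shadow comparison can be made order-independent by hashing each result row and combining the hashes with a commutative operation, so the two engines can return rows in different orders and still be compared cheaply:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Order-independent checksum: hash each serialized row, combine with a
// wrapping add so row order does not matter. Illustrative sketch only.
uint64_t resultChecksum(const std::vector<std::string>& rows) {
  uint64_t checksum = 0;
  std::hash<std::string> hasher;
  for (const auto& row : rows) {
    checksum += hasher(row);  // unsigned overflow wraps, which is fine here
  }
  return checksum;
}

// Shadow execution: run the query on both paths, flag any divergence.
bool shadowMatches(const std::vector<std::string>& javaRows,
                   const std::vector<std::string>& nativeRows) {
  return resultChecksum(javaRows) == resultChecksum(nativeRows);
}
```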

Routing was a key piece. Only queries fully supported by Velox ran natively. Fingerprinting and structural checks ensured the routing was safe. Everything else used the original Java path.

Debugging changed as well. Native crashes do not produce friendly exceptions. They produce memory offsets and symbol traces. It took time to adjust but it revealed issues the Java engine had masked.

Once the system stabilized, the improvements were clear. Lower CPU usage. Better stability. More predictable performance. Native execution behaved exactly as expected in production.

The Road Ahead

Native execution is becoming the standard path for engines that operate at scale. Velox strengthens existing systems rather than replacing them. It offers a shared compute layer that multiple engines can adopt.

Standards like Substrait could make this even more flexible. A plan created by one engine might run inside another using a common execution layer. This idea is no longer theoretical.

Other engines point in the same direction: Photon, DuckDB, C++-based ETL systems. The pattern is consistent. Engines gain significant benefits when execution happens close to the hardware.

This shift is not about chasing micro-optimizations. It is about building systems that behave predictably on increasingly parallel hardware.

Conclusion

Native execution is not a trend. It is a response to the demands of modern workloads. The JVM powered the first generation of analytical systems and served them well. But C++ engines like Velox are better aligned with how CPUs and memory behave today. The difference shows up in cost, stability and predictability.

The future of analytical engines will be shaped by execution layers that operate close to the hardware and respect its constraints. Velox is one of the tools helping that shift happen.

As engines converge on shared native execution layers like Velox, we may eventually see a world where query plans travel between engines the way container images travel between clouds.


Written by hitarth | Passionate about building fast, reliable, and scalable data infrastructure powering modern analytics.
Published by HackerNoon on 2025/12/11