Rust + OpenGL: Rendering 250,000 Dynamic 3D Entities at 50 FPS on a Single CPU Thread

Hi! I'll start with a phrase:

The entire history of IT has been an attempt to hide from hardware behind layers of conceptual complexity. Now we're at a point where IT is returning to hardware.

To understand what I'm talking about, I recommend reading my first article. Here's the link: https://hackernoon.com/i-built-a-functional-data-oriented-3d-game-framework-in-rust-for-low-end-pcs?embedable=true

In short, I created an open-source game engine, Light Acorn, with a unique architecture designed for older PCs and approachable for beginners in Rust game development. The engine is based on Macroquad and Bevy ECS (there is also tobj for loading 3D models from Blender).

In the previous article, I described Light Acorn's features and the issues with GPU instancing, and asked for help implementing it for old hardware. But while the first article was being moderated (it was published on March 23), I wasn't sitting idle: I spent the time solving the problem that was holding back the engine's development, draw calls. I worked on it for one week (that's how long the article was under moderation).

The result: 13,000+ active 3D entities at 60 FPS on an old 2013 laptop with 35% CPU usage.
Compare with the previous result: 1,300+ entities running at ~28% CPU and 26 FPS.

To give you some context on the hardware:

- CPU: i3-3217U, 2 cores / 4 threads, 1.8 GHz max
- RAM: 6 GB DDR3 (1600 MT/s vs modern DDR5 at ~4800 MT/s)
- Storage: 720 GB HDD (5400 RPM)
- GPU: GT 720M, 2 GB DDR3, ~192 CUDA cores, 64-bit memory bus

Even most modern integrated graphics outperform this setup. And yes, this machine has been my daily driver for 13 years.

For comparison, here are some approximate specifications of a modern budget laptop (2026):

- CPU: Intel Core Ultra 3 105UL, 8 cores (2 P-cores + 4 E-cores + 2 LP E-cores for ultra-low power) / 10 threads, 4.2 GHz max
- RAM: 8 or 16 GB LPDDR5 (6400 MT/s)
- Storage: 512 GB NVMe Gen 4 SSD
- GPU: integrated Intel Arc Graphics (the GT 720M is still much weaker)

So, a modern budget laptop is capable of a lot compared to mine. BUT I was able to make the X550CC feel worthy in its owner's hands.

The most shocking thing for me is not that I squeezed the maximum out of the minimum, but the conditions under which I did it while still achieving the desired result.

This article will contain a lot of benchmarks, and not only on old hardware! I hope you find this interesting, so I'll start with a story.
The Solution

Vertex batching

I already said that the problem is draw calls, so I started with the obvious approach: vertex batching. Take all the acorns, merge them into one giant mesh, and draw it with one draw call.

I also reworked the drawing code. Instead of one function that calculates one matrix and then draws one mesh, I split it into two: the first function calculates the matrices of all entities and fills a buffer, and the second function receives this buffer and draws it.

It was easy to implement, but I quickly ran into the limits of OpenGL ES 2.0:

The result: ~2,250 entities at 28 FPS.

I hit the 96 KB buffer limit of OpenGL ES 2.0: 2,379 vertices, 38 acorns per batch. Trying to put more acorns into one draw call produced this message in the console:

    geometry() exceeded max drawcall size, clamping

It wasn't enough. I wanted more.

In any case, I will keep vertex batching in the engine for people who can only afford OpenGL ES 2.0.

GPU instancing

For reference: GPU instancing is when you tell the video card to render your model in multiple positions with a single command (1 draw call). The CPU simply passes a buffer of your 3D model positions to the GPU.

At first, I avoided GLSL shaders. I thought vertex batching would be enough. But eventually, I realized I'd have to move mountains, get down to miniquad, and figure out how to write shaders.

I studied GLSL and realized that a shader is simply a set of instructions for the GPU. Shaders are divided into vertex shaders (where to place the points) and fragment shaders (how to color the pixels).
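As a rough illustration of the vertex batching step described earlier in this section, the sketch below merges translated copies of one mesh into batches that each stay under a vertex limit, so every batch can go out as one draw call. The names `batch_vertices` and `max_vertices` are assumptions made for this example and are not taken from Light Acorn or Macroquad; a real mesh would also carry UVs, colors, and indices.

```rust
/// A vertex position only; real meshes also carry UVs, colors and normals.
type Vertex = [f32; 3];

/// Merge translated copies of `mesh` into batches small enough for one draw call each.
/// `max_vertices` stands in for the per-draw-call limit (hypothetical parameter).
fn batch_vertices(mesh: &[Vertex], offsets: &[Vertex], max_vertices: usize) -> Vec<Vec<Vertex>> {
    // How many whole copies of the mesh fit into one batch (always at least one).
    let per_batch = (max_vertices / mesh.len().max(1)).max(1);
    offsets
        .chunks(per_batch)
        .map(|chunk| {
            let mut batch = Vec::with_capacity(chunk.len() * mesh.len());
            for o in chunk {
                for v in mesh {
                    // Bake the entity's position into the vertex on the CPU.
                    batch.push([v[0] + o[0], v[1] + o[1], v[2] + o[2]]);
                }
            }
            batch
        })
        .collect()
}
```

With a 3-vertex mesh, 5 copies, and a limit of 7 vertices, this yields three batches (two copies, two copies, one copy). The draw-call count still grows with the number of entities, which is the ceiling that instancing removes by sending the mesh once.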
The whole point is that the GPU waits for data from the CPU.

I didn't wait for inspiration to write my first shader, so I immediately moved on to practicing in my engine. Problems arose.

The first problem was that the Macroquad mesh was not suitable for instancing. This meant storing raw mesh data to pass to miniquad. I created the following structure:

    pub struct AcornPreparedMesh {
        pub v_buffer: mq::BufferId,
        pub i_buffer: mq::BufferId,
        pub index_count: i32,
        pub texture: mq::TextureId,
    }

And also the structure that stores all these meshes:

    pub struct AcornMeshInstanceDB {
        pub gpu_meshes: Vec<AcornPreparedMesh>,
    }

Next, I needed to write the pipeline. But that was easy too. To ensure that the 96 KB buffer was not a limitation, I chose OpenGL 3.3, where instancing is part of the standard.

And the last thing left was to write the shaders:

    pub const VERTEX_SHADER_SRC: &str = r#"#version 330 core
    layout(location = 0) in vec3 aPos;
    layout(location = 1) in vec2 aTex;
    layout(location = 2) in vec4 aCol;
    layout(location = 3) in vec4 aNormal;
    layout(location = 4) in mat4 aModelMat; // per-instance model matrix

    uniform mat4 uViewProjection;

    out vec2 vTex;
    out vec4 vCol;

    void main() {
        vTex = aTex;
        vCol = aCol;
        gl_Position = uViewProjection * aModelMat * vec4(aPos, 1.0);
    }
    "#;

    pub const FRAGMENT_SHADER_SRC: &str = r#"#version 330 core
    in vec2 vTex;
    in vec4 vCol;

    out vec4 fragColor;

    uniform sampler2D uTexture;

    void main() {
        vec3 linearColor = vCol.rgb / 255.0; // normalize Byte4 color to float
        vec3 finalColor = pow(linearColor, vec3(0.65));
        fragColor = vec4(finalColor, 1.0);
    }
    "#;

Note the linearColor = vCol.rgb / 255.0. This was necessary to normalize the Byte4 data type to Float. Otherwise, the acorn colors would be blown out, turning black into white and brown into yellow.

Here I've shown only a small portion of the code that allowed me to run 13,000 acorns, but the shaders are the core of that result. In any case, you can view the entire code in the repository.

I confess, I used AI for mentoring and brainstorming, then adapted the code to fit Light Acorn's architecture.

Edge of The Acorn

After that, I became interested in what the limit of my engine was.

In DEBUG mode on the old laptop, I got 3,000 entities at 30 FPS. But what is the limit in RELEASE mode?

The limit: ~20,000 entities at 50 FPS at the same 30-35% CPU load.

https://youtu.be/hZNqd5Cipxk?embedable=true

I didn't increase the number of acorns further, since 50 FPS is already the lowest comfortable threshold for a game.

My confession

I could stop there, but… I admit that all this time I had a mid-range PC at home. Yes, I am creating an engine for weak PCs even though I own a Ryzen 7 5700X, a GTX 1070, and 32 GB of DDR4 at 3600 MT/s. And I became interested in how my engine would behave on modern hardware.

First, let's compare my two friends. Although, to be honest, that's unfair.
I could just take the GTX 1070 and compare it to my laptop:

- GTX 1070: 6.5 TFLOPS
- i3-3217U: ~28-30 GFLOPS
- GT 720M: 0.3 TFLOPS

In raw numbers, 6,500 GFLOPS versus roughly 330 GFLOPS is about a 20-fold difference, and the GTX 1070 alone is already more powerful than my entire laptop. The numbers themselves are already frightening, so let's move on to the benchmarks.

The result: 64,206 entities at 50 FPS and 6% CPU load in DEBUG mode.

The only thing I'll mention is that in DEBUG the FPS drops to 5-7 while the number of acorns is being changed. Next come the results in RELEASE mode.

The result: 65,280 entities at 60 FPS and 6% CPU load in RELEASE mode.

This is where things get interesting… Instead of boring you with photos, just watch the benchmark video:

https://youtu.be/3A-OBNc7mZY?embedable=true

Overall benchmark results:

- 65,280 entities at 60 FPS, 6% CPU usage
- 150,930 entities at 60 FPS, 6% CPU usage
- 200,166 entities at 60 FPS, 6% CPU usage
- 250,901 entities at 50 FPS, 6% CPU usage

The facts:

- The FPS decrease is linear.
- The Ryzen 7 5700X hasn't reached its full potential; it's running on one thread (it has 8 cores / 16 threads and 4.6 GHz). It is literally doing pull-ups on its little finger 50 times per second.
- The GTX 1070 is only running at ~60-70% load (1920 CUDA cores), and only ~500 MB of its memory is occupied.
The only downside is the 3.9 GB of RAM occupied. For me, that's a huge number, but I think it's a fair trade-off for 250,000 active 3D entities.

However, if you think about it, 250 thousand is the population of a large city, handled in 1 thread.

The Secret of Optimization

Guess the secret to my optimization? I simply did nothing, and the processor only did useful work. Seriously!

I'll lay out all the facts:

- All transformation matrices (Translation, Scale, Rotation) are calculated exclusively on the processor in 1 thread.
- This runs on OpenGL 3.3 (2010), which only uses 1 CPU thread. No Metal, no Vulkan, just the full power of 2010!
- With Light Acorn, the GPU becomes the bottleneck before the CPU does.
- The 20,000-acorn benchmark heated the old laptop to 70 degrees Celsius, even though the thermal paste was dry (I seriously haven't changed the thermal paste in 13 years).
- Everything was done on the REACORN architecture, where you can change the order of functions at runtime (this means the compiler cannot do aggressive function inlining).
- Each acorn rotates and is an independent unit. This means the processor must run through the list of all acorns and rotate each by 0.1 degrees per frame. (The processor does this 50 times per second!)
- No frustum culling: objects behind the camera are drawn.
- No backface culling: the back sides of triangles are drawn.
- No occlusion culling: objects behind others are drawn.
- No LOD: distant objects are drawn with full geometry even if they shrink to 3 pixels.
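The per-frame CPU work listed above (walk every acorn, rotate it slightly, rebuild its matrix) can be sketched as follows. This is a simplified, std-only illustration and not the engine's code: the `Acorn` struct, the rotation around the Y axis only, and the uniform scale are assumptions chosen to keep the example short.

```rust
/// Per-entity state kept on the CPU (hypothetical layout).
struct Acorn {
    position: [f32; 3],
    scale: f32,
    angle_deg: f32, // rotation around the Y axis
}

/// Rotate every acorn by `step_deg` and rebuild its model matrix (T * R * S).
/// Matrices are written column-major into `out`, 16 floats per entity.
fn update_matrices(acorns: &mut [Acorn], step_deg: f32, out: &mut Vec<f32>) {
    out.clear(); // reuse the allocation from the previous frame
    for a in acorns.iter_mut() {
        a.angle_deg = (a.angle_deg + step_deg) % 360.0;
        let (sin, cos) = a.angle_deg.to_radians().sin_cos();
        let s = a.scale;
        let [x, y, z] = a.position;
        out.extend_from_slice(&[
            cos * s, 0.0, -sin * s, 0.0, // column 0: rotated, scaled X axis
            0.0, s, 0.0, 0.0,            // column 1: scaled Y axis
            sin * s, 0.0, cos * s, 0.0,  // column 2: rotated, scaled Z axis
            x, y, z, 1.0,                // column 3: translation
        ]);
    }
}
```

The resulting `out` buffer is the kind of per-instance data that would feed the `aModelMat` attribute of an instanced draw. Reusing one `Vec` each frame avoids reallocating it.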
I also had the Bevy schedule enabled every frame, but there wasn't a single system in it, meaning it was running idle (I forgot to disable it during benchmarks).

If we calculate the "Acorn Power": 250,901 acorns * 50 FPS * 3 matrices (Translation, Rotation, Scale) = 37,635,150 matrix calculations of pure power per second.

Yes, Light Acorn achieved these results with plain brute force. Also, Light Acorn made industry standards like LOD and culling optional (only in this case)!

The Acorn philosophy

You shouldn't compare Light Acorn to benchmarks on Unity, Bevy, or Unreal Engine. It's not because Light Acorn is better or worse, but because it has a different philosophy and different requirements. I appreciate the work of other engines.
Light Acorn's philosophy:

New hardware generations should open doors to greater possibilities, not just serve as a ticket to run increasingly bloated software.

I created this engine to bring simple graphics to life, so I could finally play my own RTS games on this laptop. I'm not trying to make this engine kill Unity or Unreal Engine. I just like the Rust language and want to play.

But even so, Light Acorn demonstrates that:

Good code and architecture no longer require optimization. At the very least, they make it optional.

The main advantage of Light Acorn

You might think that Light Acorn follows the formula:

High performance = High entry barrier

But I already said that it has a low entry barrier. Light Acorn's formula is the opposite:

High performance = Low entry barrier

In the first article, I already admitted that I'm still a beginner in Rust, partly because the language is complex. Light Acorn itself is built on similarly simple concepts like vectors and loops.

Here is the kernel code:

    pub type AcornFunction = fn(&mut World, &mut AcornZoneContext, &mut AcornGlobalContext);

    /// Location is a group of functions
    pub struct Location {
        pub functions: Vec<AcornFunction>,
    }

    /// Zone is a group of Locations
    pub struct Zone {
        pub locations: Vec<Location>,
    }

This code is the quintessence of Acorn.
There's nothing superfluous here: no unsafe blocks, smart pointers, or macros.

For example, to draw a circle in the game, you need to write a function:

    fn acorn_example_draw_circle(
        _world: &mut World,
        _zones: &mut AcornZoneContext,
        _context: &mut AcornGlobalContext
    ) {
        // Draw a blue circle in the center of the screen
        draw_circle(
            screen_width() / 2.0,
            screen_height() / 2.0,
            60.0,
            BLUE
        )
    }

And register it:

    let after_2d_zone = Zone::default()
        .with_locations(vec![
            Location::from_fn_vec(vec![
                acorn_example_draw_circle,
            ]),
        ]);

Therefore, the official requirements to start writing in Light Acorn are 5 chapters of the Rust book. Specifically:

- Variables.
- Data types.
- Functions.
- Structs.
- ECS (it is enough to know how SQL works, since ECS follows the same principle).

The last Thing

And then there is the REACORN architecture, which allows logic to be enabled and disabled at runtime without overhead:

- Disabling certain functions in your logic costs the CPU nothing, because they are removed from the execution flow rather than skipped via a true/false flag.
- This can be used for dynamic optimization or unique game logic.
- You don't need to recompile the project to test how the world behaves without gravity or friction.
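A minimal sketch of the idea above, assuming a much simpler function signature than the real `AcornFunction`: logic lives in a `Vec` of plain function pointers, and disabling a piece of logic means removing its pointer, so the per-frame loop contains no enable/disable check. `State`, `StepFn`, `gravity`, `friction`, and `run_frame` are hypothetical names used only for illustration.

```rust
/// Stand-in for the engine's shared state (hypothetical; the real functions
/// receive the ECS World and two context structs).
struct State {
    velocity_y: f32,
}

/// A piece of game logic is just a function pointer.
type StepFn = fn(&mut State);

fn gravity(s: &mut State) {
    s.velocity_y -= 9.8;
}

fn friction(s: &mut State) {
    s.velocity_y *= 0.5;
}

/// Run every registered function once, in list order.
/// There is no per-function "enabled" flag to test inside the loop.
fn run_frame(functions: &[StepFn], state: &mut State) {
    for &f in functions {
        f(state);
    }
}
```

Starting from `vec![gravity as StepFn, friction as StepFn]`, calling `functions.remove(0)` between frames drops `gravity`, and the next `run_frame` only runs `friction`. Reordering or re-inserting entries works the same way.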
Also, this is a victory over branch misprediction!

The Future of Light Acorn

The engine's power is astounding, and its potential even more so. However, Light Acorn is still very much in its infancy. The engine still lacks usable tools, and GPU instancing is still experimental. Anyway, the engine's foundation has already been laid.

For now, Light Acorn is a "Proof of Concept".

Future plans:

- Taffy integration for the UI.
- Optional shaders for shadows and lighting.
- The ability to create not only games but also apps.
- The ability to port to phones.
- Tools for implementing multithreading (in fact, it is already possible, but only manually via std::thread or Rayon).

If you think about it, I've been working on the engine itself for a month. In five hours, I was able to implement vertex batching and GPU instancing. So, give me one more month, and Light Acorn will be suitable for simple games.

My Gratitude

I often say I created this engine alone, but that's not entirely true. Light Acorn is the result of a vast ecosystem of brilliant tools and communities.
My deepest gratitude goes to:

- Fedor Logachev for Miniquad and Macroquad. I'm grateful for the simplicity of these libraries.
- Carter Anderson and the Bevy contributors. Their ECS is the easiest thing I've learned in programming, and the ability to filter queries using zero-sized types is a killer feature.
- Will Usher and the contributors of the tobj library for the ability to load 3D models. When I first managed to load a cube into Macroquad, I was very happy.
- Anticapitalista and the antiX community for the wonderful OS. I even installed antiX on the Ryzen 7 5700X.
- Blender Foundation and its contributors for the opportunity to create the engine's logo.
- The AI assistants who helped me create the engine.
- HackerNoon and its editors for believing in my work and publishing my first article.

In conclusion

Light Acorn has enormous potential for growth.
The fact that a single CPU thread can handle a huge number of entities means the rest of the CPU is free for AI, physics, and complex logic.

I can say that 250 thousand entities is not the limit of possibilities, but the beginning.

Light Acorn has the potential to be used for:

- Scientific simulations on a home PC.
- Large-scale real-time strategy games with simple graphics.
- Stress testing of hardware throughput.
- Prototypes and game jams.
- Unique gameplay built on changing the order of code execution at runtime without overhead.
- Green IT.

Overall, Light Acorn is a unique combination of:

- Low entry threshold.
- Runtime dynamics.
- Extreme performance.
- Minimalism without heavy graphics.

Light Acorn isn't just software; it's an attempt to restore programmers' sense of the machine, which has been lost behind stable frameworks.

How to try the power of Light Acorn

I'm sure you'd like to try Light Acorn too. I've put together a guide.
Clone the repository and run:

    git clone https://github.com/Veyyr3/Light_Acorn.git
    cd Light_Acorn
    git checkout game_experimental
    cargo run --release

Control hints:

- WASD: Move the camera.
- Mouse: Look around.
- E / Q: Increase the acorn grid.
- R: Decrease the grid.
- Z: Hide the cursor.
- C: Show the cursor.

If a crash occurs during a test, try increasing the MAX_ACORNS value in main.rs.

At the end

By the way, update 0.2.0 was recently released, which made the API more user-friendly.

If you need the GitHub repository itself, here is the link: https://github.com/Veyyr3/Light_Acorn

If you need the clean architecture without Macroquad and tobj, the link is here: https://github.com/Veyyr3/Pure_Acorn

Thanks to everyone who read my article to the end!