Cloud-based content moderation is a privacy nightmare. Sending screenshots to a server just to check for safety? That's a non-starter. My hypothesis was simple: modern mobile hardware is powerful enough to support a "Guardian AI" that sees the screen and redacts sensitive info (nudity, violence, private text) in milliseconds, strictly on-device, using a hybrid inference strategy. I called it ScreenSafe.

But the journey from concept to reality revealed a chaotic ecosystem of immature tools and hostile operating system constraints. Here is the architectural breakdown of how I built ScreenSafe, the "integration hell" I survived, and why local AI is still a frontier battleground.

Watch the full technical breakdown on YouTube: https://youtu.be/X_9y1IFmMC0

1. The Stack: Why Cactus?

I needed three things: guaranteed privacy, zero latency (in theory), and offline capability. I chose Cactus (specifically cactus-react-native). Unlike cloud APIs, Cactus acts as a high-performance C++ wrapper around low-level inference graphs. It utilizes the device's NPU/GPU via a C++ core, exposed to React Native through JNI (Android) and Objective-C++ (iOS).

The goal: capture the screen buffer at 60 FPS -> pass it to the AI -> draw a redaction overlay. Zero network calls.

We implemented a Two-Stage Pipeline to solve hallucination issues:

- Stage 1 (Vision): Uses lfm2-vl-450m to generate a descriptive security analysis of the image.
- Stage 2 (Logic): Uses qwen3-0.6 to analyze that description and extract structured JSON data regarding PII (credit cards, SSNs, faces).
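To make the hand-off concrete, here is a minimal sketch of the two-stage flow in TypeScript. The VisionModel and TextModel interfaces are hypothetical stand-ins for the cactus-react-native bindings (the real API surface differs), but the data flow is the one described above: the vision model produces a plain-text description, and the text model turns that into structured JSON.

```typescript
// Hypothetical model interfaces standing in for the cactus-react-native
// bindings; the real API differs, but the two-stage flow is the same.
interface VisionModel {
  // Stage 1 (lfm2-vl-450m): image + prompt -> free-text security analysis
  complete(args: { image: string; prompt: string }): Promise<string>;
}

interface TextModel {
  // Stage 2 (qwen3-0.6): prompt -> (hopefully) a JSON string
  complete(args: { prompt: string }): Promise<string>;
}

export interface PIIResult {
  hasPII: boolean;
  types: string[]; // e.g. ["credit_card", "ssn", "face"]
}

// Stage 1 (Vision): describe the screenshot in plain text.
async function describeImage(vision: VisionModel, imagePath: string): Promise<string> {
  return vision.complete({
    image: imagePath,
    prompt: 'Describe any sensitive or personal content visible in this screenshot.',
  });
}

// Stage 2 (Logic): turn that description into structured JSON.
async function extractPII(text: TextModel, description: string): Promise<PIIResult> {
  const raw = await text.complete({
    prompt: `Given this analysis, return only JSON with "hasPII" (boolean) and "types" (string[]):\n${description}`,
  });
  return JSON.parse(raw) as PIIResult;
}

export async function analyzeScreenshot(
  vision: VisionModel,
  text: TextModel,
  imagePath: string,
): Promise<PIIResult> {
  const description = await describeImage(vision, imagePath);
  return extractPII(text, description);
}
```

The point of the split is that the small text model never reasons over raw pixels, only over a description it was explicitly given, which is how the two-stage design curbs hallucinations.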
2. The Build System Quagmire

The first barrier wasn't algorithmic; it was logistical. Integrating a modern C++ library into React Native exposed the fragility of cross-platform build systems. My git history is a graveyard of "fix build" commits.

The Android NDK Deadlock

React Native relies on older NDK versions (often pinned to r21/r23) for Hermes. Cactus, running modern LLMs with complex tensor operations, requires modern C++ standards (C++20) and a newer NDK (r25+). This created a Dependency Deadlock:

- Choose the old NDK: Cactus fails with syntax errors.
- Choose the new NDK: React Native fails with ABI incompatibilities.

I faced constant linker failures, specifically undefined symbols like std::__ndk1::__shared_weak_count. This is the hallmark of a libc++ version mismatch: the linker was trying to merge object files compiled against different versions of the C++ standard library.

The Fix: a surgical intervention in local.properties and build.gradle to force specific side-by-side NDK installations, effectively bypassing the package manager's safety checks. Open PR: github.com/cactus-compute/cactus-react-native/pull/13

3. The iOS Memory Ceiling (The 120MB Wall)

Once the app built, I hit the laws of physics on iOS. The requirement was simple: share an image from Photos -> redact it in ScreenSafe. This requires a Share Extension.

However, iOS enforces a hard memory limit on extensions, often cited as low as 120MB. If you exceed it, the kernel's Jetsam daemon sends a SIGKILL and the app vanishes.

The Physics of LLMs

- Model Weights (Q4): ~1.2 GB
- React Native Overhead: ~40 MB
- Available RAM: 120 MB

You cannot fit a 1.2GB model into a 120MB container.

The "Courier" Pattern

I had to re-architect. The Share Extension could not perform the inference; it could only serve as a courier.

1. Data Staging: The Extension writes the image to an App Group (a shared file container).
2. Signal: It flags hasPendingRedaction = true in UserDefaults.
3. Deep Link: It executes screensafe://process, forcing the OS to switch to the main app.

The main app, running in the foreground, has access to the full 6GB+ of device RAM and handles the heavy lifting.
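On the React Native side, the main app only has to notice the hand-off and pick up the staged file. Here is a rough sketch of that receiving end; the SharedContainer native module is a hypothetical bridge over the App Group and the shared UserDefaults flag (the real bridge lives in native code), while the Linking calls are standard React Native.

```typescript
import { Linking, NativeModules } from 'react-native';

// Hypothetical native module wrapping the App Group container and the
// hasPendingRedaction flag set by the Share Extension.
const { SharedContainer } = NativeModules as {
  SharedContainer: {
    hasPendingRedaction(): Promise<boolean>;
    pendingImagePath(): Promise<string>;
    clearPendingRedaction(): Promise<void>;
  };
};

// Invoked when the extension fires the screensafe://process deep link.
async function handleDeepLink(url: string): Promise<void> {
  if (!url.startsWith('screensafe://process')) return;

  // Only proceed if the extension actually staged an image for us.
  if (!(await SharedContainer.hasPendingRedaction())) return;

  const imagePath = await SharedContainer.pendingImagePath();
  await SharedContainer.clearPendingRedaction();

  // The main app has the full device RAM, so inference happens here,
  // e.g. by feeding imagePath into the two-stage pipeline sketched earlier.
  console.log('Processing staged image:', imagePath);
}

// Handle both the cold-start URL and links received while already running.
export function registerCourierHandler(): void {
  Linking.getInitialURL().then((url) => url && handleDeepLink(url));
  Linking.addEventListener('url', ({ url }) => handleDeepLink(url));
}
```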
4. Android IPC: The 1MB Limit

While iOS struggled with static memory, Android struggled with moving data. Android uses Binder for IPC, and the Binder transaction buffer is strictly limited to 1MB per process.

A standard screenshot (1080x2400) is roughly 10MB uncompressed. When I tried to pass this bitmap via an Intent, the app crashed instantly with TransactionTooLargeException.

The Solution: stop passing data; pass references.

1. Write the bitmap to internal storage.
2. Pass a content:// URI string (bytes in size) via the Intent.
3. The receiving Activity streams the data from the URI.

5. The Reality of "Real-Time"

Synchronizing Two Brains (Vision vs. Text)

Multimodal means processing pixels and text. On a server, these run in parallel. On a phone, they fight for the same NPU.

We hit a classic race condition: the vision encoder was fast (detecting an image), but the text decoder was slow.

- Scenario: Vision says "Safe." Text is still thinking.
- The Risk: Do we block the screen? If we wait, the UI stutters (latency). If we don't, we risk showing a harmful caption.

I had to engineer a complex state machine to manage these async streams, ensuring we didn't lock the JS thread while the C++ backend was crunching numbers:

- Dynamic Context Sizing: Implemented checkDeviceMemory to detect available RAM and dynamically set the model context window:
  - < 3GB RAM → 256 tokens (Safe mode)
  - 3-6GB RAM → 512 tokens (Standard mode)
  - > 6GB RAM → 1024 tokens (High-performance mode)
- Timeout Protection: Added a 15-second timeout to the local text model inference. If it hangs (common on emulators), it fails gracefully instead of crashing the app, showing a "Limited analysis" warning.

PII Detection Logic

- Logic Update: We prioritized the presence of types. If the types array is not empty, hasPII is forced to true, overriding the LLM's boolean flag.
- JSON Repair: The local LLM (qwen3-0.6) was returning conversational <think> blocks and sometimes malformed JSON, causing JSON.parse to fail. We enhanced the JSON cleaning regex to handle more edge cases from the model output (e.g., trailing commas, markdown blocks). A simplified sketch of this cleanup follows below.
- Cloud Fallback: We verified that the 15s timeout correctly triggers the "Enable Cloud Mode" suggestion for users on low-end devices.

Hybrid Cloud Inference Integration

We confirmed that the cloud API (https://dspy-proxy.onrender.com) is functional:

- The /configure endpoint works.
- The /register endpoint successfully registered the pii_detection signature.
- The /predict endpoint returns valid JSON with PII analysis.

Furthermore, we added logic to catch timeouts and automatically retry the request (waking the server). If the cloud service remains unavailable, the app gracefully falls back to the local analysis result without crashing.
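As referenced in the JSON Repair note above, here is a simplified sketch of that cleanup and the hasPII override. The regexes shipped in ScreenSafe cover more edge cases; this just captures the core idea: strip <think> blocks and markdown fences, drop trailing commas, then force hasPII to true whenever the types array is non-empty.

```typescript
export interface PIIAnalysis {
  hasPII: boolean;
  types: string[];
}

// Strip the conversational noise the local model wraps around its JSON.
// Simplified version of the cleaning step; the shipped regexes handle
// more edge cases.
function cleanModelJSON(raw: string): string {
  return raw
    .replace(/<think>[\s\S]*?<\/think>/g, '') // reasoning blocks
    .replace(/`{3}(?:json)?/g, '')            // markdown code fences
    .replace(/,\s*([}\]])/g, '$1')            // trailing commas
    .trim();
}

// Parse the model output and apply the override rule: if the model listed
// any PII types, hasPII is true regardless of its own boolean flag.
export function parsePIIAnalysis(raw: string): PIIAnalysis {
  let parsed: Partial<PIIAnalysis>;
  try {
    parsed = JSON.parse(cleanModelJSON(raw));
  } catch {
    // Malformed beyond repair: report nothing found rather than crash, and
    // let the caller surface the "Limited analysis" warning.
    return { hasPII: false, types: [] };
  }

  const types = Array.isArray(parsed.types) ? parsed.types : [];
  return {
    types,
    hasPII: types.length > 0 || Boolean(parsed.hasPII),
  };
}
```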
It Actually Works

We solved the crashes, but we couldn't solve the latency. Despite the build breaks and the memory wars, we shipped it.

- Latency: 30ms - 100ms (real-time).
- Privacy: 100% on-device.
- Cost: $0 (excluding cloud; infinite scalability).

We proved that you can run high-fidelity AI on mobile if you're willing to fight the memory limits and patch the build tools. However, the wait is an eternity in mobile UX. This is the "frustration" of local AI: the gap between the instant feel of cloud APIs (which hide latency behind network states) and the heavy feel of a device physically heating up as it crunches tensors.

6. The "Antigravity" Companion

Debugging a neural net is like debugging a black box; you can't step through the decision-making. I relied heavily on Antigravity to iterate on system prompts and fix hallucinations where the model thought a restaurant menu was "toxic text." I also used the dspy-proxy to streamline some of these interactions and test model behaviors before deploying to the constrained mobile environment.

Conclusion

Building ScreenSafe proved that local, private AI is possible, but it requires you to be a systems architect, a kernel hacker, and a UI designer simultaneously. Until OS vendors treat "Model Inference" as a first-class citizen with dedicated memory pools, we will continue hacking build scripts and passing files through backdoors just to keep data safe.

Resources & Code

If you want to dig into the code or the proxy architecture I used to prototype the logic:

- ScreenSafe Repo: github.com/aryaminus/screensafe
- DSPy Proxy: github.com/aryaminus/dspy-proxy
- Watch the breakdown: YouTube Video (https://youtu.be/X_9y1IFmMC0)

Liked this post? I'm building more things that break (and fixing them). Follow me on Twitter or check out my portfolio.