I did not build this because I wanted a "smart entertainment platform." I built it because I was tired of getting up.

My setup was simple: a Windows PC connected to a TV over HDMI, local movie files, and a normal media player. No streaming. No media server. No cloud sync. Just files on disk and direct playback. That part was fine.

The annoying part started after the movie began. Pause the movie? Reach for the keyboard. Skip forward? Same thing. Need subtitles? Back to the PC. Want the next episode? Back again. When watching from bed, every pause, skip, or next command meant getting up, walking to the computer, using the keyboard, and then getting back into bed. The playback itself was simple, just local files and a media player, but the experience around it still behaved like desktop computing. It did not feel like a home cinema.

Of course, there are existing ways to deal with this. Wireless keyboards, air mice, and remote control apps can all help. But they still treat the experience like operating a computer: you are still moving a cursor, navigating menus, or pressing keys, interacting with software instead of simply watching the movie. In a real cinema experience, the technology disappears. You focus on the movie, not on the interface.

So I started with a very practical question: could a Windows PC connected to a TV be less annoying to control from across the room?

That question eventually grew into a standalone Windows application I now call Smart Home Cinema – Voice Control. But the interesting part came earlier than the application itself.

At first, I kept running into the usual ideas: playlists, metadata, watched-state tracking, library logic, all the normal things you would expect in a media app. But that was not really what I was trying to build. I was not trying to manage a whole media library. I just wanted a few voice commands that would work in a simple, reliable way with local files.
So I started looking at the problem from a different angle. What if the system did not need to understand the whole folder at all? What if it just used the first video file it found there? What if Play Movie, Next Movie, and the rest all followed that same rule?

The more I thought about it, the more that approach made sense. It was simple, visible, and easier to trust. The more layers I added (playlists, metadata, tracking, library logic), the more places there were for things to go wrong. I wanted something more predictable, with fewer moving parts. That idea ended up becoming the basis for the whole system. Later, I started calling it the First File Rule.

## The First File Rule

The system always operates on the first video file in the Movies folder. That means the folder itself becomes the queue.

In C++, the core logic ended up being very small:

```cpp
// conceptual example
std::string GetFirstMovie(const std::string& folder) {
    auto files = ListFiles(folder);
    for (auto& file : files) {
        if (IsSupportedVideo(file)) return file;
    }
    return "";
}
```

This is a simplified version; the real implementation includes additional checks and handling for edge cases.

It looks trivial, but that little function ended up carrying a lot of the system. I did not need internal playback state. I did not need a database. I did not need to synchronize a library view with the file system. The file system itself became the source of truth. That one decision shaped almost everything else.

It also worked naturally for TV episodes, because episodes are usually already ordered by filename. And if I wanted a custom order for movies, I could just prefix the filenames numerically and let the same rule do the rest.

## The First Working Pipeline

I already had an Amazon Alexa speaker and TriggerCMD installed on my PC for unrelated lightweight automations.
So the first playback pipeline was obvious enough:

Alexa → TriggerCMD → local BAT/PS1 script → media player

At that point, I was not trying to build a product. I was trying to remove very specific annoyances from my own setup. The first commands were as minimal as possible:

- Play Movie
- Pause Movie
- Stop Movie

That was enough to prove the concept. At the time, I was mainly using PotPlayer, so the first generation of the system was built around it. And that is where the first real engineering problem showed up.

## The First Generation Was Script-Based

Before I rebuilt anything in C++, the system was already working as a collection of BAT and PowerShell scripts. That matters because the architecture did not start in C++. C++ came later.

Even the script version was already doing real work: selecting the current movie, moving watched files out of the active folder, handling adjacent subtitles, launching the player, and starting the transition to the next file. So when I say the project started small, I do not mean it was just a rough mockup. It was already usable.

For example, in the early PowerShell version of Next Movie, the core logic already looked like this:

```powershell
# conceptual example

# get current movie
$current = Get-FirstVideoFile $moviesFolder

# move movie to watched folder
Move-File $current $watchedFolder

# move matching subtitles
Move-MatchingSubtitles $current $moviesFolder $watchedFolder

# get next movie
$next = Get-FirstVideoFile $moviesFolder
```

This was not elegant code, but it already captured the core behavior of the system. The basic idea of selecting the first file, moving it, and continuing was working from the very beginning. The folder was already the queue. The transition logic was already file-driven. And Next Movie was already more than a file operation: it was a playback transition built on the same state model.

## PotPlayer Had No Local API I Could Use

With PotPlayer, I did not have a clean local HTTP control layer like the one VLC exposes.
So the first generation had to rely on:

- simulated keyboard input
- active window detection
- focus control
- timing delays

At first, that sounds manageable. Send a key. Done. In practice, that was not the problem. The problem was making sure PotPlayer was actually the window receiving the key. That pushed me into writing helper logic around window discovery and focus handling.

In the C++ version, finding the PotPlayer window looked like this:

```cpp
HWND FindPotPlayerWindow() {
    // iterate through top-level windows
    // identify PotPlayer window based on title or class
    return hwndPotPlayer;
}
```

Then I had to force that window into the foreground:

```cpp
bool BringToForeground(HWND hwnd) {
    if (!IsWindow(hwnd)) return false;
    // attach input threads if necessary
    // restore window if minimized
    // bring window to foreground and activate it
    return true;
}
```

Only after that could I safely send a key:

```cpp
void SendPotPlayerKey(WORD vk) {
    HWND hPot = FindPotPlayerWindow();
    if (!hPot) return;
    BringToForeground(hPot);
    // small delay to ensure focus is applied
    SendKeystroke(vk);
}
```

I thought the interesting part would be the voice layer. It was not. One of the first truly annoying problems was just getting Windows focus behavior to be reliable enough that a voice command would not silently do nothing or go to the wrong place. That was one of the first moments when this stopped feeling like "a few scripts" and started feeling like real Windows automation engineering.

## Deterministic Playback Without Playlists

Once the First File Rule existed, another problem appeared immediately. If the current movie stays in the folder, then Play Movie will just reopen the same file forever. So I ended up with two different ways to handle the current file.

### 1. Delete Movie

If I was done with a movie and did not want it in the active folder anymore, the command would remove the movie and its adjacent subtitle files. At first, I made that a permanent delete. That was a bad call. It worked, but it was too aggressive.
A voice command is the wrong place to be casual about irreversible deletion. So I changed the implementation to move files to the Recycle Bin instead. That made the behavior much safer without changing the overall architecture.

### 2. Next Movie

Sometimes I wanted to keep the movie, not delete it. So I added a second command: Next Movie.

Instead of deleting the current file, it moves:

- the current movie
- its associated subtitles

into a second folder, such as Watched Movies.

But that is only part of what the command is really doing. Its actual purpose is to create a smooth transition to the next item. Once the current movie leaves the Movies folder, the next file automatically becomes the first file. The system then starts that file immediately.

So Next Movie is not just a file-management command. It is really a playback-transition command built on top of the First File Rule. There is no queue to update. There is no "now playing" index to maintain. There is no watched-state database to fix later. The system removes the current item, lets the next file become first, and starts playback.

That is why this rule ended up being much more useful than I expected. It did not just simplify selection. It simplified transitions too.

## From Single Commands to an Actual Watching Workflow

Once the first playback commands were stable, I stopped thinking in isolated actions and started thinking in terms of a full watching session. That led to commands like:

- Pause Movie
- Stop Movie
- Forward 30 Seconds
- Forward 1 Minute
- Forward 2 Minutes
- Rewind 30 Seconds
- Rewind 1 Minute
- Rewind 2 Minutes

I deliberately kept those jumps fixed. I was not trying to expose every possible player action through voice. I just wanted the commands that removed the most common reasons to go back to the PC.

Another small command turned out to be surprisingly useful in practice: Play Movie TV. Most of the time, the computer is used normally on its main monitor.
But when it is time to watch a movie, the experience should move to the television. Instead of manually switching the display output and then starting playback, this command does both steps at once: it switches the active display to the TV and launches the movie directly in full screen. Technically, it is a small feature, but in real use it removes another small piece of friction between sitting at the computer and starting a movie session from the couch or from bed.

Then one command gradually became more important than the rest.

## The Stop Everything Command

This command came from a very specific real-life annoyance. Watching from bed is easy. Ending the session is where the friction comes back. If I was already half asleep, the last thing I wanted was to get out of bed, walk to the computer, stop the player, move the display back from the TV to the monitor, shut down the PC, and then get back into bed again.

That may sound minor. For me, it was not. If I was already getting sleepy, even that short interruption was enough to break the moment. I built this command because I wanted the movie session to end as cleanly as it started. So I made a single command that performs the whole shutdown sequence in order.

In the VLC edition, the logic is basically:

```cpp
// conceptual example
void StopEverythingVlc(const Config& cfg) {
    StopPlayback();
    RestoreDisplayOutput();
    // wait for the system to settle
    WaitShortDelay();
    ShutdownSystem();
}
```

Technically, this is not the most complicated function in the system. What matters is what it removes: the end-of-session friction. Stop playback, restore the normal display path, wait for the switch to settle, then shut down the machine. For me, this ended up being one of the most useful commands in the whole system, not because it was technically impressive, but because it solved a very human annoyance cleanly.

## Subtitles Became a Problem of Their Own

Playback control turned out to be only half the problem.
Subtitles became a second engineering track almost by accident.

### Stage 1: Subliminal

My first subtitle automation used Subliminal with the older OpenSubtitles API. It actually worked. The problem was not that it failed immediately. The problem was that I did not trust it long-term. The dependency path already felt too fragile, and I did not want subtitle support to quietly degrade later. So I replaced it.

### Stage 2: OpenSubtitles API v2

I moved to my own subtitle retrieval logic integrated with OpenSubtitles API v2. Their current REST API is documented around login/authentication, token-based access, and dedicated endpoints for search and download, which made it a much better long-term fit than the older flow.

In my code, the download step eventually became an explicit command pipeline, roughly like this:

```cpp
// conceptual example
std::string cmd = BuildSubtitleCommand(
    pythonPath, scriptPath, moviesFolder, credentials
);
if (!ExecuteCommand(cmd)) {
    Log("Subtitle download failed");
    return;
}
```

That became the Download Subtitles step. The matching logic changed a few times. What mattered more was that I had moved from a layer I did not fully trust to an API path I could actually reason about.

### Stage 3: Sync Subtitles

Then I ran into the next practical problem. Sometimes the subtitle is the right subtitle, but the timing is wrong. At first, I handled this manually with a GUI sync tool. That was usable, but it broke the whole point of the project. If I still had to drag files into a window and fix timing by hand, I had not really automated the workflow. So I switched to ffsubsync through its CLI and built a batch sync command around it.

In the C++ version, the core loop looked like this:

```cpp
// conceptual example
std::string cmd = BuildSubtitleSyncCommand(
    videoPath, subtitlePath, outputPath
);
bool ok = ExecuteCommand(cmd);
```

That changed subtitle synchronization from a manual repair step into a batch operation.
### Stage 4: Clean Subtitles

After sync, the corrected subtitle file used a modified suffix like _2.srt. That created another practical issue: the player would not automatically prefer the synced file unless the filename was normalized back to the expected standard form.

So I added a third subtitle command: Clean Subtitles. Its job was to:

- create a backup folder
- move the original subtitles there
- rename the synced subtitles back to the standard .srt filename

The rename step looked like this:

```cpp
// conceptual example
std::string newStem = RemoveSyncSuffix(originalName);
fs::path newPath = moviesPath / (newStem + ".srt");
if (FileExists(newPath)) {
    ReplaceFile(originalPath, newPath);
}
```

At that point subtitles had become a full pipeline:

Download → Sync → Clean

I did not plan that subsystem in advance. It grew because real use kept exposing new friction points.

## Why I Added a Fullscreen Command Reference

As the number of commands grew, a different kind of problem showed up. The system could work perfectly and still be inconvenient if the commands themselves were hard to remember. I was gradually adding more voice actions, but there was no simple way to see them all in one place. So I added a full-screen command reference overlay.

The point was not presentation. It was practical usability. The workflow became:

- pause the movie
- say Show Commands
- the available commands appear fullscreen
- check the one I need
- say Close Commands
- continue watching

This did not solve a technical bottleneck. It solved a memory bottleneck.

In code, launching the overlay was straightforward:

```cpp
bool ShowCommandCenterOverlay() {
    std::string overlayExe = GetOverlaysDir() + "CommandCenterOverlay.exe";
    return RunExe("\"" + overlayExe + "\"");
}
```

Small feature, big usability impact.

## What I Realized Later About the Assistant Layer

I originally built the system around Alexa for one simple reason: that was the assistant hardware I already had at home. So in the beginning, I naturally thought of it as an Alexa-based setup.
Much later, after the architecture was already working, I realized that Alexa was not really the important dependency. The important dependency was the trigger bridge. Once the assistant sends a command into TriggerCMD, the rest of the system is local: script or executable, file system logic, player control, subtitle workflow, display switching, and shutdown flow.

That means the architecture is not fundamentally tied to Alexa at all. The same system can work just as well with Google Assistant or Google Nest, because both are only acting as trigger layers for the same local execution path. That was when I realized the assistant was replaceable. The local logic was the real core.

## Where VLC Changed the Architecture

Later, I created a second edition of the system for VLC. That changed the architecture more than I expected.

With PotPlayer, I had to rely on:

- window focus
- simulated keys
- UI behavior

With VLC, I could enable a local HTTP control interface at launch and use that as a much cleaner control layer.

Launching VLC with HTTP enabled looked like this:

```cpp
// conceptual example
std::string cmd = BuildVlcLaunchCommand(
    vlcPath, moviePath, httpConfig
);
```

Now I could send commands directly:

```cpp
// conceptual example
bool SendVlcCommand(const VlcConfig& cfg, const std::string& command) {
    std::string request = BuildVlcRequest(command);
    return SendHttpRequest(cfg, request);
}
```

And pause playback with something as simple as:

```cpp
SendVlcCommand(vc, "pl_pause");
```

That was much cleaner than simulated keystrokes. It also made status reporting far easier, which mattered for progress overlays and other feedback mechanisms.

## Where Automation Failed: Progress Display in PotPlayer

One of the most useful commands I wanted was a way to show how much of the movie had passed and how much remained.

For PotPlayer, I first tried to solve that precisely using:

- Tesseract OCR
- screen-region reading

I did get it working on my own machine.
But I could not make it reliable enough across different:

- resolutions
- monitor layouts
- TV setups
- UI scale conditions

That was an important lesson. A solution that works on one machine is not automatically something you can generalize cleanly.

So for PotPlayer, I dropped the idea of a fully universal OCR-based solution and kept a simpler fallback: reveal the native progress bar and let the user estimate the position visually.

For VLC, the story was different. There, I could query playback status directly and render my own overlay using internal time values. That difference is exactly why I did not try to force both players into a fake uniform abstraction.

## Why I Rebuilt Everything in C++

The BAT and PowerShell generation of the system was real. It worked. I still consider that version functionally valid. But I did not want to keep the project at that layer forever. It depended too much on script-level behavior, machine-specific tolerances, and loosely connected parts.

I wanted:

- better structure
- stronger Windows integration
- native overlays
- centralized config handling
- cleaner logging
- a more predictable executable-based architecture

So I rebuilt the system in C++. The architecture stayed the same. The implementation became more disciplined. I did not do that because the script version was worthless. I did it because the core ideas had become valuable enough that I no longer wanted them held together by a pile of scripts and assumptions. That was the point where the project started feeling like an actual Windows application instead of a useful automation experiment.

## What I Learned

The biggest lesson from this project is that voice control was never the real hard problem. The hard problem was building a local playback model that would not collapse into:

- hidden state
- fragile UI assumptions
- playlist drift
- synchronization bugs

The First File Rule solved more than I expected. It removed a whole class of problems because the folder structure itself became the state machine.
The other lesson is that local-first design only matters when it removes real friction. In this system, local-first meant:

- files stay local
- playback stays local
- logic stays local
- only the trigger path depends on the assistant/bridge layer

That was enough to create the experience I actually wanted in the first place: a Windows PC connected to a TV that behaves more like a simple home cinema and less like a workstation I have to keep walking back to.