# How I rebuilt a fragile data pipeline using modular agents and a chatbot framework — yes, seriously.

## From Messy to Modular

If you’ve ever nervously deployed a tiny change in a data pipeline, hoping nothing explodes downstream, you’re not alone. That was me. The pipeline worked, technically, but it became harder to control with every new edge case. It was like playing Jenga with production code.

Around the time LLMs and agent-based workflows started getting serious attention, I was exploring them out of curiosity — mostly to see if they could help with the kind of brittle pipeline logic I was dealing with. That’s when I stumbled across Google’s newly released Agent Development Kit (ADK). I wasn’t looking for a chatbot toolkit, but something about how it handled tasks with modular agents caught my eye. A toolkit for chatbots? Sure. But it turned out to be much more than that.

What if I could use modular, memory-aware agents to take over parts of my pipeline logic — and let the system handle coordination? That became the starting point for a small but surprisingly useful experiment.

In this post, I’ll walk you through how I redesigned part of my data pipeline using Google’s ADK. I’ll share what worked, what didn’t, and whether this modular, LLM-powered approach is actually ready for real-world use.

## What I Was Working With

Here’s the original setup — a classic three-step job:

- **Scheduler**: Triggers the pipeline daily
- **Validator**: Filters out malformed events
- **Deduplicator**: Cleans up redundant records

As with any growing system, the codebase got bloated. Every time I patched one edge case, another cropped up, and debugging felt like whack-a-mole. Eventually, I thought: what if each part of the pipeline had *just enough intelligence* to adapt independently?

## Discovering Google’s ADK

At first glance, ADK looked like a framework for chatbots built on large language models (LLMs). But once I skimmed the docs and examples, it clicked — agents aren’t limited to chat. With ADK, I could:

- Define modular agents with memory.
- Use simple tools to validate, transform, or format data.
- Chain them together in a declarative flow.
- Pass structured state between agents using keys.

It felt like infrastructure-as-code for pipeline logic — but powered by LLMs.
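To make those four points concrete, here is the smallest version of the idea I could write: one agent, one plain-Python tool, and an output key that drops the result into shared state. Treat it as a hedged sketch, not code from my repo; `count_events` and `toy_agent` are invented for illustration.

```python
# Minimal ADK sketch (pip install google-adk). A plain Python function
# becomes a tool; the agent's final answer is written to session state
# under the given output_key.
from google.adk.agents import LlmAgent

def count_events(events: list[dict]) -> dict:
    """Counts the events in a batch."""
    return {"count": len(events)}

toy_agent = LlmAgent(
    name="toy_agent",
    model="gemini-2.0-flash",
    instruction="When given a batch of events, use the tool to count them.",
    tools=[count_events],
    output_key="event_count",  # downstream agents can read this state key
)
```

Those few parameters (an instruction, a tool list, an output key) are essentially all the building blocks the rest of this post uses.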
## Rebuilding a Slice with Agents

I didn’t go for a complete rewrite. Instead, I picked one small part of the pipeline — a transformation module that handles session data, where every new requirement had made adding more logic incredibly challenging. I divided its flow into three steps — validation, deduplication, and sessionization — each implemented as its own agent, coordinated by a root agent, `data_pipeline_agent`.

Not all agents in ADK are the same. ADK offers several types:

- **LlmAgent** — Uses a large language model to interpret instructions and decide how to act.
- **WorkflowAgent** — A deterministic, rule-based agent with no LLM involved. It comes in three variants: Sequential, Parallel, and Loop.
- **ToolOnlyAgent** — Also rule-based; its sole purpose is to run its associated tool when invoked.
- **SubAgents** — Used to define the hierarchy among different agents.
- **CustomAgent** — Lets you define a custom implementation with more fine-grained programmatic control.

For my use case, I selected `LlmAgent` for each component, with each one acting as a sub-agent of `data_pipeline_agent`. Why?

- My use case needed to handle changing and sometimes unpredictable input (e.g., evolving event schemas and slightly changing file formats).
- I wanted the agents to determine their next course of action autonomously.
- Using `LlmAgent` allowed me to move the logic out of the code and into editable instructions, reducing code churn.

I would have chosen the sequential workflow agent type if I had wanted complete control and reproducibility. But in this case, the flexibility and adaptability of LLMs made more sense.

Each `LlmAgent` reads a prompt file (like `validation_instruct.txt`) and uses it to drive decision-making. That’s how I kept the logic dynamic without rewriting code every time business rules changed.

It felt clean. Declarative. And honestly, fun.

## Code Snippet — Agent Setup

```python
validation_agent = LlmAgent(
    name="validation_agent",
    model="gemini-2.0-flash",
    tools=[validate_events],
    instruction=load_instruction_from_file("instructions/validation_instruct.txt"),
    description="You are a data validation agent. Use the tool to validate event logs.",
    output_key="validated_state",
)
```

All the validation logic lives in a text file. Want to change the rules? Just update the prompt — no need to rewrite code.
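For completeness, here is roughly how the root agent could tie the sub-agents together, following the same pattern as `validation_agent` above. This is a sketch, not the repo’s exact code: `instructions/pipeline_instruct.txt` is a path I made up, and `dedup_agent` and `session_agent` are assumed to be defined the same way as `validation_agent`.

```python
# Hypothetical root-agent wiring: the coordinator gets the file-loading
# utility as a tool and the three specialized agents as sub-agents.
data_pipeline_agent = LlmAgent(
    name="data_pipeline_agent",
    model="gemini-2.0-flash",
    instruction=load_instruction_from_file("instructions/pipeline_instruct.txt"),
    description="Coordinates validation, deduplication, and sessionization.",
    tools=[load_event_file],  # a utility the agent decides when to call
    sub_agents=[validation_agent, dedup_agent, session_agent],
)
```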
## Tools Behind the Agents

Each agent had a simple tool associated with it — a lightweight Python function implementing the core logic. What was more interesting than the tools themselves was that each agent could decide when (or whether) to use them, based on the instructions provided:

- **validation_tool** — Enforces the expected schema by checking for the required fields (`user_id`, `event_type`, and `timestamp`) and returns a dictionary with lists of valid and invalid events.
- **dedupe_tool** — Uses hashing to remove exact duplicates while preserving the original order of events.
- **session_tool** — Groups events into sessions based on gaps in user activity. The default threshold is 30 minutes, but the agent can override it by passing a different value in the `session_gap_minutes` parameter if the user supplies a different threshold.

The beauty of this setup is that the agents weren’t tightly coupled to the tools — they had context awareness. An agent can trigger its tool or skip it based on the incoming state. That kind of input-driven decision-making is hard to achieve in traditional pipelines without chaining a bunch of `if-else` logic.

## Example Workflow

Suppose we start with a raw JSON file:

1. **load_event_file** — A utility tool on the root agent that loads the input file provided by the user. Again, this is selective behaviour the agent may or may not use, depending on the data provided.
2. **validation_agent** — Validates the records against the data schema and segregates them into valid and invalid sets.
3. **dedup_agent** — Removes duplicate records.
4. **session_agent** — Groups events into sessions by looking at inactivity gaps — typically 30 minutes between user actions.

The final output includes JSON or markdown summaries: the count of invalid records, deduplicated items, and session totals. Debugging? A dream.

*Figure: Modular pipeline flow powered by ADK. The `data_pipeline_agent` manages state across each specialized agent. (Source: author)*
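Before moving on to the scorecard, here is roughly what `session_tool`’s gap logic boils down to. It’s a minimal sketch under assumed formats (ISO-8601 timestamp strings, one session list per user), not necessarily the repo’s implementation.

```python
# Sketch of the sessionization logic: sort each user's events by time and
# start a new session whenever the gap exceeds session_gap_minutes.
from collections import defaultdict
from datetime import datetime

def session_tool(events: list[dict], session_gap_minutes: int = 30) -> dict:
    """Groups events into per-user sessions split on inactivity gaps."""
    by_user: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        by_user[event["user_id"]].append(event)

    sessions = []
    for user_id, user_events in by_user.items():
        user_events.sort(key=lambda e: e["timestamp"])  # assumes ISO-8601
        current = [user_events[0]]
        for prev, curr in zip(user_events, user_events[1:]):
            gap = (datetime.fromisoformat(curr["timestamp"])
                   - datetime.fromisoformat(prev["timestamp"]))
            if gap.total_seconds() > session_gap_minutes * 60:
                sessions.append({"user_id": user_id, "events": current})
                current = []
            current.append(curr)
        sessions.append({"user_id": user_id, "events": current})

    return {"session_count": len(sessions), "sessions": sessions}
```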
## What Worked Well

- **Modularity**: Each agent could be tested in isolation.
- **No Glue Hell**: ADK handled state transitions between agents.
- **Debugging**: Markdown output made it easy to inspect.
- **Extendability**: Want to add a step? Just add another agent.

## What Still Hurts

- **Sub-Agent Tooling**: You can’t use tools inside sub-agents yet.
- **Silent Failures**: Mismatched state keys cause quiet errors.
- **No Tracing**: ADK lacks built-in logs or failure tracebacks.
- **Missing File Upload**: The web UI doesn’t support file uploads natively.
- **Docs Are Early-Stage**: Good for the basics, light on advanced patterns.

So while it’s promising, you’ll still be doing some trial-and-error — especially for complex orchestration.

## Try It Yourself

🔗 GitHub Repo: https://github.com/vchaurasia95/adk-data-pipeline

## Final Thoughts

I didn’t expect this to work so smoothly. But rethinking my pipeline with ADK turned out to be a pleasant surprise. Instead of layers of brittle logic, I had small, declarative pieces that played nicely together.

No, an LLM agent won’t replace your entire pipeline. But it can definitely make parts of it *think*, especially those repetitive tasks like validation or formatting.

It’s not magic. But it’s modular, testable, and maybe even a little fun. I’ll be building on this more. Curious to see what else it can do.