Why Are the New AI Agents Choosing Markdown Over HTML?

Written by brightdata | Published 2025/03/19
Tech Story Tags: ai | data | ai-agent | llm | web-scraping | future-of-ai | good-company | data-processing

TL;DR: Discover why AI agents convert HTML to Markdown to slash token usage by up to 99%! Faster processing, lower costs: AI efficiency at its best.

AI agents are taking over the world, marking the next big step in AI evolution 🦖. So, what do all these agents have in common? They use Markdown instead of raw HTML when processing content on web pages ⛓️. Curious to know why?

This blog post will show you how this simple trick can save you up to 99% in tokens and money!

AI Agents and Data Processing: An Introduction

AI agents are software systems that harness the power of artificial intelligence to accomplish tasks and pursue goals on behalf of users. Equipped with reasoning, planning, and memory, these agents can make decisions, learn, and adapt, all on their own. 🤯

In recent months, AI agents have taken off, especially in the world of browser automation. These AI agent browsers enable you to use LLMs to control browsers programmatically, automating tasks like adding products to your Amazon cart 🛒.

Ever wondered which libraries and frameworks power AI agents like Crawl4AI, ScrapeGraphAI, and LangChain?

When processing data from web pages, these solutions often convert HTML into Markdown automatically (or offer methods to do so) before sending the data to LLMs. But why do these AI agents favor Markdown over HTML? 🧐

The short answer is: to save tokens and speed up processing! ⏩

Time to dig deeper! But first, let's take a look at another popular approach AI agents use to reduce data load. 👀

From Data Overload to Clarity: AI Agents' First Move

Imagine you want your AI agent to:

  1. Connect to an e-commerce site (e.g. Amazon)

  2. Search for a product (e.g. PlayStation 5)

  3. Extract data from that specific product page

That's a common scenario for an AI agent, as e-commerce scraping is a wild ride 🎢. After all, product pages are a chaotic mess of ever-changing layouts, making programmatic data parsing a nightmare. That's where AI agents flex their superpowers 💪, leveraging LLMs to extract data seamlessly, no matter how messy the page structure!

Now, let's say you're on a mission to grab all the juicy details from the PlayStation 5 product page on Amazon 🎮:

Here's how you'd command your AI agent browser to make it happen:

Navigate to Amazon's homepage. Search for 'PlayStation 5' and select the top result. 
Extract the product title, price, availability, and customer ratings. 
Return the data in a structured JSON format.

That's what the AI agent should (hopefully 🤞) do:

  1. Open Amazon in the browser 🌐

  2. Search for "PlayStation 5" 🔍

  3. Identify the correct product 🎯

  4. Extract the product details from the page and return them in JSON 📄

But here's the real challenge: Step 4. The Amazon PlayStation 5 product page is a beast! The HTML is packed with tons of information, most of which you don't even need.

Want proof? Copy the page's full HTML from your browser's DOM and drop it into a tool like the LLM Token Calculator:

🚨 Brace yourself…

896,871 tokens?! 😱 Yeah, you read that right: eight hundred ninety-six thousand, eight hundred seventy-one freaking tokens!

That's a MASSIVE load of data, aka a ton of money! 💸 (Over $2 per request on GPT-4o! 😬)
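You can sanity-check that figure yourself. Here's a quick sketch, assuming roughly $2.50 per million input tokens for GPT-4o (the rate implied by the summary table later in this post; actual pricing may differ):

```javascript
// Estimate the input cost of sending a given token count to an LLM.
// The per-million-token price is an assumption based on this post's figures.
function inputCostUSD(tokens, pricePerMillionUSD) {
  return (tokens / 1_000_000) * pricePerMillionUSD;
}

// Full Amazon product page: 896,871 tokens at ~$2.50/M input tokens
const cost = inputCostUSD(896_871, 2.5);
console.log(cost.toFixed(4)); // "2.2422"
```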

As you can imagine, passing all that data to an AI agent comes with major limitations:

  1. May require premium/pro plans that support high token usage 💰
  2. Costs a fortune, especially if you're running frequent queries 🤑
  3. Slows down responses since the AI has to process a ridiculous amount of info ⏳

The Fix: Trim the Fat

Most AI agents let you specify a CSS selector to extract only relevant sections of a webpage. Others use heuristic algorithms to auto-filter content, like stripping out headers and footers (which usually add no value). ✂️
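To give a feel for that heuristic filtering, here's a toy regex-based sketch that drops `<header>`, `<footer>`, and `<nav>` blocks from an HTML string. Real agents typically walk a parsed DOM instead of using regular expressions, so treat this as an illustration only:

```javascript
// Naive heuristic filter: drop <header>, <footer>, and <nav> blocks,
// which rarely contain data an LLM needs to extract.
function stripChrome(html) {
  return html
    .replace(/<header[^>]*>[\s\S]*?<\/header>/gi, '')
    .replace(/<footer[^>]*>[\s\S]*?<\/footer>/gi, '')
    .replace(/<nav[^>]*>[\s\S]*?<\/nav>/gi, '');
}

const page = '<header>menu</header><main>PlayStation 5 - $499.00</main><footer>links</footer>';
console.log(stripChrome(page)); // "<main>PlayStation 5 - $499.00</main>"
```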

For example, if you inspect Amazon's PlayStation 5 product page, you'll notice that most of the useful content lives inside the HTML element identified by the #ppd CSS selector:

Now, what if you tell your AI agent to focus only on the #ppd element instead of the entire page? Would that make a difference? 🤔

Let's put it to the test in the head-to-head showdown below! 🔥

Markdown vs HTML in AI Data Processing: A Head-to-Head Comparison

Compare the token usage when processing a portion of a web page directly versus converting it into Markdown.

HTML

In your browser, copy the HTML of the #ppd element, and drop it into an LLM Token Calculator tool:

From 896,871 tokens down to just 309,951, nearly 65% savings!

That's a huge drop, sure, but let's be real: it's still way too many tokens! 😵‍💸

Markdown

Now, let's replicate the trick that AI agents use by leveraging an online HTML-to-Markdown conversion tool. But first, remember that AI agents perform some preprocessing to remove tags that carry no meaningful content, such as <style> and <script>.

You can filter the HTML of the target element using this simple script in your browser's console:

function removeScriptsAndStyles(element) {
  let htmlString = element.innerHTML;

  // Regex to match all <script>...</script> and <style>...</style> tags
  const scriptRegex = /<script[^>]*>[\s\S]*?<\/script>/gi;
  const styleRegex = /<style[^>]*>[\s\S]*?<\/style>/gi;

  // Remove all <script> and <style> tags
  let cleanHTML = htmlString.replace(scriptRegex, '');
  cleanHTML = cleanHTML.replace(styleRegex, '');

  return cleanHTML;
}

// Select the target element and log its cleaned HTML
const ppdElement = document.getElementById('ppd');
console.log(removeScriptsAndStyles(ppdElement));

Next, copy the cleaned HTML and convert it into Markdown using an online HTML-to-Markdown conversion tool:

The resulting Markdown is significantly smaller but still contains all the important text data!
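Under the hood, such converters essentially map HTML tags to their Markdown equivalents. Here's a toy sketch of the idea, handling only a handful of tags; production converters (Turndown, for example) cover nesting, tables, and countless edge cases:

```javascript
// Toy HTML-to-Markdown converter covering a few common tags.
// Illustration only: real converters parse the DOM properly.
function toMarkdown(html) {
  return html
    .replace(/<h1[^>]*>([\s\S]*?)<\/h1>/gi, '# $1\n')
    .replace(/<h2[^>]*>([\s\S]*?)<\/h2>/gi, '## $1\n')
    .replace(/<(?:strong|b)[^>]*>([\s\S]*?)<\/(?:strong|b)>/gi, '**$1**')
    .replace(/<(?:em|i)[^>]*>([\s\S]*?)<\/(?:em|i)>/gi, '*$1*')
    .replace(/<a[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, '[$2]($1)')
    .replace(/<li[^>]*>([\s\S]*?)<\/li>/gi, '- $1\n')
    .replace(/<[^>]+>/g, '')   // drop any remaining tags
    .replace(/\n{2,}/g, '\n')  // collapse blank runs
    .trim();
}

console.log(toMarkdown('<h1>PlayStation 5</h1><p><strong>$499.00</strong> - In stock</p>'));
// # PlayStation 5
// **$499.00** - In stock
```

Notice how all the visible text survives while the markup overhead (attributes, wrappers, classes) disappears, which is exactly where the token savings come from.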

Now, paste this Markdown into the LLM Token Calculator tool:

Boom! 💣 From 896,871 tokens down to just 7,943 tokens. That's a jaw-dropping ~99% savings!

With just basic content removal and the HTML-to-Markdown conversion, you've got a leaner payload, lower costs, and way faster processing. Big win! 💰
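The percentages quoted above are easy to verify against the full-page baseline:

```javascript
// Token savings relative to the full-page HTML baseline.
const fullPage = 896_871;
const savings = (tokens) => (1 - tokens / fullPage) * 100;

console.log(savings(309_951).toFixed(1) + '%'); // "65.4%" for #ppd HTML
console.log(savings(7_943).toFixed(1) + '%');   // "99.1%" for #ppd Markdown
```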

Markdown vs HTML: The Battle for Tokens and Cost Savings

The last step is to verify that the Markdown text still contains all the key data. To do so, pass it to an LLM along with the final part of the original prompt, and here's the JSON result you'll get:

{
  "product_title": "PlayStation®5 console (slim)",
  "price": "$499.00",
  "availability": "In stock",
  "customer_ratings": {
    "rating": 4.6,
    "total_ratings": 5814
  }
}

This is exactly what your AI agent would return, spot on!
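Since LLM output is not guaranteed to be well-formed, it's worth validating the structure before your agent uses it downstream. A minimal sketch, using the field names from the JSON above:

```javascript
// Validate that an LLM response contains the expected product fields.
function parseProductJSON(raw) {
  const data = JSON.parse(raw); // throws on malformed JSON
  for (const key of ['product_title', 'price', 'availability', 'customer_ratings']) {
    if (!(key in data)) throw new Error(`Missing field: ${key}`);
  }
  return data;
}

const raw = '{"product_title":"PlayStation 5","price":"$499.00","availability":"In stock","customer_ratings":{"rating":4.6,"total_ratings":5814}}';
console.log(parseProductJSON(raw).price); // "$499.00"
```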

For a quick overview, check out the final summary table below:

Method          Tokens     o1-mini Price   gpt-4o-mini Price   gpt-4o Price
Entire HTML     896,871    $13.4531        $0.1345             $2.2422
#ppd HTML       309,951    $4.6493         $0.0465             $0.7749
#ppd Markdown   7,943      $0.0596         $0.0012             $0.0199

Where AI Agents Are Failing

All those token-saving tricks are useless if your AI agent gets blocked by the target site 😅 (ever seen how hilarious AI CAPTCHA fails can be? 🤣). That's why AI agents are failing…

So, why does this happen? Simple! Most sites use anti-scraping measures that can easily block automated browsers. Want the full breakdown? Watch our upcoming webinar below:

https://www.youtube.com/watch?v=RArxdFeijd4&embedable=true

If you've followed our advanced web scraping guide, you know the issue isn't with the browser automation tools (the libraries powering your AI agents). Nope, the real culprit is the browser itself. 🤖

To avoid getting blocked, you need a browser built specifically for cloud automation. Enter the Scraping Browser, a browser that:

  • Runs in headed mode just like a regular browser, making it much harder for anti-bot systems to detect you. 🔍
  • Scales effortlessly in the cloud, saving you time and money on infrastructure. 💰
  • Automatically solves CAPTCHAs, handles browser fingerprinting, customizes cookies/headers, and retries failed requests to keep things running smoothly. ⚡
  • Rotates IPs from one of the largest, most reliable proxy networks out there. 🌐
  • Integrates seamlessly with popular automation libraries like Playwright, Selenium, and Puppeteer. 🔧

Learn more about Bright Data's Scraping Browser, the perfect tool to integrate into your AI agents:

https://www.youtube.com/watch?v=kuDuJWvho7Q&embedable=true

Final Thoughts

Now you're in the loop on why AI agents use Markdown for data processing. It's a simple trick to save tokens (and money) while speeding up LLM processing.

Want your AI agent to run without hitting blocks? Take a look at Bright Data's suite of tools for AI! Join us in making the internet accessible to everyone, even through automated AI agent browsers. 🌐

Until next time, keep surfing the Web with freedom! 🏄‍♂️


Written by brightdata | From data collection to ready-made datasets, Bright Data allows you to retrieve the data that matters.
Published by HackerNoon on 2025/03/19