Why Are the New AI Agents Choosing Markdown Over HTML?

Written by brightdata | Published 2025/03/19
Tech Story Tags: ai | data | ai-agent | llm | web-scraping | future-of-ai | good-company | data-processing

TL;DR: Discover why AI agents convert HTML to Markdown to slash token usage by up to 99%! Faster processing, lower costs: AI efficiency at its best.

AI agents are taking over the world, marking the next big step in AI evolution 🦖. So, what do all these agents have in common? They use Markdown instead of raw HTML when processing content on web pages ⛓️. Curious to know why?

This blog post will show you how this simple trick can save you up to 99% in tokens and money!

AI Agents and Data Processing: An Introduction

AI agents are software systems that harness the power of artificial intelligence to accomplish tasks and pursue goals on behalf of users. Equipped with reasoning, planning, and memory, these agents can make decisions, learn, and adapt, all on their own. 🤯

In recent months, AI agents have taken off, especially in the world of browser automation. These AI agent browsers enable you to use LLMs to control browsers programmatically, automating tasks like adding products to your Amazon cart 🛒.

Ever wondered which libraries and frameworks power AI agents like Crawl4AI, ScrapeGraphAI, and LangChain?

When processing data from web pages, these solutions often convert HTML into Markdown automatically (or offer methods to do so) before sending the data to LLMs. But why do these AI agents favor Markdown over HTML? 🧐

The short answer is: to save tokens and speed up processing! ⏩

Time to dig deeper! But first, let's take a look at another popular approach AI agents use to reduce data load. 👀

From Data Overload to Clarity: AI Agents' First Move

Imagine you want your AI agent to:

  1. Connect to an e-commerce site (e.g. Amazon)

  2. Search for a product (e.g. PlayStation 5)

  3. Extract data from that specific product page

That's a common scenario for an AI agent, as e-commerce scraping is a wild ride 🎢. After all, product pages are a chaotic mess of ever-changing layouts, making programmatic data parsing a nightmare. That's where AI agents flex their superpowers 💪, leveraging LLMs to extract data seamlessly, no matter how messy the page structure!

Now, let's say you're on a mission to grab all the juicy details from the PlayStation 5 product page on Amazon 🎮:

Here's how you'd command your AI agent browser to make it happen:

Navigate to Amazon's homepage. Search for 'PlayStation 5' and select the top result. 
Extract the product title, price, availability, and customer ratings. 
Return the data in a structured JSON format.

That's what the AI agent should (hopefully 🤞) do:

  1. Open Amazon in the browser 🌐

  2. Search for "PlayStation 5" 🔍

  3. Identify the correct product 🎯

  4. Extract the product details from the page and return them in JSON 📄

But here's the real challenge: Step 4. The Amazon PlayStation 5 product page is a beast! The HTML is packed with tons of information, most of which you don't even need.

Want proof? Copy the page's full HTML from your browser's DOM and drop it into a tool like the LLM Token Calculator:

🚨 Brace yourself…

896,871 tokens?! 😱 Yeah, you read that right: eight hundred ninety-six thousand, eight hundred seventy-one freaking tokens!

That's a MASSIVE load of data, aka a ton of money! 💸 (Over $2 per request on GPT-4o! 😬)
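You can sanity-check that figure yourself. Here's a quick sketch, assuming roughly $2.50 per million input tokens for GPT-4o (the rate implied by the summary table later in this post; actual pricing may differ):

```javascript
// Estimate the input cost of sending a given token count to an LLM.
// The per-million-token price is an assumption based on this post's figures.
function inputCostUSD(tokens, pricePerMillionUSD) {
  return (tokens / 1_000_000) * pricePerMillionUSD;
}

// Full Amazon product page: 896,871 tokens at ~$2.50/M input tokens
const cost = inputCostUSD(896_871, 2.5);
console.log(cost.toFixed(4)); // "2.2422"
```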

As you can imagine, passing all that data to an AI agent comes with major limitations:

  1. May require premium/pro plans that support high token usage 💰
  2. Costs a fortune, especially if you're running frequent queries 🤑
  3. Slows down responses since the AI has to process a ridiculous amount of info ⏳

The Fix: Trim the Fat

Most AI agents let you specify a CSS selector to extract only relevant sections of a webpage. Others use heuristic algorithms to auto-filter content, like stripping out headers and footers (which usually add no value). ✂️
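To give a feel for that heuristic filtering, here's a toy regex-based sketch that drops `<header>`, `<footer>`, and `<nav>` blocks from an HTML string. Real agents typically walk a parsed DOM instead of using regular expressions, so treat this as an illustration only:

```javascript
// Naive heuristic filter: drop <header>, <footer>, and <nav> blocks,
// which rarely contain data an LLM needs to extract.
function stripChrome(html) {
  return html
    .replace(/<header[^>]*>[\s\S]*?<\/header>/gi, '')
    .replace(/<footer[^>]*>[\s\S]*?<\/footer>/gi, '')
    .replace(/<nav[^>]*>[\s\S]*?<\/nav>/gi, '');
}

const page = '<header>menu</header><main>PlayStation 5 - $499.00</main><footer>links</footer>';
console.log(stripChrome(page)); // "<main>PlayStation 5 - $499.00</main>"
```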

For example, if you inspect Amazon's PlayStation 5 product page, you'll notice that most of the useful content lives inside the HTML element identified by the #ppd CSS selector:

Now, what if you tell your AI agent to focus only on the #ppd element instead of the entire page? Would that make a difference? 🤔

Let's put it to the test in the head-to-head showdown below! 🔥

Markdown vs HTML in AI Data Processing: A Head-to-Head Comparison

Compare the token usage when processing a portion of a web page directly versus converting it into Markdown.

HTML

In your browser, copy the HTML of the #ppd element, and drop it into an LLM Token Calculator tool:

From 896,871 tokens down to just 309,951, nearly 65% savings!

That's a huge drop, sure, but let's be real: it's still way too many tokens! 😵‍💸

Markdown

Now, let's replicate the trick that AI agents use by leveraging an online HTML-to-Markdown conversion tool. But first, remember that AI agents perform some preprocessing to remove tags that carry no meaningful content, such as <style> and <script>.

You can filter the HTML of the target element using this simple script in your browser's console:

function removeScriptsAndStyles(element) {
  let htmlString = element.innerHTML;

  // Regex to match all <script>...</script> and <style>...</style> tags
  const scriptRegex = /<script[^>]*>[\s\S]*?<\/script>/gi;
  const styleRegex = /<style[^>]*>[\s\S]*?<\/style>/gi;

  // Remove all <script> and <style> tags
  let cleanHTML = htmlString.replace(scriptRegex, '');
  cleanHTML = cleanHTML.replace(styleRegex, '');

  return cleanHTML;
}

// Select the target element and log its cleaned HTML
const ppdElement = document.getElementById('ppd');
console.log(removeScriptsAndStyles(ppdElement));

Next, copy the cleaned HTML and convert it into Markdown using an online HTML-to-Markdown conversion tool:

The resulting Markdown is significantly smaller but still contains all the important text data!
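Under the hood, such converters essentially map HTML tags to their Markdown equivalents. Here's a toy sketch of the idea, handling only a handful of tags; production converters (Turndown, for example) cover nesting, tables, and countless edge cases:

```javascript
// Toy HTML-to-Markdown converter covering a few common tags.
// Illustration only: real converters parse the DOM properly.
function toMarkdown(html) {
  return html
    .replace(/<h1[^>]*>([\s\S]*?)<\/h1>/gi, '# $1\n')
    .replace(/<h2[^>]*>([\s\S]*?)<\/h2>/gi, '## $1\n')
    .replace(/<(?:strong|b)[^>]*>([\s\S]*?)<\/(?:strong|b)>/gi, '**$1**')
    .replace(/<(?:em|i)[^>]*>([\s\S]*?)<\/(?:em|i)>/gi, '*$1*')
    .replace(/<a[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, '[$2]($1)')
    .replace(/<li[^>]*>([\s\S]*?)<\/li>/gi, '- $1\n')
    .replace(/<[^>]+>/g, '')   // drop any remaining tags
    .replace(/\n{2,}/g, '\n')  // collapse blank runs
    .trim();
}

console.log(toMarkdown('<h1>PlayStation 5</h1><p><strong>$499.00</strong> - In stock</p>'));
// # PlayStation 5
// **$499.00** - In stock
```

Notice how all the visible text survives while the markup overhead (attributes, wrappers, classes) disappears, which is exactly where the token savings come from.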

Now, paste this Markdown into the LLM Token Calculator tool:

Boom! 💣 From 896,871 tokens down to just 7,943 tokens. That's a jaw-dropping ~99% savings!

With just basic content removal and the HTML-to-Markdown conversion, you've got a leaner payload, lower costs, and way faster processing. Big win! 💰
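The percentages quoted above are easy to verify against the full-page baseline:

```javascript
// Token savings relative to the full-page HTML baseline.
const fullPage = 896_871;
const savings = (tokens) => (1 - tokens / fullPage) * 100;

console.log(savings(309_951).toFixed(1) + '%'); // "65.4%" for #ppd HTML
console.log(savings(7_943).toFixed(1) + '%');   // "99.1%" for #ppd Markdown
```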

Markdown vs HTML: The Battle for Tokens and Cost Savings

The last step is to verify that the Markdown text still contains all the key data. To do so, pass it to an LLM along with the final part of the original prompt, and here's the JSON result you'll get:

{
  "product_title": "PlayStation®5 console (slim)",
  "price": "$499.00",
  "availability": "In stock",
  "customer_ratings": {
    "rating": 4.6,
    "total_ratings": 5814
  }
}

This is exactly what your AI agent would return, spot on!
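Since LLM output is not guaranteed to be well-formed, it's worth validating the structure before your agent uses it downstream. A minimal sketch, using the field names from the JSON above:

```javascript
// Validate that an LLM response contains the expected product fields.
function parseProductJSON(raw) {
  const data = JSON.parse(raw); // throws on malformed JSON
  for (const key of ['product_title', 'price', 'availability', 'customer_ratings']) {
    if (!(key in data)) throw new Error(`Missing field: ${key}`);
  }
  return data;
}

const raw = '{"product_title":"PlayStation 5","price":"$499.00","availability":"In stock","customer_ratings":{"rating":4.6,"total_ratings":5814}}';
console.log(parseProductJSON(raw).price); // "$499.00"
```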

For a quick overview, check out the final summary table below:

Method          Tokens     o1-mini Price   gpt-4o-mini Price   gpt-4o Price
Entire HTML     896,871    $13.4531        $0.1345             $2.2422
#ppd HTML       309,951    $4.6493         $0.0465             $0.7749
#ppd Markdown   7,943      $0.0596         $0.0012             $0.0199

Where AI Agents Are Failing

All those token-saving tricks are useless if your AI agent gets blocked by the target site 😅 (ever seen how hilarious AI CAPTCHA fails can be? 🤣). That's why AI agents are failing…

So, why does this happen? Simple! Most sites use anti-scraping measures that can easily block automated browsers. Want the full breakdown? Watch our upcoming webinar below:

https://www.youtube.com/watch?v=RArxdFeijd4&embedable=true

If you've followed our advanced web scraping guide, you know the issue isn't with the browser automation tools (the libraries powering your AI agents). Nope, the real culprit is the browser itself. 🤖

To avoid getting blocked, you need a browser built specifically for cloud automation. Enter the Scraping Browser, a browser that:

  • Runs in headed mode just like a regular browser, making it much harder for anti-bot systems to detect you. 🔍
  • Scales effortlessly in the cloud, saving you time and money on infrastructure. 💰
  • Automatically solves CAPTCHAs, handles browser fingerprinting, customizes cookies/headers, and retries failed requests to keep things running smoothly. ⚡
  • Rotates IPs from one of the largest, most reliable proxy networks out there. 🌐
  • Integrates seamlessly with popular automation libraries like Playwright, Selenium, and Puppeteer. 🔧

Learn more about Bright Data's Scraping Browser, the perfect tool to integrate into your AI agents:

https://www.youtube.com/watch?v=kuDuJWvho7Q&embedable=true

Final Thoughts

Now you're in the loop on why AI agents use Markdown for data processing. It's a simple trick to save tokens (and money) while speeding up LLM processing.

Want your AI agent to run without hitting blocks? Take a look at Bright Data's suite of tools for AI! Join us in making the internet accessible to everyone, even through automated AI agent browsers. 🌐

Until next time, keep surfing the Web with freedom! 🏄‍♂️


Written by brightdata | From data collection to ready-made datasets, Bright Data allows you to retrieve the data that matters.
Published by HackerNoon on 2025/03/19