In retrieval-augmented generation (RAG) pipelines, input efficiency is paramount, not just in terms of tokens but also in terms of character limits. When building a multilingual embedding pipeline, I faced a real challenge: the Cohere multilingual model imposes a maximum of 2048 characters (not a token limit) per input. This article walks you through a clever solution: preprocessing Markdown tables into dense JSON blocks using `pymtd2json`, to ensure smooth, efficient embeddings without errors.

## The Challenge: Character Limits vs. Token Limits

Classical chunking methods, like `SentenceSplitter` from LlamaIndex, are token-focused: you set a maximum number of tokens per chunk, but not a maximum number of characters.

**Why this matters:** Markdown (especially GitHub-Flavored Markdown, GFM) wastes space with formatting. A Markdown chunk might contain only 170 tokens but still exceed 2048 characters. This results in rejected API requests or inefficient extra splitting.

**Important note:** Markdown tables are up to **3x less token-efficient** than other formats, further compounding the problem. 👉 Read more on the token inefficiency of Markdown tables here.

## A Real-World Example: Measuring the Problem

Let's dive into a simple simulation.

### Step 1: Create a Large Markdown Table

```python
import pandas as pd

# Build data
data = {
    "Name": [f"Person{i}" for i in range(30)],
    "Age": [20 + i for i in range(30)],
    "City": [f"City{i}" for i in range(30)],
}

# Create DataFrame and widen the first header to force heavy padding
df = pd.DataFrame(data)
df.columns = [
    "A very long row content, which leads to a lot of white spaces",
    "Age",
    "City",
]

# Convert to Markdown
table_text = df.to_markdown(index=False)
print(table_text)
```

This generates a verbose table with 30 rows and a very long header.
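To see where the bloat comes from, here are the first few lines of the generated table. The exact padding depends on your pandas/tabulate versions, but the pattern is the same: every cell in the first column is padded to the width of that very long header.

```
| A very long row content, which leads to a lot of white spaces   |   Age | City   |
|:-----------------------------------------------------------------|------:|:-------|
| Person0                                                           |    20 | City0  |
| Person1                                                           |    21 | City1  |
```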
### Step 2: Analyze Token and Character Counts

Using Cohere's tokenizer (available via Hugging Face):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")

encoded = tokenizer(table_text, return_tensors="pt", add_special_tokens=False)
num_tokens = encoded.input_ids.shape[-1]
num_chars = len(table_text)

print(f"Characters: {num_chars}")
print(f"Tokens: {num_tokens}")
```

Result:

```
Characters: 2719
Tokens: 432
```

⚡ **Problem:** While the token count is fine, the character count exceeds 2048, causing API errors like:

```
cohere.error.CohereAPIError: input text exceeds maximum allowed size of 2048 characters
```

## The Solution: Minifying Tables into JSON

Instead of traditional Markdown, why not store the data in a **dense JSON block**?

Benefits of minifying tables:

- Removes pipes, dashes, and whitespace: all the formatting overhead.
- Preserves semantic meaning.
- Shrinks the text to meet character limits safely.

(A minimal sketch of such a minifier appears at the end of this article.)

Example of the compact JSON:

```json
{"Name":["Person0","Person1","Person2",...],"Age":["20","21","22",...],"City":["City0","City1","City2",...]}
```

New stats:

```
Characters: 1027
Tokens: 461
```

✅ Now well within Cohere's input limit! (The token count actually rose slightly, but the character count, the binding constraint here, dropped by more than 60%.)

## Applying Minification in Practice

Want to prepare documents before chunking? Here's how you can automatically process all Markdown files:

```python
from pathlib import Path

from llama_index.core import SimpleDirectoryReader
from your_minifier import MinifyMDT  # placeholder: your wrapper around the table minifier

source_dir = Path("example_dir", "markdown")
documents = SimpleDirectoryReader(
    source_dir, required_exts=[".md"], recursive=True
).load_data()

doc_texts = []
for doc in documents:
    doc_texts.append(MinifyMDT(doc.text_resource.text).transform())
```

👉 And voilà: **your data is compact, clean, and embedding-ready!**

## Final Thoughts

Working with multilingual RAG systems means optimizing every byte. Whitespace-heavy Markdown tables might look nice for humans, but they're expensive for machine understanding. By minifying your tables with `pymtd2json`, you:

- Cut down API errors.
- Eliminate formatting overhead.
- Boost overall performance.

Efficiency isn't optional, it's a superpower. 🚀
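## Bonus: A Minimal Minifier Sketch

To make the transformation concrete, here is a minimal sketch of how a Markdown-table-to-JSON minifier can work. This is a simplified illustration, not `pymtd2json`'s actual implementation; the function name `minify_md_table` is hypothetical, and it assumes a single well-formed GFM table with a header row and a separator row.

```python
import json


def minify_md_table(md_table: str) -> str:
    """Convert a well-formed GFM table into a compact JSON block.

    Simplified illustration: expects a header row, a separator row,
    and pipe-delimited data rows; it is not a full GFM parser.
    """

    def split_row(line: str) -> list[str]:
        # "| a | b |" -> ["a", "b"]
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    lines = [line for line in md_table.strip().splitlines() if line.strip()]
    header = split_row(lines[0])
    rows = [split_row(line) for line in lines[2:]]  # lines[1] is the separator

    # Column-oriented layout, matching the compact JSON shown earlier;
    # note that all values stay as strings, as in the example above.
    columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}

    # separators=(",", ":") drops all optional whitespace from the output.
    return json.dumps(columns, separators=(",", ":"), ensure_ascii=False)
```

Running `print(minify_md_table(table_text))` on the table from Step 1 yields the compact, column-oriented JSON shown earlier. In a real pipeline you would still verify `len(chunk) <= 2048` before sending each chunk to the embedding API.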