Date: 2025-05-07

In retrieval-augmented generation (RAG) pipelines, input efficiency is paramount, not just in terms of tokens but also in terms of character limits.





When building a multilingual embedding pipeline, I faced a real challenge: the Cohere multilingual model imposes a maximum of 2048 characters per input, not a token limit.





This article walks you through a clever solution: preprocessing Markdown tables into dense JSON blocks using pymtd2json to ensure smooth, efficient embeddings without errors.

The Challenge: Character Limits vs Token Limits





Classical chunking methods, like SentenceSplitter from LlamaIndex, are token-focused: you set a maximum number of tokens per chunk, but not a maximum number of characters.

Why This Matters:

Markdown (especially GitHub-Flavored Markdown, GFM) wastes space with formatting.

A Markdown chunk might have only 170 tokens but still exceed 2048 characters.

This results in rejected API requests or inefficient extra splitting.
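Because a token-based splitter gives no guarantee about characters, one option is to add a character-budget pass after chunking. The following is a minimal sketch, not part of SentenceSplitter or the Cohere SDK: the function name `split_to_char_budget` and the constant `MAX_CHARS` are hypothetical, and it assumes that splitting on line boundaries is acceptable for your content.

```python
MAX_CHARS = 2048  # per-input character limit described in this article


def split_to_char_budget(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text on line boundaries so that no piece exceeds max_chars."""
    pieces: list[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # A single over-long line is hard-cut into max_chars-sized slices.
        while len(line) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new piece when adding this line would break the budget.
        if len(current) + len(line) > max_chars:
            pieces.append(current)
            current = ""
        current += line
    if current:
        pieces.append(current)
    return pieces


# Usage: a 3000-character chunk of short table rows is cut into two pieces.
chunk = "| a | b |\n" * 300
pieces = split_to_char_budget(chunk)
print([len(p) for p in pieces])
```

This kind of fallback is the "inefficient extra splitting" mentioned above: it keeps requests valid, but it can cut a table in the middle, which is why shrinking the table itself is the more attractive fix.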









Important Note: Markdown tables are up to 3x less token-efficient than other formats, further compounding the problem. 👉 Read more on token inefficiency of Markdown tables here.

A Real-World Example: Measuring the Problem

Let's dive into a simple simulation:

Step 1: Create a Large Markdown Table

    import pandas as pd  # DataFrame.to_markdown also requires the `tabulate` package

    # Build data
    data = {
        "Name": [f"Person{i}" for i in range(30)],
        "Age": [20 + i for i in range(30)],
        "City": [f"City{i}" for i in range(30)],
    }

    # Create DataFrame
    df = pd.DataFrame(data)
    df.columns = ["A very long row content, which leads to a lot of white spaces", "Age", "City"]

    # Convert to Markdown
    table_text = df.to_markdown(index=False)
    print(table_text)

This generates a verbose table with 30 rows and a very long header.

Step 2: Analyze Token and Character Counts

Using Cohere’s tokenizer (available via Hugging Face):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")
    encoded = tokenizer(table_text, return_tensors="pt", add_special_tokens=False)

    num_tokens = encoded.input_ids.shape[-1]
    num_chars = len(table_text)

    print(f"Characters: {num_chars}")
    print(f"Tokens: {num_tokens}")

Result:

Characters: 2719

Tokens: 432
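To see where those characters go, it can help to compare the table's total length with the length of the actual cell contents. The sketch below is a rough estimate under stated assumptions: `cell_payload_chars` is a hypothetical helper, it only handles simple pipe-delimited tables, and it treats everything that is not trimmed cell text (pipes, padding, the separator row, newlines) as overhead. It uses a small hand-written sample rather than the 30-row table above.

```python
def cell_payload_chars(table_text: str) -> int:
    """Count characters that belong to cell contents only."""
    total = 0
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header separator row, e.g. |:----|----:|
        if cells and all(c and set(c) <= set("-:") for c in cells):
            continue
        total += sum(len(c) for c in cells)
    return total


sample = "\n".join([
    "| Name    |   Age | City  |",
    "|:--------|------:|:------|",
    "| Person0 |    20 | City0 |",
    "| Person1 |    21 | City1 |",
])

payload = cell_payload_chars(sample)
overhead = 1 - payload / len(sample)
print(f"Total: {len(sample)}, payload: {payload}, overhead: {overhead:.0%}")
```

The wider the longest cell in a column, the more padding every other row in that column carries, which is why the long header in Step 1 inflates the character count so much.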





⚡ Problem: While token count is fine, character count exceeds 2048, causing API errors like:

cohere.error.CohereAPIError: input text exceeds maximum allowed size of 2048 characters

The Solution: Minifying Tables into JSON

Instead of traditional Markdown, why not store the data in a dense JSON block?

Benefits of Minifying Tables:

Remove pipes, dashes, and padding whitespace, which are all formatting overhead.

Preserve semantic meaning.

Shrink text to meet character limits safely.

Example of the compact JSON:

{"Name":["Person0","Person1","Person2",...],"Age":["20","21","22",...],"City":["City0","City1","City2",...]}

New Stats:

Characters: 1027

Tokens: 461

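✅ Now well within Cohere’s 2048-character input limit! Note that the saving is in characters: the token count in this example is roughly unchanged (432 before, 461 after).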
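For readers who want to see the idea in code, here is a minimal sketch of converting one Markdown table into column-oriented compact JSON. It is not the pymtd2json implementation: `minify_markdown_table` and `_split_row` are hypothetical names, and the sketch assumes a simple pipe-delimited table with a header row, no escaped pipes inside cells, and no data row made up purely of dashes and colons.

```python
import json


def _split_row(line: str) -> list[str]:
    """Split one table row into trimmed cells, dropping the outer pipes."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def minify_markdown_table(table_text: str) -> str:
    """Convert a pipe-delimited Markdown table into compact JSON."""
    rows = []
    for line in table_text.strip().splitlines():
        cells = _split_row(line)
        # Drop the separator row that follows the header, e.g. |:---|---:|
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    if not rows:
        return "{}"
    header, body = rows[0], rows[1:]
    # Short rows are padded with empty strings instead of raising IndexError.
    columns = {
        name: [row[i] if i < len(row) else "" for row in body]
        for i, name in enumerate(header)
    }
    # separators=(",", ":") removes the spaces json.dumps adds by default.
    return json.dumps(columns, separators=(",", ":"), ensure_ascii=False)


table = "\n".join([
    "| Name    |   Age | City  |",
    "|:--------|------:|:------|",
    "| Person0 |    20 | City0 |",
    "| Person1 |    21 | City1 |",
])
compact = minify_markdown_table(table)
print(compact)
print(len(table), "->", len(compact))
```

Storing each column name once and the values as plain arrays is what removes the per-row padding. All values come out as strings because they are read from text, which matches the compact JSON example above.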

Applying Minification in Practice





Want to prepare documents before chunking? Here's how you can automatically process all Markdown files:

    from pathlib import Path

    # Releases that expose `doc.text_resource` ship the reader under llama_index.core
    from llama_index.core import SimpleDirectoryReader
    from your_minifier import MinifyMDT

    source_dir = Path("example_dir", "markdown")
    documents = SimpleDirectoryReader(
        source_dir, required_exts=[".md"], recursive=True
    ).load_data()

    # Minify the tables in every loaded document before chunking
    doc_texts = []
    for doc in documents:
        doc_texts.append(MinifyMDT(doc.text_resource.text).transform())

👉 And voilà: Your data is compact, clean, and embedding-ready!
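The internals of MinifyMDT are not shown in this article. As a rough illustration of what a document-level pass could look like, the sketch below finds each run of consecutive table lines in a Markdown string and hands it to a converter function. The name `replace_tables` is hypothetical, and the sketch assumes that every table line starts with a pipe and that no pipe-prefixed lines appear inside fenced code blocks.

```python
from typing import Callable


def replace_tables(markdown: str, convert: Callable[[str], str]) -> str:
    """Replace each run of consecutive pipe-prefixed lines with convert(run)."""
    out: list[str] = []
    block: list[str] = []

    def flush() -> None:
        # Emit the converted table collected so far, if any.
        if block:
            out.append(convert("\n".join(block)))
            block.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("|"):
            block.append(line)
        else:
            flush()
            out.append(line)
    flush()
    return "\n".join(out)


# Usage with a stand-in converter; a real one would return minified JSON.
doc = "# Title\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\nSome prose."
result = replace_tables(doc, lambda t: f"<table with {len(t.splitlines())} lines>")
print(result)
```

Keeping the table detection separate from the conversion makes it easy to swap in any table-to-JSON function while leaving the surrounding prose untouched.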

Final Thoughts





Working with multilingual RAG systems means optimizing every byte. Whitespace-heavy Markdown tables might look nice for humans, but they’re expensive for machine understanding.

By minifying your tables with pymtd2json, you:

Cut down API errors.

Reduce character overhead.

Boost overall performance.

Efficiency isn't optional, it's a superpower. 🚀



