In retrieval-augmented generation (RAG) pipelines, input efficiency is paramount, not just in terms of tokens but also in terms of character limits. When building a multilingual embedding pipeline, I faced a real challenge: the Cohere multilingual model imposes a maximum of 2048 characters (not a token limit) per input. This article walks you through a clever solution: preprocessing Markdown tables into dense JSON blocks using `pymtd2json`, to ensure smooth, efficient embeddings without errors.

## The Challenge: Character Limits vs. Token Limits

Classical chunking methods, like `SentenceSplitter` from LlamaIndex, are token-focused: you set a maximum number of tokens per chunk, but not a maximum number of characters.

**Why this matters:** Markdown (especially GitHub-Flavored Markdown, GFM) wastes space with formatting. A Markdown chunk might contain only 170 tokens but still exceed 2048 characters. This results in rejected API requests or inefficient extra splitting.

**Important note:** Markdown tables are up to **3x less token-efficient** than other formats, further compounding the problem. 👉 Read more on the token inefficiency of Markdown tables here.

## A Real-World Example: Measuring the Problem

Let's dive into a simple simulation.

### Step 1: Create a Large Markdown Table

```python
import pandas as pd

# Build data
data = {
    "Name": [f"Person{i}" for i in range(30)],
    "Age": [20 + i for i in range(30)],
    "City": [f"City{i}" for i in range(30)],
}

# Create DataFrame and widen the first header to force heavy padding
df = pd.DataFrame(data)
df.columns = [
    "A very long row content, which leads to a lot of white spaces",
    "Age",
    "City",
]

# Convert to Markdown
table_text = df.to_markdown(index=False)
print(table_text)
```

This generates a verbose table with 30 rows and a very long header.
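To see where the bloat comes from, here are the first few lines of the generated table. The exact padding depends on your pandas/tabulate versions, but the pattern is the same: every cell in the first column is padded to the width of that very long header.

```
| A very long row content, which leads to a lot of white spaces   |   Age | City   |
|:-----------------------------------------------------------------|------:|:-------|
| Person0                                                           |    20 | City0  |
| Person1                                                           |    21 | City1  |
```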
### Step 2: Analyze Token and Character Counts

Using Cohere's tokenizer (available via Hugging Face):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")

encoded = tokenizer(table_text, return_tensors="pt", add_special_tokens=False)
num_tokens = encoded.input_ids.shape[-1]
num_chars = len(table_text)

print(f"Characters: {num_chars}")
print(f"Tokens: {num_tokens}")
```

Result:

```
Characters: 2719
Tokens: 432
```

⚡ **Problem:** While the token count is fine, the character count exceeds 2048, causing API errors like:

```
cohere.error.CohereAPIError: input text exceeds maximum allowed size of 2048 characters
```

## The Solution: Minifying Tables into JSON

Instead of traditional Markdown, why not store the data in a **dense JSON block**?

Benefits of minifying tables:

- Removes pipes, dashes, and whitespace: all the formatting overhead.
- Preserves semantic meaning.
- Shrinks the text to meet character limits safely.

(A minimal sketch of such a minifier appears at the end of this article.)

Example of the compact JSON:

```json
{"Name":["Person0","Person1","Person2",...],"Age":["20","21","22",...],"City":["City0","City1","City2",...]}
```

New stats:

```
Characters: 1027
Tokens: 461
```

✅ Now well within Cohere's input limit! (The token count actually rose slightly, but the character count, the binding constraint here, dropped by more than 60%.)

## Applying Minification in Practice

Want to prepare documents before chunking? Here's how you can automatically process all Markdown files:

```python
from pathlib import Path

from llama_index.core import SimpleDirectoryReader
from your_minifier import MinifyMDT  # placeholder: your wrapper around the table minifier

source_dir = Path("example_dir", "markdown")
documents = SimpleDirectoryReader(
    source_dir, required_exts=[".md"], recursive=True
).load_data()

doc_texts = []
for doc in documents:
    doc_texts.append(MinifyMDT(doc.text_resource.text).transform())
```

👉 And voilà: **your data is compact, clean, and embedding-ready!**

## Final Thoughts

Working with multilingual RAG systems means optimizing every byte. Whitespace-heavy Markdown tables might look nice for humans, but they're expensive for machine understanding. By minifying your tables with `pymtd2json`, you:

- Cut down API errors.
- Eliminate formatting overhead.
- Boost overall performance.

Efficiency isn't optional, it's a superpower. 🚀
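## Bonus: A Minimal Minifier Sketch

To make the transformation concrete, here is a minimal sketch of how a Markdown-table-to-JSON minifier can work. This is a simplified illustration, not `pymtd2json`'s actual implementation; the function name `minify_md_table` is hypothetical, and it assumes a single well-formed GFM table with a header row and a separator row.

```python
import json


def minify_md_table(md_table: str) -> str:
    """Convert a well-formed GFM table into a compact JSON block.

    Simplified illustration: expects a header row, a separator row,
    and pipe-delimited data rows; it is not a full GFM parser.
    """

    def split_row(line: str) -> list[str]:
        # "| a | b |" -> ["a", "b"]
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    lines = [line for line in md_table.strip().splitlines() if line.strip()]
    header = split_row(lines[0])
    rows = [split_row(line) for line in lines[2:]]  # lines[1] is the separator

    # Column-oriented layout, matching the compact JSON shown earlier;
    # note that all values stay as strings, as in the example above.
    columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}

    # separators=(",", ":") drops all optional whitespace from the output.
    return json.dumps(columns, separators=(",", ":"), ensure_ascii=False)
```

Running `print(minify_md_table(table_text))` on the table from Step 1 yields the compact, column-oriented JSON shown earlier. In a real pipeline you would still verify `len(chunk) <= 2048` before sending each chunk to the embedding API.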