AnyModal is a framework designed to unify multiple “modalities” (such as images, text, or other data) into a single, coherent workflow. Instead of juggling separate libraries or writing custom code to bridge vision and language models, AnyModal provides a structured pipeline where each component—image encoders, tokenizers, language models—can be plugged in without heavy customization. By handling the underlying connections between these pieces, AnyModal lets you focus on the high-level process: feeding in an image, for instance, and getting out a textual result.
In practice, AnyModal can help with tasks like image captioning, classification, or in the case demonstrated here, LaTeX OCR. Because the framework is modular, it’s relatively simple to swap one model for another (e.g., a different vision backbone or a new language model), making it flexible for experimentation or specialized use cases.
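For example, assuming the llm and vision helper modules from the AnyModal repository (introduced below) and that get_image_encoder accepts any compatible Hugging Face checkpoint name, swapping the vision backbone is roughly a one-line change rather than a rewrite of the pipeline:

import llm
import vision

# Same language model as in the rest of this tutorial
tokenizer, model = llm.get_llm("meta-llama/Llama-3.2-1B")

# Swap in a different vision backbone simply by naming another checkpoint;
# "google/siglip-base-patch16-224" is used here purely as an illustration.
image_processor, vision_model, vision_hidden_size = vision.get_image_encoder("google/siglip-base-patch16-224")
vision_encoder = vision.VisionEncoder(vision_model)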
Converting an image of a mathematical expression into a valid LaTeX string requires bridging computer vision and natural language processing. The image encoder’s job is to extract features or symbolic patterns from the equation, such as recognizing “plus,” “minus,” and other symbols. The language component then uses these features to predict the proper LaTeX tokens in sequence.
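To make that bridge concrete, the connection between the two sides can be pictured as a small projection network that maps vision-encoder features into the language model's embedding space, so the equation image becomes a prefix of "visual tokens" for the decoder. Below is a minimal, self-contained sketch of the idea (a hypothetical FeatureProjector, not AnyModal's own Projector class); the shapes assume a ViT-Base encoder (768-dimensional features, 196 patches) and Llama 3.2 1B (2048-dimensional embeddings):

import torch
import torch.nn as nn

class FeatureProjector(nn.Module):
    """Map vision-encoder features into the language model's embedding space."""
    def __init__(self, vision_hidden_size: int, lm_hidden_size: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_hidden_size, lm_hidden_size),
            nn.GELU(),
            nn.Linear(lm_hidden_size, lm_hidden_size),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_hidden_size)
        # output:          (batch, num_patches, lm_hidden_size)
        return self.proj(vision_features)

projector = FeatureProjector(vision_hidden_size=768, lm_hidden_size=2048)
visual_tokens = projector(torch.randn(1, 196, 768))
print(visual_tokens.shape)  # torch.Size([1, 196, 2048])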
LaTeX OCR with AnyModal is essentially a demonstration of how quickly you can pair a vision encoder with a language model. While this example specifically addresses equations, the overall approach can be extended to other image-to-text scenarios, including more advanced or specialized mathematical notation.
By the end of this tutorial, you will know how to use AnyModal, along with Llama 3.2 1B and Google's SigLIP, to create a small VLM for LaTeX OCR tasks.
Note that the weights released at AnyModal/LaTeX-OCR-Llama-3.2-1B were obtained by training on only 20% of the unsloth/LaTeX_OCR dataset. You will likely get a better model by training on the entire dataset and over a greater number of epochs.
For those primarily interested in generating LaTeX from existing images, here’s a demonstration using pretrained weights. This avoids the need to train anything from scratch, offering a quick path to see AnyModal in action. Below is a concise overview of setting up your environment, downloading the necessary models, and running inference.
Clone the AnyModal Repository:
git clone https://github.com/ritabratamaiti/AnyModal.git
Install the necessary libraries:
pip install torch torchvision transformers datasets huggingface_hub pillow tqdm
Then, download pretrained weights hosted on the Hugging Face Hub:
from huggingface_hub import snapshot_download
snapshot_download("AnyModal/LaTeX-OCR-Llama-3.2-1B", local_dir="latex_ocr_model")
These specific weights can be found here: LaTeX-OCR-Llama-3.2-1B on Hugging Face
Next, load the vision encoder and language model:
import llm
import anymodal
import vision
from PIL import Image
# Load language model and tokenizer
tokenizer, model = llm.get_llm("meta-llama/Llama-3.2-1B")
# Load vision-related components
image_processor, vision_model, vision_hidden_size = vision.get_image_encoder('google/vit-base-patch16-224')
vision_encoder = vision.VisionEncoder(vision_model)
# Configure the multimodal pipeline
multimodal_model = anymodal.MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision.Projector(vision_hidden_size, llm.get_hidden_size(tokenizer, model), num_hidden=1),
    language_tokenizer=tokenizer,
    language_model=model,
    prompt_text="The LaTeX expression of the equation in the image is:"
)
# Load the pretrained model weights
multimodal_model._load_model("latex_ocr_model")
multimodal_model.eval()
Finally, provide the image and get the LaTeX output:
# Replace with the path to your equation image
image_path = "path_to_equation_image.png"
image = Image.open(image_path).convert("RGB")
processed_image = image_processor(image, return_tensors="pt")
processed_image = {k: v.squeeze(0) for k, v in processed_image.items()}
latex_output = multimodal_model.generate(processed_image, max_new_tokens=120)
print("Generated LaTeX:", latex_output)
This simple sequence of steps runs the entire pipeline—analyzing the image, projecting it into the language model’s space, and generating the corresponding LaTeX.
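If you have a whole folder of equation images, the same calls can simply be wrapped in a loop. Here is a small sketch that reuses the image_processor and multimodal_model objects created above (the directory name is hypothetical):

import os

image_dir = "equation_images"  # hypothetical folder of equation image crops
for filename in sorted(os.listdir(image_dir)):
    if not filename.lower().endswith((".png", ".jpg", ".jpeg")):
        continue
    image = Image.open(os.path.join(image_dir, filename)).convert("RGB")
    processed = image_processor(image, return_tensors="pt")
    processed = {k: v.squeeze(0) for k, v in processed.items()}
    latex = multimodal_model.generate(processed, max_new_tokens=120)
    print(f"{filename}: {latex}")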
For those who want more control, such as adapting the model to new data or exploring the mechanics of a vision-language pipeline, the training process provides deeper insight. The sections below illustrate how data is prepared, how the model’s components are integrated, and how they are jointly optimized.
Rather than relying on pretrained components alone, you can acquire a training dataset of paired images and LaTeX labels. One example is the unsloth/LaTeX_OCR
dataset, which contains images of equations along with their LaTeX strings. After installing dependencies and setting up your dataset, the steps to train include creating a data pipeline, initializing the model, and looping through epochs.
Here’s an outline for preparing the dataset and loading it:
from torch.utils.data import Subset
import vision
# Load training and validation sets
train_dataset = vision.ImageDataset("unsloth/LaTeX_OCR", image_processor, split='train')
val_dataset = vision.ImageDataset("unsloth/LaTeX_OCR", image_processor, split='test')
# Optionally use a smaller subset for faster iteration
subset_ratio = 0.2
train_dataset = Subset(train_dataset, range(int(subset_ratio * len(train_dataset))))
val_dataset = Subset(val_dataset, range(int(subset_ratio * len(val_dataset))))
At this point, you’d build or reuse the same AnyModal pipeline described earlier. Instead of loading pretrained weights, you would initialize the model so it can learn from scratch or from partially pretrained checkpoints.
multimodal_model = anymodal.MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision.Projector(vision_hidden_size, llm.get_hidden_size(tokenizer, model), num_hidden=1),
    language_tokenizer=tokenizer,
    language_model=model,
    prompt_text="The LaTeX expression of the equation in the image is:"
)
You can then create a training loop to optimize the model’s parameters. A common approach uses PyTorch’s AdamW
optimizer and optionally employs mixed-precision training for efficiency:
from tqdm import tqdm
import torch
optimizer = torch.optim.AdamW(multimodal_model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
num_epochs = 5
multimodal_model.train()  # make sure the model is in training mode
for epoch in range(num_epochs):
    for batch_idx, batch in tqdm(enumerate(train_loader), desc=f"Epoch {epoch+1} Training", total=len(train_loader)):
        optimizer.zero_grad()
        # Forward pass under mixed precision; the model returns logits and the loss
        with torch.cuda.amp.autocast():
            logits, loss = multimodal_model(batch)
        # Scale the loss, backpropagate, and update the parameters
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
After each epoch, or at least once training concludes, evaluating the model on a validation set helps ensure it generalizes to new data:
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=16, shuffle=False)
for batch_idx, batch in enumerate(val_loader):
    # Generate LaTeX for each image in the batch and compare it with the ground truth
    predictions = multimodal_model.generate(batch['input'], max_new_tokens=120)
    for idx, prediction in enumerate(predictions):
        print(f"Actual LaTeX: {batch['text'][idx]}")
        print(f"Generated LaTeX: {prediction}")
In addition to confirming performance, this validation step can guide improvements such as adjusting hyperparameters, switching to a different base model, or refining your data preprocessing. By following these training steps, you gain a better understanding of the interplay between the vision encoder and language model, and you can extend the workflow to additional tasks or more specialized domains.
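Once you are happy with a training run, you will also want to persist the learned weights so the model can be reloaded later, just as the pretrained checkpoint was loaded in the inference section. One lightweight option, assuming MultiModalModel behaves as a standard torch.nn.Module (it already exposes parameters() and eval() above), is to checkpoint its state dict with plain PyTorch; check the AnyModal source for a dedicated save helper that mirrors _load_model:

import torch

# Save a checkpoint of the trained multimodal model (hypothetical path)
checkpoint_path = "latex_ocr_finetuned.pt"
torch.save(multimodal_model.state_dict(), checkpoint_path)

# Later: rebuild the same pipeline (encoder, projector, language model),
# then restore the trained weights before running inference.
state_dict = torch.load(checkpoint_path, map_location="cpu")
multimodal_model.load_state_dict(state_dict)
multimodal_model.eval()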