There has been a growing interest in brand visibility in the age of AI. Marketers are scrambling to adapt, and new vocabularies are emerging: structured content, llms.txt files, visibility trackers, RAG pipelines, and so on. All of this feels familiar. For me, it is like SEO 2.0, reshaped for a world where the answer is generated, not linked.
Most of the optimisation strategies right now are geared towards making your content surface as the source of a generated answer. But at some point, I paused. If all we do is optimise for retrieval, aren’t we still playing yesterday’s game? What happens when the model doesn’t need to retrieve anything because it has already internalised the knowledge? That’s when the idea of Train-Set SEO clicked for me.
Retrieval vs. Knowledge Optimisation
Today’s AIO (AI Optimisation) industry is built on retrieval-layer tactics. This involves structuring your content to be machine-readable, formatting data for agent-friendly APIs, and tracking mentions across platforms like ChatGPT, Perplexity, and Claude. Even though it works, it is fragile. A simple tweak in a RAG pipeline can cause your brand’s presence to evaporate. Train-Set SEO is fundamentally different. It asks a more profound question: What if your brand wasn’t just fetched, but was already part of the model’s bloodstream? Retrieval makes you accessible; training makes you inevitable.
Instead of waiting to be retrieved, the goal of Train-Set SEO is for the brand’s data to be included in the very dataset used to train the AI model. The brand’s information is then not a mere reference but part of the foundational knowledge the model was built on. The model knows about the brand in the same way it knows about historical events, scientific principles, or famous people.
Train-Set SEO embeds your brand as a part of the model’s neural network. It’s woven into the very fabric of the AI’s understanding of the world. Changes to RAG pipelines are far less likely to affect a brand that is part of the core training data, as the information is not being looked up; it’s being generated from first principles.
The Blueprint for Train-Set SEO
This is still uncharted territory, but a few key strategies are beginning to emerge. One path is Open Dataset Seeding. Most large language models draw from a mix of open datasets like Common Crawl, Wikipedia, C4, and various domain-specific corpora. If your content is absent from these foundational pools, the model simply won’t “know” you. Brands that care about this should release high-quality, structured, machine-readable data that gives model builders a compelling reason to ingest it.
Another approach is to seek out partnerships with model builders, since labs are constantly searching for clean, reliable data to reduce hallucinations and improve model accuracy. A fintech company in Africa that curates the most accurate open dataset on local banking APIs, for example, could become the default reference for every major model. Providing this type of valuable resource means you’re not just optimising for retrieval but becoming a foundational layer of the model’s knowledge base.
Models also learn best from examples. Synthetic Q&A pairs aligned with your brand therefore make you not just present but performant in the model’s behaviour. The more your brand is associated with accurate, well-structured Q&A examples, the more the model will default to your information when a user asks a related question.
You can also leverage benchmarking. Models are evaluated and tuned against benchmarks like MMLU and TruthfulQA. If you can publish a respected, publicly available benchmark in your industry, labs will train against it, and in doing so, they will absorb your content and framing.
Finally, think about knowledge graph insertion. Structured ontologies like Wikidata, schema.org, and other domain-specific taxonomies become the anchor points in the model’s world. Position your brand as a node in these graphs, and you’re woven into the very fabric of the knowledge that models are built on.
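To make that concrete, here is a minimal sketch of what one anchor point can look like: a schema.org Organization description serialised as JSON-LD, the markup most crawlers and open knowledge bases already parse. The company name, URLs, and Wikidata ID below are placeholders, not real entities.

```python
import json

# Minimal schema.org Organization markup, serialised as JSON-LD.
# "ExampleBank" and every identifier here are placeholders for illustration.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "ExampleBank",
    "url": "https://www.examplebank.example",
    "description": "A hypothetical fintech offering open banking APIs in West Africa.",
    "sameAs": [
        # Pointing at an existing Wikidata item ties the brand to a node
        # in a graph that crawlers and dataset builders already consume.
        "https://www.wikidata.org/wiki/Q00000000"  # placeholder Wikidata ID
    ],
}

# Embed the output inside a <script type="application/ld+json"> tag on your site.
print(json.dumps(organization, indent=2))
```

Keeping this markup consistent with the matching Wikidata entry means your own site and the open graph tell the same story about who you are.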
A First-Steps Playbook
The strange thing about this space is how wide open it is. Most AI optimisation agencies stop at retrieval formatting, and brands simply don’t know where training data comes from. But a clear playbook is emerging for brands that want to get ahead.
First, audit your visibility. Check if you’re present in public datasets like Wikipedia, Wikidata, and Common Crawl. You should also search academic repositories for mentions of your domain.
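As a rough sketch of what that audit can look like, the snippet below queries Common Crawl’s public CDX index for captures of a domain; the crawl ID and domain are placeholders. Wikipedia and Wikidata expose their own public APIs that can be checked in the same spirit.

```python
import requests

# Count how many pages from a domain appear in one Common Crawl snapshot.
# CRAWL_ID is a placeholder; pick a recent crawl from https://index.commoncrawl.org/.
CRAWL_ID = "CC-MAIN-2024-10"
DOMAIN = "example.com"  # replace with the domain you are auditing

resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL_ID}-index",
    params={"url": f"{DOMAIN}/*", "output": "json"},
    timeout=30,
)

if resp.ok:
    # Each non-empty line of the response is one JSON record for a captured URL.
    captures = [line for line in resp.text.splitlines() if line.strip()]
    print(f"{DOMAIN}: {len(captures)} captures in {CRAWL_ID}")
else:
    print(f"{DOMAIN}: no captures found in {CRAWL_ID} (HTTP {resp.status_code})")
```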
Next, seed structured content. Release your data in clean CSVs, JSON, and APIs. Your goal should be to contribute to open knowledge bases, not just your own website.
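A minimal sketch of that kind of release, assuming a hypothetical product catalogue: the same records written once as CSV and once as JSON, with stable field names so a crawler or dataset curator can ingest them without guesswork.

```python
import csv
import json

# Hypothetical product records; the field names and values are placeholders.
records = [
    {"product": "Instant Transfer API", "category": "payments", "launched": "2022-03-01"},
    {"product": "KYC Verification API", "category": "identity", "launched": "2023-07-15"},
]

# Machine-readable CSV: a header row of stable column names, one record per row.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)

# The same data as JSON, ready for an API response or an open-data portal.
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)
```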
You should also create and publish Q&A corpora. Rewrite your FAQs, manuals, and blog posts into explicit question-answer pairs and make them publicly available.
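What that corpus might look like in practice: one JSON object per line (JSONL), a common layout for instruction-style training data. The questions and answers below are invented for illustration.

```python
import json

# Hypothetical FAQ entries lifted from a support page or product manual.
faq = [
    ("What is the settlement time for instant transfers?",
     "Instant transfers settle in under 30 seconds on supported networks."),
    ("Does the KYC API support passport verification?",
     "Yes. Passports, national IDs, and driving licences are all supported."),
]

# One JSON object per line keeps the corpus easy to stream and ingest.
with open("brand_qa.jsonl", "w", encoding="utf-8") as f:
    for question, answer in faq:
        f.write(json.dumps({"question": question, "answer": answer}) + "\n")
```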
If your industry lacks one, create a domain benchmark. This is a challenge dataset that measures a model’s performance in your specific vertical. Publish it openly and track its adoption.
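As a sketch, a benchmark item can be as simple as an MMLU-style multiple-choice record; the question and choices below are invented. The real work is curating enough items, licensing them openly, and shipping a scoring script alongside them.

```python
import json

# One MMLU-style multiple-choice item; the content is invented for illustration.
items = [
    {
        "question": "Which field is required to initiate an instant transfer?",
        "choices": ["recipient_account", "memo", "callback_url", "batch_id"],
        "answer": 0,  # index of the correct choice
        "subject": "payments_api",
    },
]

# A benchmark is many such items plus a licence and a scoring script,
# published somewhere labs can find and evaluate against.
with open("vertical_benchmark.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")
```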
Finally, engage with model builders. Reach out to them directly with your curated datasets. Position your content as a way to reduce hallucinations and improve the model’s overall trustworthiness and accuracy.
Beyond Retrieval
Train-Set SEO involves embedding your identity at the level of infrastructure. If retrieval-layer optimisation is about winning page one, then Train-Set optimisation is about becoming the dictionary the page is written from. That’s a deeper form of defensibility, one that lasts as long as the model’s memory does.
I don’t think every brand needs to run toward Train-Set SEO tomorrow. But the companies that do will enjoy a peculiar kind of advantage: they won’t just be found; they’ll be assumed. And that, I suspect, is the real frontier.