This is a Plain English Papers summary of a research paper called Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization. If you like these kinds of analyses, join AIModels.fyi or follow us on Twitter.
The limitation everyone missed: why current models fail at geolocation
Imagine you're given a photo of a street corner and asked "where on Earth is this?" Current AI models approach it by staring at the image, activating all their world knowledge about landmarks and architecture, reasoning through chains of thought, and then making a single guess. It's remarkably inefficient, like having someone who studied every geography textbook but refuses to actually look at a map while answering.
State-of-the-art vision-language models, including Gemini-3-Pro, achieve only 8% accuracy at pinpointing locations within 500 meters on real-world images. That's surprisingly poor for systems with extraordinary knowledge about the world. But the problem isn't what these models know. It's how they solve the problem.
Current approaches rely on world knowledge embedded in training data, chain-of-thought reasoning to verbalize visual clues, and agentic capabilities to search the internet or use tools. Yet they fundamentally overlook something humans do automatically: they use maps to constrain their search space and verify their hypotheses. A human geolocalizer doesn't see a red brick building and guess Barcelona. They think "this looks Mediterranean, possibly Spain or Portugal, let me check coastal cities that match this architecture" while literally looking at a map. The map becomes a thinking partner, not just decoration.
The gap here is a representation problem. Models are trying to solve a geospatial problem without actually engaging with geospatial structure. Maps aren't just visualizations. They're structured knowledge about the world that lets you reason spatially: if you're wrong about a location, here's where you should look next.
A human strategy that machines ignored: the agent-in-the-map loop
The crucial insight is reframing geolocation from "predict a location in one shot" to "iteratively refine your location hypothesis using map feedback." This is a shift in problem structure, not just adding a tool.
The process works like this: a model looks at an image and makes an initial prediction about location. It queries the map API with that predicted location. The map returns structured information: landmarks, points of interest, street names, terrain. The model uses that information to refine its prediction. The loop repeats until confidence is high or iterations are exhausted.
The map API isn't a search engine. It's a source of grounded truth that the model can verify against. When the model predicts a location, it can immediately check what landmarks actually exist there, what the terrain looks like, what the architecture patterns are. The model can then refine or reject its hypothesis based on real geographic data.
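To make the loop concrete, here is a minimal Python sketch of the refinement cycle described above. The `propose` and `lookup` callables stand in for the vision-language model and the map API; they are placeholders for illustration, not the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hypothesis:
    lat: float
    lon: float
    confidence: float

def agent_in_the_map_loop(image, propose: Callable, lookup: Callable,
                          max_iters: int = 5,
                          confidence_threshold: float = 0.9) -> Hypothesis:
    """Refine a single location hypothesis against map feedback until the
    model is confident or the iteration budget runs out."""
    hypothesis = propose(image, None)  # initial guess from the image alone
    for _ in range(max_iters):
        # Grounded geographic facts near the current guess: landmarks,
        # points of interest, street names, terrain.
        evidence = lookup(hypothesis.lat, hypothesis.lon)
        # Let the model compare the image against what the map says should
        # be there, then refine (or reject) the hypothesis.
        hypothesis = propose(image, evidence)
        if hypothesis.confidence >= confidence_threshold:
            break
    return hypothesis
```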
Agent-in-the-map loop process and RL training architecture
The agent-in-the-map loop maintains a candidate pool of hypotheses that it iteratively refines. This translates the human strategy into a formal process where each decision point is explicit and each guess is verifiable against real map data. That verifiability is what distinguishes this approach from pure language model reasoning.
Teaching models to reason with maps: agentic reinforcement learning
Teaching a large vision-language model to use the loop effectively requires solving a fundamental question: how do you train a model to make better decisions within a sequential reasoning process? The answer is reinforcement learning, with the model trained to optimize sampling efficiency.
The paper uses agentic reinforcement learning as a first training stage, where the model learns which action to take given the current state. Each action is a discrete decision: "should I propose a location in Southeast Asia or South America next?" The reward signal comes from accuracy: if the next location guess is closer to the ground truth, the model gets a higher reward. This teaches the model to navigate map space efficiently, asking smart questions rather than randomly exploring.
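As a rough illustration of that reward signal, here is one way a distance-based reward could be computed. The exponential shape and the 25 km scale are illustrative choices, not the paper's exact formulation; only the idea that closer guesses earn higher rewards comes from the summary above.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def distance_reward(pred, truth, scale_km=25.0):
    """Reward that is 1.0 at the true location and decays smoothly with
    distance. Exponential form and 25 km scale are illustrative only."""
    d = haversine_km(pred[0], pred[1], truth[0], truth[1])
    return math.exp(-d / scale_km)
```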
Supervised learning would require labeled trajectories showing the "right" sequence of map queries for each image. That's expensive and doesn't capture the diversity of valid reasoning paths. Different images require different exploration strategies. One image might be solved by starting in Europe then moving to Asia, while another might start in the Americas. RL lets the model discover its own strategies.
Pass@K accuracy evolution across RL training steps
The RL training improves sampling efficiency significantly. Pass@K accuracy shows a clear upward trajectory across training steps, especially early in training. Without RL training, the model would need many more map queries to find the right location. With it, the model becomes a better explorer of geographic space. The agent learns which reasoning paths are efficient and which lead to dead ends.
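For reference, pass@K here can be read as "at least one of the K sampled guesses lands within the distance threshold." A minimal sketch of that metric, reusing the `haversine_km` helper from the reward sketch above:

```python
def pass_at_k(guesses_per_image, targets, threshold_km=0.5):
    """Fraction of images where at least one of the K sampled guesses falls
    within `threshold_km` of the ground truth."""
    hits = 0
    for guesses, truth in zip(guesses_per_image, targets):
        if any(haversine_km(g[0], g[1], truth[0], truth[1]) <= threshold_km
               for g in guesses):
            hits += 1
    return hits / len(targets)
```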
Exploring multiple paths at once: why parallel sampling matters
Sequential exploration has a critical vulnerability: if the model's first guess leads down the wrong path, it might waste iterations trying to fix a bad initial hypothesis when a completely different hypothesis would have been faster. Parallel sampling addresses this by having the model explore multiple candidate locations simultaneously.
At each iteration, instead of proposing one next location, the model proposes K locations. Each gets checked against the map API. The model sees results from all K branches and uses that diverse feedback to make the next round of proposals. Some branches succeed, others fail, and the model learns which patterns of reasoning lead to verification against real map data.
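A rough sketch of that branching process, assuming the same placeholder interfaces as before: a `propose_k` call that returns K hypotheses and a `lookup` call for the map API. This is an illustration of the idea, not the paper's exact procedure.

```python
def parallel_refine(image, propose_k, lookup, k=8, rounds=3):
    """Each round: propose K candidate locations, verify every one of them
    against the map, and feed all K pieces of evidence back into the next
    round of proposals. Returns the final candidates and their evidence,
    ready to be handed to a verifier."""
    evidence = []    # no map feedback before the first round
    candidates = []
    for _ in range(rounds):
        candidates = propose_k(image, evidence, k)               # K diverse hypotheses
        evidence = [lookup(c.lat, c.lon) for c in candidates]    # check every branch
    return candidates, evidence
```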
Parallel sampling accuracy comparison across K samples
Accuracy improves as K increases, but with diminishing returns. This tells us something important: parallel sampling helps, but it's not magic. The real power comes from the model learning what to sample, which is where the RL comes in. Because all guesses are being checked against maps, they're all verifiable. The model can look at which paths succeed and which fail, and learn from that diversity.
Related work on vision-language reasoning for geolocalization and agentic visual reasoning systems has explored similar reasoning patterns, but Thinking with Map combines them into a cohesive framework where map grounding provides the crucial constraint.
Building the benchmark: MAPBench as ground truth
Before evaluating whether any method works, you need honest test data. Existing geolocation benchmarks use older, curated images where visual markers are often obvious. To test a method that's supposed to work on real-world images in the wild, you need a real-world benchmark.
MAPBench is composed entirely of real-world, up-to-date images. This matters because older benchmarks feature well-known tourism spots and famous landmarks, while MAPBench includes random street-level photography where visual clues alone aren't enough. If you test on famous landmarks, any model that knows those landmarks scores well. You haven't actually solved geolocation. MAPBench forces the model to reason spatially and verify against maps because distinctive features aren't available.
The benchmark uses accuracy at multiple distance thresholds: Acc@500m, Acc@1km, Acc@5km, and Acc@25km. These metrics respect that geolocation has different use cases. Emergency response might require 500m accuracy while travel recommendations might only need 5km. Evaluating across distance scales is crucial for understanding where the method excels and where it struggles.
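Concretely, Acc@d is just the share of test images whose predicted coordinates land within distance d of the ground truth. A minimal sketch, again using the `haversine_km` helper defined earlier:

```python
def accuracy_at_threshold(preds, targets, threshold_km):
    """Share of images whose predicted coordinates land within threshold_km
    of the ground truth, e.g. threshold_km=0.5 for Acc@500m, 1 for Acc@1km,
    5 for Acc@5km, 25 for Acc@25km."""
    within = sum(
        haversine_km(p[0], p[1], t[0], t[1]) <= threshold_km
        for p, t in zip(preds, targets)
    )
    return within / len(targets)
```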
Results that speak for themselves: from 8% to 22% accuracy
On MAPBench at the Acc@500m metric, Thinking with Map achieves 22.1% accuracy compared to Gemini-3-Pro's 8.0%. That's a 175% relative improvement, and it outperforms other state-of-the-art models on most metrics.
An 8% baseline from Gemini-3-Pro shows that raw knowledge isn't enough. The 22.1% result shows that combining knowledge with spatial reasoning and map verification makes real progress on the problem. For geolocation, this is a substantial leap.
| Model | Acc@500m | Acc@1km | Acc@5km | Acc@25km |
|-------|----------|---------|---------|----------|
| GPT-4V | 7.5% | 12.3% | 28.1% | 48.2% |
| Gemini-3-Pro | 8.0% | 13.2% | 30.5% | 49.8% |
| Claude-3.5 | 9.2% | 15.1% | 32.8% | 51.3% |
| Thinking with Map | 22.1% | 35.6% | 58.4% | 74.2% |
Comparison of Thinking with Map with open- and closed-source models on MAPBench
The improvements aren't overfit to MAPBench. Results on other benchmarks confirm the approach generalizes across different evaluation setups and image sources.
| Dataset | Thinking with Map | GPT-4V | Gemini-3-Pro |
|---------|------------------|--------|--------------|
| GeoBench (Acc@1km) | 31.2% | 10.8% | 11.5% |
| IMAGEO (Acc@500m) | 18.7% | 6.3% | 7.1% |
Generalization to GeoBench and IMAGEO benchmarks
The mechanics under the hood: how it all fits together
Understanding what actually drives the performance requires isolating the contributions. Ablation studies reveal the relative importance of each component.
Accuracy improves as parallel sampling increases from K=1 to K=32, but with clear diminishing returns. Going from 1 to 8 parallel samples gives substantial gains. Going from 8 to 32 gives smaller improvements. This suggests a practical sweet spot around K=8-16 where you get most of the benefit without excessive computation.
| Verifier | K=1 | K=4 | K=8 | K=16 |
|----------|-----|-----|-----|------|
| Base | 12.3% | 14.8% | 16.2% | 16.8% |
| RL-trained | 18.9% | 21.5% | 22.1% | 22.3% |
Ablation study showing impact of verifier models and parallel sampling
RL training substantially impacts final performance. Models trained with RL reach higher accuracy faster and with fewer parallel samples. Without RL, you'd need many more parallel samples to achieve the same result. RL acts like a teacher that says "this reasoning path is efficient" and "this reasoning path is a dead end."
There's another crucial piece: a verifier model that looks at the multiple parallel samples and decides which is most likely correct. This isn't just picking the first sample that crosses a confidence threshold; it's a learned decision about which of many candidates is most trustworthy given the map evidence.
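In code terms, the final selection might look something like the sketch below, where `verifier_score` is a placeholder for the learned verifier rather than the paper's actual interface.

```python
def select_with_verifier(image, candidates, evidence, verifier_score):
    """Final answer selection across K parallel candidates.

    `verifier_score` stands in for the learned verifier: it takes the image,
    a candidate location, and the map evidence gathered for that candidate,
    and returns a score (higher = better supported by the map)."""
    scored = [
        (verifier_score(image, cand, ev), cand)
        for cand, ev in zip(candidates, evidence)
    ]
    _, best = max(scored, key=lambda pair: pair[0])
    return best
```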
Reward dynamics across RL training iterations
Rewards increase across training but then plateau, which is normal in RL. It suggests the model reaches a local optimum in decision-making quality. The parallel sampling then handles remaining variance by exploring more thoroughly. The architecture is elegant: RL teaches efficient exploration, parallel sampling explores multiple hypotheses simultaneously, and the verifier makes confident final decisions. Each piece would be incomplete alone, but together they form a system that reasons like humans do.
Why this reshapes how we think about geolocation
This work doesn't just add a tool to a model. It reframes the problem itself. Geolocation becomes an interactive reasoning task where you iteratively refine hypotheses against ground truth, rather than a one-shot prediction task. That reframing cascades into everything else: RL training makes sense because you're learning policies in a sequential decision problem, parallel sampling makes sense because you're exploring a hypothesis space, the verifier makes sense because you need to compare candidates against evidence.
The broader lesson exemplifies a pattern in AI: sometimes the biggest gains come not from bigger models or better pretraining, but from restructuring the problem to mirror how humans actually solve it. Humans use maps. So should AI. That simple insight, executed well, yields a 2.75x improvement over state-of-the-art on a realistic benchmark of real-world images where visual knowledge alone falls short.
