When the Best Solution Isn’t the Obvious One: A Case Study in Address Parsing

Written by abgarsimonean | Published 2025/12/10
Tech Story Tags: java | nlp | case-study | java-native-interface | software-architecture | data-normalization | software-development | scaling

TL;DR: A simple “deserialize XML into a POJO” task for AML compliance turned into a nightmare of messy, multilingual, unstructured addresses. Regexes, heuristics, and Java libraries all failed. The real solution required integrating a C NLP library into Java via JNI and switching to a better-trained model. Accuracy jumped from ~70% to ~99%, proving that the best solution is rarely the obvious or easiest one.

In one of my previous projects, I had to solve a deceptively simple problem: deserialize and normalize personal data of sanctioned individuals for AML (Anti-Money-Laundering) compliance. The system consumed data from a third-party provider — for NDA reasons, let’s call it the RegulatorAPI.

On paper, the integration looked straightforward. RegulatorAPI exposed a clean endpoint that returned a complete list of sanctioned persons across EU, UK, and US sanction regimes. Our task was to pull this list and transform it into an internal, normalized format optimized for fast searching, indexing, and downstream analytics.

I was responsible for the normalization/deserialization pipeline.

And at first glance, the job seemed trivial:

“Just read the XML and map it into a Java POJO. JAXB, Jackson, whatever — how hard can it be?”

Here’s the kind of XML we received:

<Person>
    <RecordId>REC-123456789</RecordId>
    <FullName>VASILE ALECSANDRI IONESCU</FullName>
    <Aliases>
        <Alias>Vasile Ionescu</Alias>
        <Alias>V. A. Ion</Alias>
    </Aliases>
    <BirthDetails>
        <DateOfBirth>1978-04-12</DateOfBirth>
        <PlaceOfBirth>Cahul, MD</PlaceOfBirth>
    </BirthDetails>
    <Addresses>
        <Address>12 Vasile Alecsandri, Cahul, MD</Address>
        <Address>Strada Tricolorului No.9 Apt 4, Chisinau, Moldova</Address>
    </Addresses>
</Person>

And here is the agreed-upon POJO structure:

public class Person {
  private String recordId;
  private String fullName;
  private List<String> aliases;
  private LocalDate dateOfBirth;
  private String placeOfBirth;
  private List<Address> addresses; // the fun part
}

public class Address {
  private String country;
  private String state;
  private String postalCode;
  private String road; 
  private String houseNumber;
}
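
Deserializing the flat fields really was the easy part. Here is a minimal JAXB sketch of that step (assuming the jakarta.xml.bind namespace; the older javax.xml.bind works the same way). Note that the addresses arrive as raw strings:

import jakarta.xml.bind.JAXBContext;
import jakarta.xml.bind.annotation.*;
import java.io.StringReader;
import java.util.List;

// Sketch: annotate a POJO and let JAXB unmarshal the simple fields.
// BirthDetails would map to a nested class the same way (omitted here).
@XmlRootElement(name = "Person")
@XmlAccessorType(XmlAccessType.FIELD)
public class PersonXml {
    @XmlElement(name = "RecordId") String recordId;
    @XmlElement(name = "FullName") String fullName;

    @XmlElementWrapper(name = "Aliases")
    @XmlElement(name = "Alias") List<String> aliases;

    // The problem child: each <Address> is just an unstructured string.
    @XmlElementWrapper(name = "Addresses")
    @XmlElement(name = "Address") List<String> rawAddresses;

    static PersonXml parse(String xml) throws Exception {
        return (PersonXml) JAXBContext.newInstance(PersonXml.class)
                .createUnmarshaller().unmarshal(new StringReader(xml));
    }
}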

As you can see, the Address object wasn’t meant to be a raw, unstructured string, but a fully parsed, semantically meaningful model. This was not an aesthetic choice — it was a core architectural requirement.

A structured address meant we could:

  • Normalize location components (street, house number, city, country)
  • Index them efficiently in our database and search engine
  • Perform fast, deterministic lookups
  • Support fuzzy matching and duplicate detection
  • Enable geo-based grouping, risk scoring, and batch analytics

And the operational constraints were unforgiving:

the entire platform had an SLA of 200,000 requests per second.

With that kind of throughput, even a single inefficient join or poorly indexed field could collapse the system under load. Everything had to be canonical, searchable, and optimized long before the data reached any hot path.

So the address had to be parsed upfront. Field by field. Deterministically.

Which meant the XML couldn’t simply be deserialized — it had to be interpreted.

And that is precisely where the “simple” problem revealed its real nature. The moment we started receiving actual production data from RegulatorAPI, the address fields turned out to be:

  • unpredictable,
  • inconsistent,
  • multilingual,
  • free-form,
  • and fundamentally incompatible with traditional Java parsing approaches.

What looked like a straightforward POJO-mapping task, solvable with a couple of regex rules, suddenly snowballed into a complicated NLP (natural language processing) problem.

The Search for the Solution

Faced with this messy reality, my first instinct — like any pragmatic engineer — was to look for the simplest fix that could get us 80% of the way there. After all, how hard could it be to parse an address? Surely someone had already solved this.

My search for a solution went through several predictable phases.

1. The Regex Phase (Denial & Bargaining)

Like many engineers in denial, I started by writing a few regular expressions:

  • extract house numbers
  • detect city names
  • split on commas
  • match country codes
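
To give a flavor of that phase, here is a stripped-down sketch of the comma-splitting approach (illustrative only, not our production code):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive pass: split on commas, take the first number as the house number,
// and treat a trailing two-letter token as the country code.
public class RegexAddressSplitter {
    private static final Pattern HOUSE_NUMBER = Pattern.compile("\\b(\\d+)\\b");
    private static final Pattern COUNTRY_CODE = Pattern.compile("\\b([A-Z]{2})\\s*$");

    public static void main(String[] args) {
        String raw = "12 Vasile Alecsandri, Cahul, MD";
        String[] parts = raw.split("\\s*,\\s*"); // assumes commas exist, which is often false

        Matcher house = HOUSE_NUMBER.matcher(parts[0]);
        if (house.find()) {
            System.out.println("houseNumber = " + house.group(1));
            System.out.println("road        = " + parts[0].replace(house.group(1), "").trim());
        }
        Matcher country = COUNTRY_CODE.matcher(raw);
        if (country.find()) {
            System.out.println("country     = " + country.group(1));
        }
        // "Strada Tricolorului No.9 Apt 4, Chisinau, Moldova" already breaks
        // these assumptions: "No.9 Apt 4" yields two candidate numbers, and
        // the country is a full word, not a code.
    }
}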

It worked beautifully — for about five addresses.

Then reality hit:

  • Some addresses had no commas.
  • Some had five commas, for no apparent reason.
  • Some put the house number after the city.
  • Some put the city before the street.
  • ….

Every time I fixed one pattern, I broke two others.

Regex wasn’t the answer. It wasn’t even close.

2. The Heuristics Phase (False Hope)

Next came the heuristics:

  • Normalize known country names
  • Map common city spellings
  • Detect patterns by token position
  • Build a dictionary for common street suffixes
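
A condensed sketch of where that phase ended up (the lookup tables below are hypothetical excerpts; the real ones never stopped growing):

import java.util.Map;
import java.util.Set;

// Heuristic pass: dictionaries for country aliases, city spellings, and
// street suffixes. Every table here is a hypothetical excerpt.
public class HeuristicTables {
    static final Map<String, String> COUNTRY_ALIASES = Map.of(
            "MD", "Moldova",
            "REPUBLIC OF MOLDOVA", "Moldova",
            "UK", "United Kingdom");

    static final Map<String, String> CITY_SPELLINGS = Map.of(
            "KISHINEV", "Chisinau"); // transliteration variant

    static final Set<String> STREET_SUFFIXES = Set.of(
            "STR", "STRADA", "ST", "STREET", "BLVD", "AVE");

    // Guess the country from the last comma-separated chunk, if we know it.
    static String guessCountry(String raw) {
        String[] parts = raw.split("\\s*,\\s*");
        String tail = parts[parts.length - 1].trim().toUpperCase().replace(".", "");
        return COUNTRY_ALIASES.getOrDefault(tail, null);
    }
}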

Heuristics felt promising at first — the classic “just a bit more glue code and this will all work” trap. But heuristics fail for the exact same reason they fail in natural language processing:

language is messy, inconsistent, and full of exceptions.

Addresses are even worse because they’re:

  • partially human-written
  • partially system-generated
  • often translated
  • often abbreviated
  • almost never standardized

After the third day of adding case-specific rules, I realized I was reinventing a brittle, error-prone wheel for a problem someone else had surely already solved more intelligently.

3. The “Let’s Find an API” Phase (Sticker Shock)

I then evaluated external services:

  • Google Maps API
  • Geocoding APIs
  • OpenStreetMap Nominatim
  • LocationIQ
  • Miscellaneous SaaS address normalizers

Some problems:

  • Too slow for our 200k RPS SLA
  • Not free — in fact, extremely expensive at scale
  • Rate limited
  • Not deterministic (same address could yield different structure)
  • Geo-based parsing, not ML-based parsing
  • Some didn’t support certain countries or transliteration quirks

Even if we cached aggressively, the operational cost would have been unacceptable.

External APIs were out.

4. The “Let’s Find a Pure Java Library” Phase (Disillusionment)

I searched Maven Central, GitHub, StackOverflow, Reddit — anywhere someone might have written a robust Java address parser.

I found:

  • libraries for U.S.-only addresses
  • libraries for billing/shipping forms
  • postcode validators
  • postcode datasets
  • partial parsers for specific countries

But nothing remotely capable of parsing global free-form addresses from dozens of jurisdictions in dozens of inconsistent formats.

Nothing close to solving the fundamental NLP problem.

5. Acceptance: This Needs an Actual NLP Model

At this point I realized the truth:

To parse unstructured addresses, I needed something smarter than patterns.

I needed a machine-learned model trained on millions of real-world addresses.

Enter LibPostal.

6. The Eureka Moment: LibPostal

LibPostal is an open-source library developed by OpenVenues.

It's trained on OpenStreetMap data containing millions of addresses from all over the world, in dozens of languages and formats.

It doesn’t guess by rules.

It doesn’t rely on regex.

It doesn’t care about formatting.

It doesn’t care about casing or language.

It uses statistical NLP models to break down an address into:

  • house number
  • road
  • city
  • state
  • postcode
  • country

It was exactly what I needed!

There was just one problem…

  • LibPostal is written in C.
  • Our system was written in Java.

The “solution” I found introduced a new challenge.

Integrating LibPostal into Java (JNI: When You Have No Other Choice)

Fortunately, we weren’t the first to face this pain. An officially supported Java binding already existed: jpostal.

It wasn’t perfect, wasn’t fully documented, and—spoiler alert—wasn’t published to Maven Central.

But it was enough to give us a starting point.

No Maven Central? No Problem (But Not Exactly Fun Either)

Since the libpostal JNI binding wasn’t on Maven Central, we had to manually build and integrate it into our build pipeline.

The process looked something like this:

# Install build prerequisites for libpostal
sudo apt-get install curl autoconf automake libtool pkg-config

# Clone the repos
git clone https://github.com/openvenues/libpostal
git clone https://github.com/openvenues/jpostal

# Configure libpostal
cd libpostal
./bootstrap.sh
./configure --datadir=/root/libpostal-data

# Build and install libpostal
make
sudo make install

# Build the jpostal JNI bindings
cd ../jpostal
./gradlew assemble

After assembling the JNI bindings, the next hurdle was getting everything into a place where Java could actually load it at runtime. JNI is famously unforgiving: if the JVM cannot find the native library, even by a single character mismatch, it will fail loudly and cryptically.

Once ./gradlew assemble completed, we had the generated native artifacts under ./build/libs/jpostal/shared/, typically containing files like:

  • libjpostal.so (Linux)
  • libjpostal.dylib (macOS)
  • libjpostal.jnilib (legacy macOS naming)

The next challenge was to bundle these into our build pipeline and deployment artifacts so the JVM could load them using System.loadLibrary(...).

Making the Native Libraries Available to the JVM

There are two ways to make JNI libraries visible to Java:

Option A — Add the native folder to java.library.path

java -Djava.library.path=/app/native-libs -jar app.jar

This works, but only if you control the runtime environment tightly. (Note that java.library.path is read once at JVM startup, so setting it afterwards via System.setProperty has no effect on System.loadLibrary.)

Option B — Load the library manually from the classpath

This is the more robust approach for production.

We packaged the .so files under: src/main/resources/native/linux/

And wrote a helper that extracts the correct native library at runtime into a temp folder and loads it explicitly:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class NativeLoader {
    public static void load() {
        String os = System.getProperty("os.name").toLowerCase();
        String resourcePath = os.contains("mac")
                ? "/native/macos/libjpostal.dylib"  // macOS path assumed here
                : "/native/linux/libjpostal.so";    // matches the packaging above
        try (InputStream in = NativeLoader.class.getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new IllegalStateException("Native library not on classpath: " + resourcePath);
            }
            // System.loadLibrary cannot see inside a JAR; extract to a temp file first.
            File temp = File.createTempFile("libjpostal", ".tmp");
            temp.deleteOnExit();
            Files.copy(in, temp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            System.load(temp.getAbsolutePath());
        } catch (IOException e) {
            throw new RuntimeException("Unable to load LibPostal JNI library", e);
        }
    }
}

Then the parser initialization simply included:

NativeLoader.load(); 
AddressParser parser = AddressParser.getInstance();
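
From there, mapping the parsed components onto our Address model was simple. A sketch using jpostal's ParsedComponent API (the setters on Address are assumed; labels like house_number and road come from the libpostal model):

import com.mapzen.jpostal.AddressParser;
import com.mapzen.jpostal.ParsedComponent;

public class AddressNormalizer {
    public static Address toAddress(String raw) {
        AddressParser parser = AddressParser.getInstance();
        Address address = new Address(); // setters assumed on the POJO shown earlier

        // Each component carries a label (e.g. "road") and its value.
        for (ParsedComponent c : parser.parseAddress(raw)) {
            switch (c.getLabel()) {
                case "house_number": address.setHouseNumber(c.getValue()); break;
                case "road":         address.setRoad(c.getValue());        break;
                case "state":        address.setState(c.getValue());       break;
                case "postcode":     address.setPostalCode(c.getValue());  break;
                case "country":      address.setCountry(c.getValue());     break;
                default:             break; // "city", "unit", etc. as needed
            }
        }
        return address;
    }
}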

Switching to an Alternative NLP Model

LibPostal also allows swapping in alternatively trained models — and the Senzing-trained model provided a significant boost in accuracy for US, UK, and Singapore addresses.

To use it, we simply pointed LibPostal at a different datadir during configure:

./configure --datadir=/opt/libpostal-senzing-model/

We then confirmed the parser was loading the correct model by inspecting the logs and test outputs.

The results were immediate:

  • better segmentation of complex UK locality strings
  • correct interpretation of US rural routes and PO boxes
  • Singapore unit parsing accuracy jumped dramatically

This turned a “good-enough” solution into a production-grade, high-accuracy parser.

The Results Are In (Drumroll, Please)

When all was said and done, the transformation was dramatic.

Our initial prototype — built with regexes, heuristics, and a growing graveyard of edge-case rules — barely reached 70% accuracy on real production data. And even that came at the cost of:

  • 10–20 custom parsing rules
  • a never-ending cycle of exceptions and patches
  • brittle logic that broke whenever a new address pattern appeared
  • entire categories of addresses (Singapore, UK dependent localities, US rural routes) that simply could not be handled reliably

It was the classic maintenance nightmare waiting to happen.

Once LibPostal (backed by the JNI binding) and the Senzing-trained model were integrated, everything changed.

We re-ran our evaluation dataset — thousands of real addresses from the RegulatorAPI spanning dozens of countries and formats — and the results were near perfect.

Final accuracy: ~99% correct parsing

Lessons Learned: Why the Best Solution Is Rarely the Straightforward One

Looking back, one thing became clear:

the best solution almost never looks like the “simple” solution you expect at first glance.

At the beginning, the task was pitched as: “Just deserialize some XML into a POJO.”

In reality, it turned into a complicated problem that required stepping outside the familiar Java toolbox.

The simple solution — regexes, heuristics, manual rules — felt tempting because it promised quick wins.

But quick wins are often illusions. They work until they don’t. And every exception pushes you further down an unsustainable path.

The real solution required stepping out of the comfort zone:

  • embracing unfamiliar technologies (C, JNI, native compilation)
  • understanding how ML-based address parsing works
  • tuning deployment environments to ship native models
  • replacing hand-written code with probabilistic parsing

It wasn’t the obvious solution.

It wasn’t the easiest solution.

But it was the right solution.

And that’s the core takeaway:

If a problem has been solved globally, statistically, and at scale, don’t reinvent it with regex and glue code. Stand on the shoulders of the right tools, even if they’re written in another language, ecosystem, or paradigm.

This project became a powerful reminder that engineering is not about choosing the shortest path — it’s about choosing the path that leads to a correct system.

Sometimes the only way to achieve that is to abandon the “straightforward” approach and adopt a solution that initially feels more complex — but pays off exponentially in the long run.


Written by abgarsimonean | Passionate and pragmatic Software Engineer at the intersection of backend, data engineering, and developer tooling.