How This Open-Source AI Simplifies Mapping Healthcare Data

Written by nlp | Published 2026/02/10

TLDR: Mapping medical terms to OMOP standard concepts is complex, often requiring time-consuming manual review. Llettuce, an open-source AI tool, automates this process using large language models, fuzzy matching, and database queries. Running locally, it preserves GDPR compliance while delivering high accuracy in converting informal patient-reported terms into standardized medical concepts.

Authors:

(1) James Mitchell-White, Centre for Health Informatics, School of Medicine, The University of Nottingham, Digital Research Service, The University of Nottingham, and NIHR Nottingham Biomedical Research Centre;

(2) Reza Omdivar, Digital Research Service, The University of Nottingham, and NIHR Nottingham Biomedical Research Centre;

(3) Esmond Urwin, Centre for Health Informatics, School of Medicine, The University of Nottingham and NIHR Nottingham Biomedical Research Centre;

(4) Karthikeyan Sivakumar, Digital Research Service, The University of Nottingham;

(5) Ruizhe Li, NIHR Nottingham Biomedical Research Centre and School of Computer Science, The University of Nottingham;

(6) Andy Rae, Centre for Health Informatics, School of Medicine, The University of Nottingham;

(7) Xiaoyan Wang, Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore;

(8) Theresia Mina, Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore;

(9) John Chambers, Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore and Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, United Kingdom;

(10) Grazziela Figueredo, Centre for Health Informatics, School of Medicine, The University of Nottingham and NIHR Nottingham Biomedical Research Centre;

(11) Philip R Quinlan, Centre for Health Informatics, School of Medicine, The University of Nottingham.

  1. Abstract and Introduction

  2. System Architecture

    2.1 Access via UI or HTTP

    2.1.1 GUI

    2.2 Input

    2.3 Natural Language Processing Pipeline — The Llettuce API

    2.3.1 Vector search

    2.3.2 LLM

    2.3.3 Concept Matches

    2.4 Output

  3. Case Study: Medication Dataset

    3.1 Data Description

    3.2 Experimental Design

    3.3 Results

    3.3.1 Comparison between vector search and Usagi

    3.3.2 Comparison with GPT-3

    3.4 Conclusions & Acknowledgement

    3.5 References

Abstract

This paper introduces Llettuce, an open-source tool designed to address the complexities of converting medical terms into OMOP standard concepts. Unlike existing solutions such as the Athena database search and Usagi, which struggle with semantic nuances and require substantial manual input, Llettuce leverages advanced natural language processing, including large language models and fuzzy matching, to automate and enhance the mapping process. Developed with a focus on GDPR compliance, Llettuce can be deployed locally, ensuring data protection while maintaining high performance in converting informal medical terms to standardised concepts.

Keywords: OMOP mapping, LLMs, healthcare data mapping, natural language processing in healthcare data

1. Introduction

The conversion of medical terms to Observational Medical Outcomes Partnership (OMOP) (OHDSI, 2024b) standard concepts is an important part of making data findable, accessible, interoperable, and reusable (FAIR) (Wilkinson et al., 2016). Unified data standards are often applied inconsistently across healthcare systems (Cholan et al., 2022; F. et al., 2022), and standardising to a common data model (CDM), such as OMOP, is fundamental to enabling robust research pipelines for cohort discovery and to ensuring reliable and reproducible evidence. The process of converting data to OMOP, however, is complex: it requires not only knowledge of the specific domain of the data, but often also collaboration between data engineers, software engineers, and healthcare professionals.

In previous work, we developed Carrot Mapper (Cox et al., 2024) and Carrot-CDM (Appleby et al., 2023) to support the OMOP conversion process. Tooling within this space still requires manual intervention to approve or create mappings, where a data engineer needs to find the most suitable codification for a term. Solutions that help to find codifications include searches in the Observational Health Data Sciences and Informatics (OHDSI) Athena database (OHDSI, 2024a) and string matching using tools such as Usagi (OHDSI, 2021).

The Athena website is a platform for searching and exploring various medical terminologies, vocabularies, and concepts in healthcare research. Users can search for specific terms, view their relationships, and explore detailed metadata. Using Athena search at scale, however, is complicated. When conducting extensive searches, researchers face challenges, including the complexity and overlap of medical vocabularies, the overwhelming volume of search results, and technical constraints, such as system performance and data handling capabilities. Additionally, the standardisation of diverse healthcare datasets presents difficulties in ensuring consistency across different terminologies.

Usagi was developed by OHDSI to facilitate the mapping of source codes to standard concepts within the OMOP CDM. It supports the integration and harmonisation of diverse healthcare data sources. Usagi employs semi-automated string-matching algorithms to suggest potential mappings between local vocabularies and standardised terminologies such as SNOMED CT, LOINC, and RxNorm. It is a valuable tool for mapping, but it has a few limitations. While it automates part of the mapping process, it requires significant manual review, which is time-consuming and prone to human error and uncertainties. String matching can lead to inaccurate mappings, particularly when dealing with ambiguous or complex terminologies. The effectiveness of Usagi depends on the quality of the standardised vocabularies it uses, and there is a learning curve for new users. As a standalone tool, Usagi does not yet integrate seamlessly with other data processing workflows, so additional steps are needed to configure both input and output and thus ensure proper data standardisation. By contrast, novel tools can provide an application programming interface (API) for integration into mapping tools.

Both Athena and Usagi work well when dealing with data containing typographical errors. Informal terms for medications or conditions, however, may not closely match the string of the formal concept we wish to map them to. For example, “Now Foods omega-3” is a supplement found in a self-reported patient questionnaire dataset. This supplement is produced by Now Foods, and is an omega-3 product derived from fish oil. In this case, the brand of the drug was given as input. Before obtaining the OMOP concept, we need to map the reported brand to “omega-3 fatty acids”, for which an exact OMOP match is found. Using the Athena search engine, for example, the string matching suggests concepts like “Calcium ascorbate 550 MG Oral Tablet by Now Foods”, “Ubiquinone MG Oral Capsule [Now Coq10] by Now Foods” or “Calcium ascorbate 1000 MG Oral Tablet [Now Ester-C] by Now Foods”. This indicates that string matching loses the semantic information associated with the input data.
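The loss of semantic information can be illustrated with a small sketch. It scores the candidate names quoted above against the input using Jaccard overlap of lowercase tokens. This is a deliberately simple stand-in for lexical matching in general, and it is an assumption for illustration only: it is not the algorithm that Athena or Usagi actually use, and the helper names `tokens` and `jaccard` are hypothetical.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens; hyphenated terms such as 'omega-3' stay whole."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Share of tokens the two strings have in common (0.0 means none)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

query = "Now Foods omega-3"

# Candidate names taken from the example in the text, plus the
# semantically correct target "Fish oil".
candidates = [
    "Calcium ascorbate 550 MG Oral Tablet by Now Foods",
    "Ubiquinone MG Oral Capsule [Now Coq10] by Now Foods",
    "Calcium ascorbate 1000 MG Oral Tablet [Now Ester-C] by Now Foods",
    "Fish oil",
]

# Rank candidates by lexical overlap with the query, best first.
ranked = sorted(
    ((name, jaccard(query, name)) for name in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{score:.2f}  {name}")
```

In this toy ranking, every product that shares the brand tokens "Now" and "Foods" scores above zero, while "Fish oil" shares no token with the input and lands last. A purely lexical matcher has nothing to go on for the correct answer.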

Large language models (LLMs) are a relatively novel alternative for supporting OMOP mapping. They automate portions of the mapping process while suggesting more semantically relevant mappings. The use of proprietary tools such as OpenAI ChatGPT (OpenAI, 2024) in healthcare, however, raises significant concerns, particularly regarding GDPR compliance, data protection and reproducibility of results (Deng et al., 2024; Nazi & Peng, 2024). The handling of sensitive patient data poses risks, as inadvertent data leaks or misuse of information could occur. Ensuring that interactions with OpenAI and other cloud-hosted LLM APIs remain within the bounds of GDPR is challenging, especially when dealing with identifiable health information.

In this paper we introduce Llettuce[1], a tool created to address these gaps. It is a standalone, open-source, adaptable natural language processing tool based on Large Language Models, querying systems and fuzzy matching for the conversion of medical terms into the OMOP standard vocabulary. This first version is released under the MIT Licence. Medical terms can be extracted from Electronic Health Records (EHRs), self-reported patient questionnaires and other structured datasets to serve as an input for Llettuce. So for the example above, the Llettuce match output for “Now Foods omega-3” is “Fish oil”.

Llettuce has the following modules and functionalities:

• Vector search for concept(s)

• LLM prompting with informal name(s)

• OMOP CDM database search

• A graphic user interface
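As a rough illustration of how two of these modules could fit together, the sketch below chains an LLM step and a fuzzy database lookup. Everything in it is an assumption made for illustration: the function names are hypothetical and are not the Llettuce API, the LLM is replaced by a canned lookup so the example runs offline, the concept table is a toy with placeholder IDs rather than real OMOP concept IDs, and the vector search stage is omitted. Fuzzy matching uses `difflib.get_close_matches` from the Python standard library.

```python
from difflib import get_close_matches

def suggest_formal_name(informal_name: str) -> str:
    """Hypothetical stand-in for the LLM step.

    A real deployment would prompt a locally hosted model; here a canned
    answer is returned so that the sketch is deterministic.
    """
    canned = {"now foods omega-3": "fish oil"}
    return canned.get(informal_name.lower(), informal_name)

# Toy concept table. The IDs are placeholders, NOT real OMOP concept IDs.
TOY_CONCEPTS = {"Fish oil": 1, "Calcium ascorbate": 2, "Ubiquinone": 3}

def match_concepts(name: str, concepts: dict[str, int], cutoff: float = 0.8):
    """Fuzzy-match a name against concept names, tolerating small typos."""
    by_lower = {concept.lower(): concept for concept in concepts}
    hits = get_close_matches(name.lower(), list(by_lower), n=3, cutoff=cutoff)
    return [(by_lower[hit], concepts[by_lower[hit]]) for hit in hits]

def map_term(informal_name: str):
    """Informal name -> suggested formal name -> matching concepts."""
    formal_name = suggest_formal_name(informal_name)
    return match_concepts(formal_name, TOY_CONCEPTS)

print(map_term("Now Foods omega-3"))
```

The design point is the division of labour: the language model supplies the semantic step (brand to substance), and the fuzzy lookup only has to bridge small spelling differences between the suggested name and the vocabulary.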

We demonstrate how Llettuce works and compare its performance with Usagi and ChatGPT in a case study of converting self-reported informal medication names into OMOP concepts. Llettuce's performance is comparable to that of OpenAI models, and it was developed to run locally to support healthcare data governance requirements.

This paper is available on arXiv under a CC BY 4.0 license.

*Corresponding author: [email protected]

[1] https://github.com/Health-Informatics-UoN/lettuce


Published by HackerNoon on 2026/02/10