The client for whom I was doing project consulting had an ad-hoc requirement to extract the names of doctors and hospitals from a couple of spreadsheets that contain information about various hospitals and medical centers. This data was crucial for them to identify the doctors and hospitals/clinics that their clients frequent to access client medical records from them. So even though most of my experience was with wrangling data in a relational database, I decided to give it a go as most of the team was having limited bandwidth at the time.
However, there was one, very sticky issue, data consistency. There are variations in the way the names are written. Some names are prefixed with titles, such as "Dr." or suffixed with a degree such as “MD”, while others are not. Some had their specializations written after their name while most didn’t. Some names include initials or middle names, while others do not. Additionally, some strings contain multiple doctor names, while others mention only one. It is a similar case with the hospital names, some are separated with commas, sometimes with slashes or a single slash, and in some cases, both the doctor and hospital names are simply written together without any separator.
Finally, it is important to ensure that the hospital and medical center names are not extracted as doctor names (or vice versa), which requires distinguishing between them based on their respective patterns and characteristics.
I felt the easiest way to approach this problem is to use regular expressions. Regular expressions are powerful tools for pattern matching and can be used to define a pattern that matches the doctor names in the strings. The pattern can be defined based on the common characteristics of doctor names, such as capitalization, the presence of a title (e.g., Dr.), and the use of initials or middle names. In this solution, I defined a regular expression pattern that matches doctor names based on these characteristics.
However, this does not address the issue of the solution misidentifying a hospital as a physician or vice versa. Therefore it was clear that additionally, I would have to use some kind of natural language processing (NLP) tools such as Spacy or NLTK to be able to differentiate between hospitals and doctors as the health care provider information. Using these tools, hopefully, the solution would be able to parse the strings and identify the names of persons, including the doctors and not mistake them for the organizations i.e. the hospital or clinic names.
The tools should also identify other relevant information such as titles (e.g., Dr.) and professions. In this solution, I used Spacy's “en_core_web_trf
” model (available in huggingface repository) to parse the strings and identify the names of persons and professions. Although the code that ChatGPT generated used these libraries, it did not provide any information on how to install and execute these libraries. Perhaps it overestimated my capacity as a data analyst!! However, upon further querying ChatGPT, it gave me proper instructions and also occasionally used an archaic tool called Google
(remember that one?), I could get over these obstacles.
In the final solution, I combined these approaches to extract the doctor names from the strings. First, I used regular expressions to match the doctor names based on their common characteristics, such as capitalization and the presence of a title. Then I used Spacy to identify any additional doctor names that were missed by the regular expression matching and to identify any relevant titles or professions.
I had to do some additional coding, such as opening the spreadsheet and reading the data, writing to an output file, etc. But those are simple enough for any entry-level programmer to execute. Hope you find my exploration into ChatGPT as a data analyst tool helpful!
ChatGPT was used to enhance code queries and a Grammarly plugin was used for editing.
Lead image generated with Stable Diffusion 2.1 using the prompt: “Data extraction”.