The AI revolution was decades in the making. It was a field filled with excitement, yet often punctuated by disappointments and “AI winters.” But recently, something shifted. Large Language Models (LLMs) like ChatGPT, Claude, and Bard catapulted AI from laboratory curiosity to the mainstream.
This shift wasn't solely a triumph of AI but also a victory over the intricacies of large and messy data. As the saying goes, “garbage in, garbage out.” New tools are emerging that focus on improving the underlying data, therefore improving LLMs.
The term "Large Language Models" holds within it two great challenges. First, the sheer volume of data. We're talking upwards of a petabyte (a million gigabytes) of data for GPT-4, encompassing millions of books, blogs, social media posts, video transcripts, and more. This colossal scale offers vast potential but also poses significant logistical considerations.
Second, the complexity of natural language. Context-dependent, ambiguous, and diverse, language data is a wild beast that even the best algorithms struggle to tame. It’s impossible to accurately label all this data, which inevitably means that even state-of-the-art LLMs are trained on tons of incorrectly-labeled data.
In facing these challenges, new data-centric tools and methodologies emerged, enabling a true leap in what AI is capable of. Solutions like Cleanlab and others began to offer ways to collect diverse data, automate quality control, and process language into a form suitable for AI models.
These tools did not merely offer incremental improvements; they fundamentally reshaped the approach to AI data handling. They transformed the task of handling large-scale language data from a manual, error-prone process into an automated, precise one, democratizing the field and enabling advancements at an unprecedented pace.
In AI, real-world datasets contain annotation errors ranging from 7-50%. These imperfections significantly hamper training and evaluation. Data-centric AI emphasizes improving the quality of the dataset itself.
OpenAI's strategy, for instance, illustrates this emphasis: “We prioritized filtering out all of the bad data over leaving in all of the good data. This is because we can always fine-tune our model with more data later to teach it new things, but it’s much harder to make the model forget something that it has already learned.”
An approach of manually filtering data, however, is time-consuming and expensive. The Cleanlab package is an open-source framework popular for practicing data-centric AI today. It allows you to run data quality algorithms on your trained ML model's outputs to detect common dataset issues like label errors, outliers, drift, and more.
With just a few lines of code, you can automatically find and identify problems in various types of data, such as image, text, tabular, and audio. By using the Cleanlab package, you can decide how to improve your dataset and model, re-train your ML model, and see its performance improve without any changes to your existing code.
Cleanlab Studio, on the other hand, is more than just an extension of the Cleanlab package; it's a no-code platform designed to find and fix problems in real-world datasets. It doesn't just stop at detecting issues but goes further in handling data curation and correction, and even automates almost all the hard parts of turning raw data into reliable ML or Analytics.
Let’s use the Cleanlab package to demonstrate the power of data-centric AI.
We start with the Stanford Politeness Dataset. Ensure you have the train and test sets loaded. In this demo, we’ll fine-tune the Davinci LLM for 3-class classification, first without Cleanlab, and then see how we can improve accuracy with data-centricity. We can run a simple bash command to train a model.
!openai api fine_tunes.create -t "train_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "baseline"
When that’s done, we can query a fine_tunes.results
endpoint to see the test accuracy.
!openai api fine_tunes.results -i ft-9800F2gcVNzyMdTLKcMqAtJ5 > baseline.csv
`df = pd.read_csv('baseline.csv')
baseline_acc = df.iloc[-1]['classification/accuracy']`
We get a result of 63% accuracy. Let’s see if we can improve this.
Now, let’s use OpenAI's API to compute embeddings and fit a logistic regression model to obtain out-of-sample predicted class probabilities.
# Get embeddings from OpenAI. from openai.embeddings_utils import get_embedding
embedding_model = "text-similarity-davinci-001" train["embedding"] = train.prompt.apply(lambda x: get_embedding(x, engine=embedding_model)) embeddings = train["embedding"].values
# Get out-of-sample predicted class probabilities via cross-validation.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression() labels = train["completion"].values pred_probs = cross_val_predict(estimator=model, X=embeddings, y=labels, cv=10, method="predict_proba")
With just one line of code, Cleanlab estimates which examples have label issues in our training dataset.
from cleanlab.filter import find_label_issues
Now we can get indices of examples estimated to have label issues:
issue_idx = find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence') # sort indices by likelihood of label error
Now, we’ve automatically extracted the indices of potentially mislabeled examples, so we can remove them and train a new classifier.
# Remove the label errors
train_cl = train.drop(issue_idx).reset_index(drop=True) format_data(train_cl, "train_cl.jsonl")
Now let’s train a more robust classifier with better data.
!openai api fine_tunes.create -t "train_cl_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "dropped"
# Evaluate model on test data
!openai api fine_tunes.results -i ft-InhTRQGu11gIDlVJUt0LYbEx > cleanlab.csv df = pd.read_csv('cleanlab.csv') dropped_acc = df.iloc[-1]['classification/accuracy']
We get an accuracy of over 66%, improving a state-of-the-art fine-tunable model (GPT-3, as you can’t fine-tune GPT-4), merely by automatically improving the dataset, without any change to the model.
With Cleanlab Studio, it’s also possible to automatically fix the incorrect labels instead of just removing them outright, improving accuracy even further. A guide by Cleanlab shows that this takes accuracy up to 77%.
Using data-centric tools like Cleanlab, you can efficiently find and fix data and label issues, leading to significant improvements in the performance of LLMs like Davinci. This approach does not alter the model architecture or hyperparameters and focuses only on enhancing the quality of the training data.
The approach outlined in this guide could be the key to unlocking even greater accuracy and robustness in AI models, even with future advanced LLMs like GPT-5.