Enhancing Data Preparation With AI for Business Intelligence

Written by cleanlab | Published 2023/11/07
Tech Story Tags: data-analytics | business-intelligence | artificial-intelligence | no-code | opensource | machine-learning | data-preparation | good-company

TL;DR: In data analytics and business intelligence, data teams build the solutions business users need while engineering teams build the underlying data infrastructure. Analysts preparing these solutions must pull data from diverse sources and sanitize it for querying, a step handled by data preparation tools. Data-centric AI practices can automate this cleansing step, letting you export a cleaner version of a dataset with minimal effort. The article also explains why maintaining data quality is critical for effective data analytics, and introduces data-centric AI as the discipline of systematically engineering the data used to build an AI system.

In the world of data analytics and business intelligence, data teams, sometimes called the "purple teams," build the solutions business users (red) need and work with the engineering teams (blue) that build the underlying data infrastructure.

BI teams predominantly work on building flows or pipelines that deliver reports and essential dashboards for business user consumption.

Many new-generation tools help data teams build these end-user solutions, such as Mode, Superset, and Lightdash, alongside industry leaders that have been in the data analytics space for a while, such as Tableau and Power BI.

Analysts building these solutions must prepare their data from diverse sources, ensuring the data is sanitized for querying. This cleansing step in the workflow, performed with a dedicated set of tools or transformations, is commonly called "data prep."

With the advent of large language models, AI has become a common topic across the software engineering stack. But what if we could use data-centric AI practices to automate the data cleansing step itself, enabling you to export a cleaner version of the dataset with minimal effort?

In this blog, we will discuss how data-centric AI can help you easily prepare your data for BI tools so that you can draw reliable conclusions from your subsequent data analysis.

Data Analyst Workflow

Several years ago, data analysts had to manually collect, clean, and analyze data, which was a time-consuming process that limited their ability to gain valuable insights.

Today, the data analysis landscape has undergone a significant transformation with the introduction of data preparation tools such as Alteryx, Tableau, etc.

These efficient tools have simplified the workflow, enabling analysts to seamlessly integrate data from multiple sources, automate data cleaning tasks, and generate visually appealing and insightful representations of data.

Data Analysis After Manual Data Preparation


Data prepped with these tools is then analyzed in BI tools to answer specific business questions.

For instance, consider this dataset of customer requests within a bank where customers log issues they are encountering in a customer service portal, which a human or automated task manager then labels.

Imagine a business analyst needs to determine the number of customer requests for each issue category. Below is the result they would see, with the beneficiary_not_allowed category showing 111 customer issues.

Similarly, if an analyst wants to find out how many issues mention the word "ATM," a quick analysis would return the visual representation below. Notice the number of issues for the change_pin category.
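For readers who prefer working in code, here is a minimal pandas sketch of both queries. The file name requests.csv and the column names text and label are assumptions for illustration; substitute your own dataset and schema.

```python
import pandas as pd

# Load the customer-request dataset (file name and columns are assumed).
df = pd.read_csv("requests.csv")  # columns: "text", "label"

# Query 1: number of customer requests per issue category.
counts_per_category = df["label"].value_counts()
print(counts_per_category.head())

# Query 2: how many requests mention the word "ATM" (case-insensitive),
# broken down by category.
atm_requests = df[df["text"].str.contains("ATM", case=False, na=False)]
print(atm_requests["label"].value_counts())
```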

It looks simple and straightforward, but if you dig deeper into the dataset, you will find that the categorization of customer requests is wrong in a few cases.

For example:

| Text | Label (as per dataset) | Label (ideally) |
| --- | --- | --- |
| My card is almost expired. How fast will I get a new one, and what is the cost? | apple_pay_or_google_pay | card_about_to_expire |

Real-world data, for the most part, is messy and unstructured, which makes it hard to derive reliable insights through statistics. Since we want humans and machines to make data-driven decisions, it is critical that the data be well-labeled, free of erroneous records, and de-duplicated.

Data-Centric AI


It's crucial to ensure that the data used in analyses is accurate, up-to-date, and free from duplicates. Failure to do so can result in incorrect decisions and conclusions. For instance, an empty location field in user profile data or inconsistent formatting of the location field can lead to errors. Therefore, maintaining data quality is critical for effective Data Analytics.
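As a quick illustration of such checks, the sketch below flags empty location fields, inconsistent formatting, and duplicate rows in a hypothetical user-profile table; the column names and values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical user-profile data; column names are for illustration only.
profiles = pd.DataFrame({
    "user_id": [1, 2, 3, 3],
    "location": ["New York, NY", "", "new york ,ny", "new york ,ny"],
})

# 1. Empty location fields.
missing_location = profiles["location"].str.strip() == ""
print("Rows with empty location:", missing_location.sum())

# 2. Inconsistent formatting: normalize case and spacing before comparing.
normalized = (
    profiles["location"]
    .str.lower()
    .str.replace(r"\s*,\s*", ", ", regex=True)
    .str.strip()
)
print("Distinct raw values:", profiles["location"].nunique(),
      "vs normalized:", normalized.nunique())

# 3. Exact duplicate rows.
print("Duplicate rows:", profiles.duplicated().sum())
```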

Data-centric AI is the discipline of systematically engineering the data used to build an AI system. Most real-world data is unstructured or mislabeled. A quality dataset with correctly labeled training data leads to a more effective model that can predict better outcomes.

Better outcomes yield a better customer experience. To learn more, you can refer to the Data-centric AI course from MIT.

Introducing Cleanlab


Cleanlab is an open-source project that helps you clean data and labels by automatically detecting issues in a dataset. Cleanlab uses confident learning, an approach to estimating uncertainty in dataset labels described in a paper by Curtis Northcutt (also a co-founder of Cleanlab.ai) and others.

In short, Cleanlab enhances a data analysis workflow by bringing AI into the data preparation step.
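To make this concrete, here is a minimal sketch using the open-source cleanlab package to flag likely label errors in the customer-request data. It assumes you already have the texts and integer-encoded labels in memory and obtain out-of-sample predicted probabilities from some classifier; the TF-IDF plus logistic regression pipeline is illustrative, not prescriptive.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# texts: list of customer requests, labels: integer-encoded categories (assumed inputs).
features = TfidfVectorizer().fit_transform(texts)

# Out-of-sample predicted probabilities via cross-validation,
# which confident learning expects.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    features,
    labels,
    cv=5,
    method="predict_proba",
)

# Indices of examples whose given label is likely wrong,
# ranked by cleanlab's self-confidence score.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"Found {len(issue_indices)} potential label issues")
```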

Auto-Clean Your Data Using Cleanlab Studio

Cleanlab Studio is a no-code tool built on top of the open-source Cleanlab package; it helps with prepping data for an analysis workflow. You can also import data from data warehouses like Databricks and Snowflake, or from cloud object stores like AWS S3.

Step 1:

Sign up for access to Cleanlab Studio.

You will land on a dashboard with some sample datasets and projects.

Step 2:

Click on “Upload Dataset” to initiate the upload wizard. You can upload the dataset from your computer, a URL, the API, or a data warehouse like Databricks or Snowflake.
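If you prefer a programmatic route, an upload via Cleanlab Studio's Python client looks roughly like the sketch below. The package, class, and method names (cleanlab_studio, Studio, upload_dataset) are assumptions based on a typical client library and should be checked against the current Cleanlab Studio documentation; the API key placeholder is yours to fill in.

```python
# Programmatic upload via Cleanlab Studio's Python client.
# Package/method names (cleanlab_studio, Studio, upload_dataset) are assumed;
# confirm against the official docs before use.
import pandas as pd
from cleanlab_studio import Studio

df = pd.read_csv("requests.csv")           # the same customer-request dataset
studio = Studio("<YOUR_API_KEY>")          # API key from your Studio account
dataset_id = studio.upload_dataset(df, dataset_name="bank-customer-requests")
print("Uploaded dataset:", dataset_id)
```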

Cleanlab Studio automatically infers your data schema and modality, i.e., text, image, voice, or tabular.

Once you confirm the details, you will be shown a screen with the uploaded dataset and associated errors (if any!) encountered while uploading the data.

Note: Some datasets might take a few minutes to upload. Cleanlab Studio will inform you by email once the dataset is fully uploaded.

Step 3:

Based on the type of dataset, you can use a specific machine-learning task to identify problems with the data. Currently, Cleanlab Studio supports several ML classification tasks related to text, tabular, and image data.

For classification specifically, each example can belong to exactly one of K classes (multi-class) or to anywhere from one to N of the K classes (multi-label). In this dataset, each customer request falls under exactly one category, so it is a "Multi-Class" classification task.
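The distinction is easiest to see in how the labels are stored. The toy snippet below is illustrative only; the second structure does not apply to this banking dataset and is shown purely for contrast.

```python
# Multi-class: each request carries exactly one of the K category labels.
multi_class_labels = [
    "card_about_to_expire",
    "change_pin",
    "beneficiary_not_allowed",
]

# Multi-label: each request could carry one or more of the K labels
# (hypothetical example, shown only for contrast with the case above).
multi_label_labels = [
    ["card_about_to_expire"],
    ["change_pin", "atm_support"],
]
```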

Cleanlab Studio will auto-detect the text and label columns. You can correct them if needed.

Fast models might not produce the best results, but in the interest of time, choosing "Fast" is an option.

Hit “Clean my data!”

Step 4:

Cleanlab Studio runs an ensemble of models on the dataset and presents an issue overview!

As pointed out earlier, the dataset contains miscategorized records and outliers, which add noise rather than value to the overall decision-making process when analyzed.

You can also take a look at meta-analytics of the issues Cleanlab Studio identified in the dataset by switching to the analytics view at the top.

Step 5:

The most interesting part of Cleanlab Studio is not just that it exports a cleaned dataset, but that it offers an issue-oriented view of your data: the data prep workbench that data analysts and business intelligence users have wanted for years.

You can work through each issue using the keyboard-assisted actions provided in Cleanlab Studio, or export a cleaned dataset by clicking the "Export Cleanset" button.

Data Analysis After AI-Assisted Data Preparation


Let us examine the same data analysis with the cleaned dataset.

It appears that there are discrepancies in the counts for the cancel_transfer and visa_or_mastercard categories. While this is a small dataset, it's important to note that at a larger scale such corrections could lead to significantly different estimates and, in turn, different business decisions.

Similarly, you will find that customer requests for some categories disappear as their issues are relabeled appropriately.
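If you want to quantify the shift yourself, a simple before/after comparison of category counts looks like this; the exported file name cleanset.csv and the column names are assumptions for illustration.

```python
import pandas as pd

# Original dataset and the cleaned export from Cleanlab Studio
# (file and column names are assumed for illustration).
before = pd.read_csv("requests.csv")    # columns: "text", "label"
after = pd.read_csv("cleanset.csv")     # same schema, corrected labels

comparison = pd.DataFrame({
    "before": before["label"].value_counts(),
    "after": after["label"].value_counts(),
}).fillna(0).astype(int)
comparison["delta"] = comparison["after"] - comparison["before"]

# Categories whose counts changed the most after cleaning.
print(comparison.sort_values("delta", key=abs, ascending=False).head(10))
```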

If you're a data analyst or part of the business intelligence community, Cleanlab Studio can revolutionize your data preparation workflow. Try Cleanlab Studio today, and experience the power of AI-assisted data cleaning for more reliable and accurate data analysis.

Conclusion

Cleanlab Studio is a no-code Data Preparation workbench used by thousands of engineers, analysts, and data scientists at Fortune 500 companies. This innovative platform was pioneered at MIT to train more reliable and accurate Machine Learning models using real-world, erroneous data. You can join our Slack Community for more information.


Written by cleanlab | Cleanlab increases the value of your datasets via open-source AI that automatically finds and fixes data issues
Published by HackerNoon on 2023/11/07