There are various approaches to training AI models and developing AI-specific applications. In this tutorial, I’ll guide you through the process of building AI models using an image recognition dataset with high-quality labels. We’ll cover straightforward steps, including dataset preparation, preprocessing, and splitting methods, making it easy to follow along.
Breaking down complex AI concepts into simple, understandable terms is challenging, but by the end of this tutorial, we believe you'll grasp how labeled datasets impact model performance in tasks like NLP, image recognition, and recommendation systems.
The primary programming language used in this tutorial is Python. To keep the tutorial concise, I've uploaded the code with helpful comments on GitHub, where you can download, modify, or reuse it. In this guide, we will highlight the key sections of the code and their outputs.
Please note that AI model training, labeling, and other tasks require computing power, time, and internet bandwidth. If you have limited resources, stick to a minimal configuration (e.g., smaller image sizes and fewer training epochs).
Alright, let’s get started!
We’ll use the Kaggle Weather Image Recognition dataset (You can also utilize well-organized datasets from services like Bright Data. For instance, you can test your custom model using their Creative Commons Images dataset.)
Before starting, ensure your local environment is ready for data preparation and visualization.
pip install numpy pandas matplotlib seaborn scikit-learn opencv-python tensorflow keras
Download the Dataset: Go to the Kaggle dataset page (Weather Dataset), download the dataset, and unzip it into a working directory, e.g., weather_dataset/.
Download or clone the GitHub repository containing all the scripts. Make sure they are all in your project directory, because for most of this article we will refer to the scripts by name only.
git clone https://github.com/Vijan45/AI-Weather-Image-Recognition.git
You can load and inspect the dataset using Python to understand its structure and contents. Let’s create the Python script ourselves or use the downloaded file Explore_Dataset.py, which is as follows:
import os
import matplotlib.pyplot as plt
import cv2
import numpy as np

dataset_path = "weather_dataset/"
categories = os.listdir(dataset_path)
print("Categories:", categories)

# Show one sample image per category
for category in categories:
    category_path = os.path.join(dataset_path, category)
    sample_image = os.listdir(category_path)[0]
    img = cv2.imread(os.path.join(category_path, sample_image))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; convert for matplotlib
    plt.figure()
    plt.imshow(img)
    plt.title(category)
    plt.axis('off')
    plt.show()
Run the python Explore_Dataset.py command in the terminal and you will see the following output:
Categories: ['dew', 'fogsmog', 'frost', 'glaze', 'hail', 'lightning', 'rain', 'rainbow', 'rime', 'sandstorm', 'snow']
We need to ensure uniformity across the dataset, so we preprocess every image with two adjustments: resizing each image to a fixed 128x128 resolution and normalizing pixel values to the [0, 1] range. Check the Python script Preprocessing_Image.py, which is as follows:
import os
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

# Define categories and dataset path
categories = ['dew', 'fogsmog', 'frost', 'glaze', 'hail', 'lightning', 'rain', 'rainbow', 'rime', 'sandstorm', 'snow']
dataset_path = r'C:\Users\induction\Documents\1AAA23\weather_dataset'  # Use your own directory

# Function to preprocess images: resize to a uniform size and normalize pixel values
def preprocess_images(dataset_path, img_size=(128, 128)):
    data = []
    labels = []
    for label, category in enumerate(categories):
        category_path = os.path.join(dataset_path, category)
        for file in os.listdir(category_path):
            file_path = os.path.join(category_path, file)
            img = cv2.imread(file_path)
            if img is not None:  # Check if the image is loaded correctly
                img = cv2.resize(img, img_size)
                img = img / 255.0  # Normalize pixel values to [0, 1]
                data.append(img)
                labels.append(label)
    return np.array(data), np.array(labels)

# Preprocess images to get data and labels
data, labels = preprocess_images(dataset_path)

# Save the processed data and labels
np.save('data.npy', data)
np.save('labels.npy', labels)
print("Data and labels saved successfully.")
After executing the script, you will get two files, data.npy and labels.npy, which are needed to split the data in the next step. You will see the following message in the terminal:
Data and labels saved successfully.
We can now divide the dataset into training, validation, and test sets for AI model training and evaluation. You can do it by running python Splitting_Dataset.py directly, or just add the following code to the Preprocessing_Image.py script above:
# Split the data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(data, labels, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Print the shapes of the splits
print(f"Training set: {X_train.shape}, {y_train.shape}")
print(f"Validation set: {X_val.shape}, {y_val.shape}")
print(f"Test set: {X_test.shape}, {y_test.shape}")
After executing the Python code, you will see the shapes of the training, validation, and test sets printed in the terminal.
Next, we get an idea of the distribution of classes in the dataset to detect any imbalances. You can do it just by running python Visualizing_Data_Distribution.py, which produces the following bar graph for a detailed look:
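If you want to see how such a plot can be produced, here is a minimal sketch of what Visualizing_Data_Distribution.py might contain; it assumes the labels.npy file saved earlier, and the version in the repository may differ in details:

import numpy as np
import matplotlib.pyplot as plt

categories = ['dew', 'fogsmog', 'frost', 'glaze', 'hail', 'lightning', 'rain', 'rainbow', 'rime', 'sandstorm', 'snow']
labels = np.load('labels.npy')  # Labels saved by Preprocessing_Image.py

# Count how many images belong to each class
counts = np.bincount(labels, minlength=len(categories))

plt.figure(figsize=(10, 5))
plt.bar(categories, counts)
plt.xticks(rotation=45)
plt.xlabel('Weather category')
plt.ylabel('Number of images')
plt.title('Class Distribution')
plt.tight_layout()
plt.show()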
Here you can see that the graph exhibits an imbalance in the dataset, which could hurt the performance of machine learning (ML) models trained on this data. So, we attempt to address this imbalance through resampling, which could improve model performance.
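As a hedged illustration of what resampling can look like (this helper is our own sketch, not a script from the repository), naive oversampling duplicates images from minority classes until every class matches the largest one:

import numpy as np
from sklearn.utils import resample

def oversample_to_max(data, labels):
    # Upsample every class (with replacement) to the size of the largest class
    counts = np.bincount(labels)
    target = counts.max()
    data_parts, label_parts = [], []
    for cls in range(len(counts)):
        cls_data = data[labels == cls]
        cls_resampled = resample(cls_data, replace=True, n_samples=target, random_state=42)
        data_parts.append(cls_resampled)
        label_parts.append(np.full(target, cls))
    return np.concatenate(data_parts), np.concatenate(label_parts)

# data and labels come from the preprocessing step; shuffle before training
data_balanced, labels_balanced = oversample_to_max(data, labels)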
To classify images, we are going to build a “Convolutional Neural Network (CNN)” and train it using a pre-labeled weather dataset. We will visualize the model's performance through accuracy and loss plots and evaluate its effectiveness on the test set. It will allow us to assess the model's ability to accurately classify different weather-related images. So, let’s see.
We'll use TensorFlow and Keras to define a simple CNN architecture. Check the following code in Building_CNN_Model.py:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

categories = ['dew', 'fogsmog', 'frost', 'glaze', 'hail', 'lightning', 'rain', 'rainbow', 'rime', 'sandstorm', 'snow']

# Define CNN Model
def create_cnn_model(input_shape, num_classes):
    model = Sequential([
        # Convolutional Layers
        Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        MaxPooling2D((2, 2)),
        Dropout(0.2),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Dropout(0.3),
        # Flattening and Dense Layers
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.4),
        Dense(num_classes, activation='softmax')  # Softmax for multiclass classification
    ])
    return model

# Train the model
def train_model(X_train, y_train, input_shape=(128, 128, 3), num_classes=len(categories), epochs=20, batch_size=32):
    model = create_cnn_model(input_shape, num_classes)
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2, verbose=1)
    return model

# Evaluate the model
def evaluate_model(model, X_test, y_test):
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
    return accuracy

# Optional: Summary of the model
if __name__ == "__main__":
    input_shape = (128, 128, 3)  # Image dimensions from the preprocessing step
    num_classes = len(categories)  # Number of weather categories
    model = create_cnn_model(input_shape, num_classes)
    model.summary()
Execute the code, and you will see the model summary in the terminal:
Now we train the CNN on the pre-labeled weather dataset (which we have already prepared in the previous steps). Run python Training_Model.py in the terminal. Please note the epochs in the following part of the code:
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=20,  # You can adjust the number of epochs
    batch_size=32,
    verbose=1
)
Please note that the more epochs you train for, the longer the processing time will be, so you can lower the number for testing purposes.
We can visualize training and validation accuracy/loss to evaluate model learning. Just execute the Visualizing_Model_Performance.py script and you will see the following graph:
As we can see in the graph, accuracy is plotted over 20 epochs for both the training and validation sets. The training accuracy (blue line) steadily increases and reaches above 0.9, while the validation accuracy (orange line) fluctuates around 0.7, which indicates overfitting. So, we still need to address overfitting to improve the model’s performance.
So our next task is to evaluate the trained model on the test dataset. For the evaluation, execute the script Evaluating_Model.py in the terminal.
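At its core, the evaluation boils down to a single model.evaluate call; here is a minimal sketch, assuming the model and test split from the previous steps (the repository script may print more detail):

# Evaluate the trained CNN on the held-out test set
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")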
We can visualize predictions by running python Visualizing_Predictions.py, which shows some test images along with their predicted and true labels. Remember to adjust the epochs as per your needs here too. Check the graph below:
As you can see, the graph shows the training and validation loss over 20 epochs.
The training loss consistently decreases, which means the model is learning well. However, the validation loss starts to increase after around 10 epochs, which again indicates overfitting. Accurate class labels initially improve the model's generalization, but addressing overfitting is important for maintaining performance on unseen data.
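One common remedy, shown here as a sketch rather than a script from the repository, is early stopping: halt training once the validation loss stops improving and keep the best weights:

from tensorflow.keras.callbacks import EarlyStopping

# Stop training when validation loss has not improved for 3 epochs
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=20,
    batch_size=32,
    callbacks=[early_stop],
    verbose=1
)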
We will now explore reusing the weather dataset for various AI tasks, such as Natural Language Processing (NLP), multilabel classification, and hierarchical categorization. We will utilize techniques to reapply the dataset with new labeling schemes, which helps us avoid creating new datasets from scratch.
Pre-labeled datasets can be adapted for NLP tasks by pairing images with descriptive text. For the weather dataset, check the following part of the Descriptive_Text_Labels.py script:
import matplotlib.pyplot as plt

# X_train, y_train, and categories come from the earlier preprocessing and splitting steps

# Create a dictionary for class descriptions
class_descriptions = {
    "dew": "A layer of water droplets that forms on cool surfaces overnight.",
    "fogsmog": "A thick, cloud-like mass near the ground, reducing visibility.",
    "frost": "A thin, icy coating that forms on surfaces during cold conditions.",
    "glaze": "A smooth layer of ice covering surfaces due to freezing rain.",
    "hail": "Small, round ice pellets that fall during intense storms.",
    "lightning": "A bright flash of light caused by an electrical discharge during storms.",
    "rain": "Water droplets falling from clouds to the ground.",
    "rainbow": "A colorful arc of light formed after rain, caused by refraction.",
    "rime": "A frost-like deposit of ice crystals formed in freezing fog.",
    "sandstorm": "A cloud of sand particles carried by strong winds in arid regions.",
    "snow": "Soft, white flakes of frozen water vapor falling from the sky."
}

# Example: Pair an image with its description
example_image = X_train[0]
example_label = y_train[0]
example_description = class_descriptions[categories[example_label]]

print(f"Label: {categories[example_label]}")
print(f"Description: {example_description}")

plt.imshow(example_image)
plt.axis('off')
plt.show()
Execute the script Descriptive_Text_Labels.py through the terminal. The result will amaze you 😊
We classified the label as “hail” and the description was “Small, round ice pellets that fall during intense storms.” The image aligns with the text description. This pairing technique creates a multimodal dataset for tasks like training a model to generate captions for weather images or fine-tuning models like OpenAI’s CLIP for cross-modal matching.
Certain weather conditions may co-occur (e.g., rain with lightning). We can re-label the dataset for multilabel classification. In this case, we use a binary vector to represent the presence or absence of each weather condition. Let’s see an example of multilabel encoding in the following part of the Multilabel_Encoding.py script:
from sklearn.preprocessing import MultiLabelBinarizer

categories = ['dew', 'fogsmog', 'frost', 'glaze', 'hail', 'lightning', 'rain', 'rainbow', 'rime', 'sandstorm', 'snow']

# Simulated multilabels for 10 images (e.g., rain with lightning, fog with frost, etc.)
multilabels = [
    ["rain", "lightning"],
    ["fogsmog", "frost"],
    ["hail"],
    ["rainbow", "rain"],
    ["sandstorm"],
    ["snow"],
    ["dew", "rime"],
    ["fogsmog"],
    ["lightning"],
    ["glaze", "frost"]
]

# Convert to binary vector representation
mlb = MultiLabelBinarizer(classes=categories)
binary_labels = mlb.fit_transform(multilabels)

# Display the binary labels
print("Multilabel Binarized Encoding:")
for i, label in enumerate(binary_labels):
    print(f"Image {i + 1}: {label}")
Once you execute the script, you will get the following output for the respective images in the terminal:
Multilabel Binarized Encoding:
Image 1: [0 0 0 0 0 1 1 0 0 0 0]
Image 2: [0 1 1 0 0 0 0 0 0 0 0]
Image 3: [0 0 0 0 1 0 0 0 0 0 0]
Image 4: [0 0 0 0 0 0 1 1 0 0 0]
Image 5: [0 0 0 0 0 0 0 0 0 1 0]
Image 6: [0 0 0 0 0 0 0 0 0 0 1]
Image 7: [1 0 0 0 0 0 0 0 1 0 0]
Image 8: [0 1 0 0 0 0 0 0 0 0 0]
Image 9: [0 0 0 0 0 1 0 0 0 0 0]
Image 10: [0 0 1 1 0 0 0 0 0 0 0]
The Multilabel_Encoding.py script converted co-occurring weather conditions like rain with lightning or fog with frost into binary label vectors for each image.
The dataset can now train a model that predicts multiple labels for a single image using binary cross-entropy loss and a sigmoid activation function in the output layer.
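As a hedged sketch of what that change looks like (a hypothetical model, not part of the repository scripts), only the output layer and loss differ from our earlier CNN:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

categories = ['dew', 'fogsmog', 'frost', 'glaze', 'hail', 'lightning', 'rain', 'rainbow', 'rime', 'sandstorm', 'snow']

multilabel_model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(len(categories), activation='sigmoid')  # Sigmoid: each label is predicted independently
])
multilabel_model.compile(optimizer='adam',
                         loss='binary_crossentropy',  # Binary cross-entropy per label
                         metrics=['accuracy'])
# Train with the binary_labels produced by MultiLabelBinarizer as targets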
Weather phenomena can be grouped hierarchically. Let’s see the following two levels:
• Level 1: General weather conditions (e.g., "precipitation", "visibility").
• Level 2: Specific subcategories (e.g., "rain", "snow", "fog").
Let’s see this example of how we can create hierarchical labels in the script Hierarchical_Labeling_Scheme.py:
import numpy as np
from sklearn.model_selection import train_test_split
from Preprocessing_Image import preprocess_images  # Import the preprocessing function from earlier

# Define hierarchical structure
hierarchy = {
    "precipitation": ["rain", "snow", "hail"],
    "visibility": ["fogsmog", "sandstorm"],
    "ice": ["frost", "rime", "glaze"],
    "optical": ["lightning", "rainbow"],
    "dew": ["dew"]
}

categories = ['dew', 'fogsmog', 'frost', 'glaze', 'hail', 'lightning', 'rain', 'rainbow', 'rime', 'sandstorm', 'snow']
dataset_path = r'C:\Users\induction\Documents\1AAA23\weather_dataset'  # Use your own directory
data, labels = preprocess_images(dataset_path)

# Split data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(data, labels, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Assign hierarchical (parent, subcategory) labels
hierarchical_labels = []
for label in y_train:
    subcategory = categories[label]
    for parent, children in hierarchy.items():
        if subcategory in children:
            hierarchical_labels.append((parent, subcategory))
            break

# Example hierarchical label
print(f"Image 1: {hierarchical_labels[0]}")
Once you execute the above script, you will see the following as an output:
Image 1: ('precipitation', 'hail')
So, hierarchical labeling allows training hierarchical classifiers using tree-based approaches and fine-grained analysis within parent categories.
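For instance, here is a small follow-up sketch (our own, not from the repository) that collapses the hierarchical labels to their level-1 parents so a coarse classifier can be trained first:

import numpy as np

# hierarchy and hierarchical_labels come from Hierarchical_Labeling_Scheme.py above
parents = sorted(hierarchy.keys())
parent_index = {p: i for i, p in enumerate(parents)}
parent_labels = np.array([parent_index[parent] for parent, _ in hierarchical_labels])

print("Level-1 classes:", parents)
print("First few level-1 labels:", parent_labels[:5])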
There are clear benefits to reusing a pre-labeled dataset. Visualizing dataset reuse can show how many tasks were achieved using the same data, and how task-specific performance improves thanks to well-curated labels.
High-quality labeled data directly affects model accuracy. To illustrate, let us compare a model trained on clean labels against one trained on deliberately noised labels. Check the High_Quality_Labels.py script:
import random
from sklearn.model_selection import train_test_split
from Preprocessing_Image import preprocess_images
from Building_CNN_Model import train_model, evaluate_model

categories = ['dew', 'fogsmog', 'frost', 'glaze', 'hail', 'lightning', 'rain', 'rainbow', 'rime', 'sandstorm', 'snow']
dataset_path = r'C:\Users\induction\Documents\1AAA23\weather_dataset'

# Preprocess images and split the dataset
data, labels = preprocess_images(dataset_path)
X_train, X_temp, y_train, y_temp = train_test_split(data, labels, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Simulate noisy labels: randomly corrupt roughly 20% of the training labels
noisy_labels = y_train.copy()
for i in range(int(0.2 * len(noisy_labels))):
    noisy_labels[random.randint(0, len(noisy_labels) - 1)] = random.randint(0, len(categories) - 1)

# Train one model on clean labels and one on noisy labels
clean_model = train_model(X_train, y_train)
noisy_model = train_model(X_train, noisy_labels)

# Evaluate both models on the same test set
clean_accuracy = evaluate_model(clean_model, X_test, y_test)
noisy_accuracy = evaluate_model(noisy_model, X_test, y_test)

print(f"Accuracy with Clean Labels: {clean_accuracy:.2f}")
print(f"Accuracy with Noisy Labels: {noisy_accuracy:.2f}")
Check the comments in the code to understand how the clean and noisy training runs are set up. Upon execution of the code, you will see the following output:
Accuracy with Clean Labels: 0.70
Accuracy with Noisy Labels: 0.64
The results "Accuracy with Clean Labels: 0.70" and "Accuracy with Noisy Labels: 0.64" indicate that the model was trained with high quality. The clean labels achieved an accuracy of 70%, while the model trained with noisy labels achieved a lower accuracy of 64%. It has proved the importance of high-quality labeled data in improving the accuracy and performance of AI models.
In NLP, labeled datasets like question-answer pairs or sentiment-labeled texts shape language understanding models (e.g., GPT, BERT) and sentiment analysis systems. Sentiment analysis is a good example. Suppose we create a weather-based sentiment analysis dataset in which weather-related sentences are labeled as positive or negative. Even minor labeling errors (e.g., misclassifying "I love rain" as negative) can mislead the model.
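To make the idea concrete, here is a tiny, purely illustrative weather sentiment dataset (the sentences and labels are hypothetical examples, not data from the repository):

# Hypothetical weather-related sentences paired with sentiment labels
sentiment_data = [
    ("I love rain", "positive"),
    ("This fog is depressing", "negative"),
    ("What a beautiful rainbow!", "positive"),
    ("Hail ruined my car", "negative"),
]
for text, label in sentiment_data:
    print(f"{label:>8}: {text}")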
To visualize NLP task accuracy, you can run python NLP_Task_Accuracy.py in the terminal, which exhibits the following graph:
So, it is clear that clean labels significantly improve sentiment analysis accuracy because clean labels achieve around 0.9 accuracy compared to 0.7 with noisy labels.
Image recognition models (like CNNs) rely heavily on clean and consistent labels. Mislabeling even a small percentage of the training data can cause confusion between visually similar classes and degrade generalization on unseen data.
If we talk about real-world use cases, weather datasets enhance autonomous vehicles' ability to detect rain, fog, or snow, ensuring safe navigation. They also improve environmental monitoring by accurately predicting weather trends using satellite imagery.
So, high-quality data is very important for recommendation systems in e-commerce or streaming platforms, as they rely on labeled user preferences like product ratings. For example, using a weather dataset, a system could suggest indoor activities for rainy days or winter gear for snowy conditions. Accurate labels improve personalized suggestions and reduce irrelevant recommendations.
High-quality, pre-labeled datasets can also be reused with different labeling schemes, for example through augmentation. Here, we can go for the “Augmented Weather Dataset”. Please check Reuse_Pre-Labeled_Datasets.py and execute it to analyze the output:
Training data shape: (50000, 32, 32, 3)
Testing data shape: (10000, 32, 32, 3)
It suggests that the training dataset has 50,000 images and the testing dataset has 10,000 images, both with dimensions 32x32x3. So, labeled data can be reused for multiple tasks with creative labeling schemes.
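If you want to experiment with augmentation yourself, here is a hedged sketch using Keras' ImageDataGenerator on the arrays saved during preprocessing; the repository script may take a different approach:

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

data = np.load('data.npy')
labels = np.load('labels.npy')

# Generate augmented variants of the labeled images on the fly
augmenter = ImageDataGenerator(
    rotation_range=15,     # Small random rotations
    horizontal_flip=True,  # Random horizontal flips
    zoom_range=0.1         # Slight random zoom
)
batches = augmenter.flow(data, labels, batch_size=32)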
In this section, we will focus on constructing a pipeline to evaluate reused pre-labeled datasets for different AI tasks. We will use advanced visualization techniques to assess label quality and model performance. Additionally, we will implement metrics for consistency and task adaptation to ensure the effectiveness of our approach. So, this is an important part for us.
To evaluate how pre-labeled datasets perform in varied tasks, we will build an evaluation pipeline, visualize the results with confusion matrices and precision-recall curves, and quantify cross-task consistency and reusability.
Pipeline Overview
Now, you can head straight to the Pipeline_Implementation.py script and check the comments in the code to understand its functionality.
Confusion matrices help identify misclassifications across classes. You can open Confusion_Matrix_Visualization.py and execute it to get the following output:
The diagonal cells represent correct predictions, while off-diagonal cells indicate misclassifications. The model performs well in predicting "fogsmog" with 785 correct predictions and "rain" with 787 correct predictions. However, it struggles with distinguishing between "dew" and "sandstorm," with 135 instances of "dew" misclassified as "sandstorm." Additionally, there are notable misclassifications between "frost" and "glaze," with 175 instances of "frost" predicted as "glaze." The model also has difficulty with "rime" and "fogsmog," with 151 instances of "rime" predicted as "fogsmog." This confusion matrix helps us evaluate the model's performance and identify areas for improvement. You can also see the following details in the terminal:
Classification Report:
              precision    recall  f1-score   support

         Dew       0.63      0.70      0.66      1000
     fogsmog       0.75      0.79      0.76      1000
       frost       0.51      0.35      0.41      1000
       glaze       0.49      0.43      0.46      1000
        hail       0.54      0.57      0.55      1000
   lightning       0.60      0.48      0.53      1000
        rain       0.65      0.79      0.71      1000
     rainbow       0.64      0.74      0.69      1000
        rime       0.68      0.76      0.71      1000
   sandstorm       0.73      0.68      0.71      1000

    accuracy                           0.63     10000
   macro avg       0.62      0.63      0.62     10000
weighted avg       0.62      0.63      0.62     10000
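For reference, here is a minimal sketch of how such a matrix and report can be produced with scikit-learn and seaborn, assuming the model and test split from earlier (Confusion_Matrix_Visualization.py in the repository may differ):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report

# Predicted class = index of the highest softmax probability
y_pred = np.argmax(model.predict(X_test), axis=1)

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', xticklabels=categories, yticklabels=categories)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=categories))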
Precision and recall are important for imbalanced datasets. You can run python Precision-Recall_Curves.py to get the following output:
As we can see, the Precision-Recall curves provide a detailed evaluation of the model's performance across different weather categories. Categories like "rain" and "lightning" show high precision but lower recall, which indicates accurate predictions but missed instances. "Fogsmog" and "snow" have balanced precision and recall, which suggests consistent performance. "Dew" and "rime" exhibit lower precision but higher recall, which means frequent but less accurate predictions. The model struggles with "glaze" and "frost," where both precision and recall are low (areas for improvement).
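Here is a hedged sketch of how per-class precision-recall curves can be generated, assuming the model, test split, and categories from earlier (the repository's Precision-Recall_Curves.py may differ):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.preprocessing import label_binarize

y_scores = model.predict(X_test)  # Per-class probabilities
y_true = label_binarize(y_test, classes=range(len(categories)))

plt.figure(figsize=(10, 7))
for i, category in enumerate(categories):
    precision, recall, _ = precision_recall_curve(y_true[:, i], y_scores[:, i])
    plt.plot(recall, precision, label=category)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Per-Class Precision-Recall Curves')
plt.legend()
plt.show()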
We analyze label distribution to ensure balanced training. For this, run python Label_Distribution.py to get the following graph:
Here, we supply categories = ['dew', 'fogsmog', 'frost', 'glaze', 'hail', 'lightning', 'rain', 'rainbow', 'rime', 'sandstorm', 'snow'] with corresponding indices from 0 to 10. This distribution is important for ensuring balanced training data, which helps build a more accurate and unbiased machine-learning model.
To quantify how well a dataset adapts across tasks, we use metrics such as an Adaptability Index (the improvement in accuracy over a per-task baseline) and a reusability assessment.
Now run python Cross-Task_Consistency.py in the terminal. Also, make sure to replace the hard-coded accuracy numbers with actual metrics from your model evaluations. In our case, it exhibited the following output:
Adaptability Index: {'Multiclass': 0.7, 'Multilabel': 0.56, 'Image-Text': 0.6399999999999999}
So, the Adaptability Index values indicate the improvement in accuracy over the baseline for each task: the Multiclass task shows a 70% improvement, the Multilabel task a 56% improvement, and the Image-Text task approximately a 64% improvement.
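For clarity, one way such an index could be computed is as the raw accuracy gain over a per-task baseline; the sketch below is our own, and every number in it is a placeholder to be replaced with your real metrics:

# Placeholder baseline and achieved accuracies (replace with real metrics)
baseline_accuracy = {'Multiclass': 0.09, 'Multilabel': 0.20, 'Image-Text': 0.15}
achieved_accuracy = {'Multiclass': 0.79, 'Multilabel': 0.76, 'Image-Text': 0.79}

adaptability_index = {
    task: achieved_accuracy[task] - baseline_accuracy[task]
    for task in achieved_accuracy
}
print("Adaptability Index:", adaptability_index)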
Reusability measures how well a dataset supports different labeling schemes. For this, run python Reusability_Assessment.py in the terminal which generates the following graphs:
The Multiclass task has the highest Adaptability Index (approximately 0.7) and the highest accuracy (close to 0.8), which indicates it adapts well and performs consistently. The Multilabel and Image-Text tasks have lower Adaptability Index values (around 0.5 and 0.6, respectively) and similar accuracy levels (around 0.75), which suggests they perform well but with less adaptability compared to the Multiclass task. This analysis gives us clues about the strengths and weaknesses of each task, and points to further improvements needed in model training and data labeling.
The world is already embracing AI trends, and in this tutorial, we explored the process of preparing and preprocessing the "Weather Image Recognition" dataset, which is crucial for training models. Our study highlighted the importance of accurate labeling, which significantly improved the performance of a Convolutional Neural Network (CNN). We also demonstrated the versatility of pre-labeled datasets, showing how they can be repurposed for various tasks like Natural Language Processing (NLP) and multilabel classification.

High-quality labels not only enhanced model accuracy by reducing confusion but also accelerated learning and improved generalization. Furthermore, our findings suggest that robust computing resources and well-managed datasets are necessary for achieving optimal AI model performance. We also addressed the critical risks posed by mislabeled datasets, especially in scenarios involving natural disasters, where incorrect labeling can lead to disastrous outcomes. Advanced AI models with image recognition capabilities could automate essential tasks such as automated warning systems, better recommendation engines, and more accurate weather predictions and analyses.

To avoid making this tutorial excessively long, we've uploaded the complete code to GitHub, allowing you to download, modify, or reuse it as needed. The article also covers advanced evaluation and visualization techniques, supported by numerous graphs and charts to clarify key concepts.