Keras provides a high-level API that simplifies the process of building and training complex neural network models. With a wide range of pre-built layers and functions, developers can easily build and train deep learning models on large datasets using optimization algorithms. Keras also supports GPU acceleration for training and inference, making it a popular choice for both research and industry applications.
Keras datasets are preprocessed datasets exposed through the keras.datasets module; the data files are downloaded and cached locally the first time a loader is called. These datasets are commonly used in the deep learning community for benchmarking models on tasks such as image classification, text classification, and regression. By leveraging these datasets, developers can experiment with different deep learning models and easily compare their performance.
This article looks at the Best Keras Datasets for Building and Training Deep Learning Models, accessible to developers and researchers worldwide.
The MNIST dataset is widely used in machine learning and computer vision. It consists of 70,000 grayscale images of handwritten digits 0–9, with 60,000 images for training and 10,000 for testing. Each image is 28x28 pixels in size and has a corresponding label denoting which digit it represents.
This dataset can be loaded with:
import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(path="mnist.npz")
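Before training, the integer pixel arrays are usually rescaled and given an explicit channel axis. The following is a minimal sketch of that step, assuming inputs shaped like the arrays returned by `mnist.load_data()` (images of shape `(N, 28, 28)` and integer labels of shape `(N,)`). The helper name `prepare_mnist` is hypothetical, and a small random array stands in for the real data so the snippet runs without a download.

```python
import numpy as np


def prepare_mnist(images, labels, num_classes=10):
    """Scale pixels to [0, 1], add a channel axis, and one-hot encode labels."""
    x = images.astype("float32") / 255.0              # uint8 0..255 -> float32 0..1
    x = x[..., np.newaxis]                            # (N, 28, 28) -> (N, 28, 28, 1)
    y = np.eye(num_classes, dtype="float32")[labels]  # (N,) -> (N, num_classes)
    return x, y


# Stand-in data with MNIST-like shapes (assumption: swap in the real arrays).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)
fake_labels = rng.integers(0, 10, size=(8,))

x, y = prepare_mnist(fake_images, fake_labels)
print(x.shape, y.shape)
```

One-hot labels suit a categorical cross-entropy loss; with a sparse categorical loss the integer labels can be used directly.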
The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. It is split into 50,000 training images and 10,000 test images; in the original distribution these are stored as five training batches and one test batch, each with 10,000 images.
This dataset can be loaded with:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
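The per-class balance described above is easy to check once the labels are loaded. The sketch below assumes label arrays shaped `(N, 1)`, which is how the Keras CIFAR loaders return them; the helper name `class_counts` and the toy labels are illustrative only.

```python
import numpy as np


def class_counts(labels, num_classes=10):
    """Count images per class; accepts labels shaped (N,) or (N, 1)."""
    flat = np.asarray(labels).reshape(-1)
    return np.bincount(flat, minlength=num_classes)


# Toy labels shaped like CIFAR-10's (N, 1) arrays (assumption: not real data).
toy_labels = np.array([[0], [3], [3], [9], [0], [3]])
counts = class_counts(toy_labels)
print(counts)
```

Applied to the real training labels, every entry should equal 5,000 if the classes are balanced as stated.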
The CIFAR-100 dataset has 60,000 32x32 colour images (50,000 training images and 10,000 test images) in 100 classes, with 600 images per class. The 100 classes are grouped into 20 super-classes; each image carries a fine label denoting its class and a coarse label denoting the super-class it belongs to.
This dataset can be loaded with:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar100.load_data(label_mode="fine")
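The relationship between fine and coarse labels can be recovered from the data itself. This sketch assumes the dataset has been loaded twice, once with `label_mode="fine"` and once with `label_mode="coarse"`, and that both calls return the images in the same order (an assumption worth verifying). The function name `fine_to_coarse` and the toy labels are hypothetical.

```python
import numpy as np


def fine_to_coarse(fine_labels, coarse_labels):
    """Build a {fine_class: coarse_class} dict from aligned label arrays."""
    fine = np.asarray(fine_labels).reshape(-1)
    coarse = np.asarray(coarse_labels).reshape(-1)
    mapping = {}
    for f, c in zip(fine.tolist(), coarse.tolist()):
        # Each fine class must always map to the same super-class.
        if mapping.setdefault(f, c) != c:
            raise ValueError(f"fine class {f} maps to more than one super-class")
    return mapping


# Toy aligned labels (assumption: stand-ins for the two label arrays).
toy_fine = np.array([[4], [7], [4], [12]])
toy_coarse = np.array([[0], [1], [0], [2]])
mapping = fine_to_coarse(toy_fine, toy_coarse)
print(mapping)
```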
The Fashion MNIST dataset was created by Zalando Research as a drop-in replacement for the original MNIST dataset. It consists of 70,000 grayscale images (a training set of 60,000 and a test set of 10,000) of clothing items.
The images are 28x28 pixels in size and represent 10 different classes of clothing items, including T-shirts/tops, trousers, pullovers, dresses, coats, sandals, shirts, sneakers, bags, and ankle boots. It is similar to the original MNIST dataset, but with more challenging classification tasks due to the greater complexity and variety of the clothing items.
This dataset can be loaded with:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
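The labels are integers from 0 to 9 rather than class names. A small lookup like the one below turns predictions back into readable names; the ordering follows the label table in the Keras documentation for this dataset, while the helper name `label_names` is an illustrative choice.

```python
# Index i holds the name of class i.
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def label_names(labels):
    """Map integer labels (0-9) to their clothing class names."""
    return [FASHION_MNIST_CLASSES[int(i)] for i in labels]


print(label_names([9, 0, 5]))
```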
The IMDB dataset is commonly used for sentiment analysis tasks, where the goal is to classify the reviews as positive or negative based on their content. It consists of a collection of 50,000 movie reviews (training set of 25,000 and a test set of 25,000) from the Internet Movie Database website, split evenly between positive and negative reviews.
Each review in this dataset is a text document, preprocessed and transformed into a sequence of integers, where each integer represents a word in the review. By default the full vocabulary is kept; the num_words argument caps it, commonly at the 10,000 most frequent words, in which case any less frequent word is replaced with the out-of-vocabulary token (oov_char).
This dataset can be loaded with:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(
    path="imdb.npz",
    num_words=10000,  # keep only the 10,000 most frequent words (default: None)
    skip_top=0,
    maxlen=None,
    seed=113,
    start_char=1,
    oov_char=2,
    index_from=3,
)
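To read an encoded review, the integer ids have to be mapped back to words while accounting for the `index_from`, `start_char` and `oov_char` conventions shown above. The sketch below uses a toy vocabulary so it runs offline; with the real data the word-to-rank dictionary would come from `imdb.get_word_index()`. The function name `decode_review` and the placeholder strings `<pad>`, `<start>` and `<unk>` are assumptions made for illustration.

```python
def decode_review(sequence, word_index, index_from=3, start_char=1, oov_char=2):
    """Turn an integer-encoded review back into text."""
    # word_index maps word -> rank; encoded ids are shifted up by index_from.
    id_to_word = {rank + index_from: word for word, rank in word_index.items()}
    special = {0: "<pad>", start_char: "<start>", oov_char: "<unk>"}
    words = [
        special[i] if i in special else id_to_word.get(i, "<unk>")
        for i in sequence
    ]
    return " ".join(words)


# Toy vocabulary (hypothetical); ranks start at 1 as in the real word index.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
encoded = [1, 4, 5, 6, 2, 7]
print(decode_review(encoded, toy_word_index))
```

With these toy inputs the ids 4 to 7 map to the four words, 1 marks the start of the review, and 2 marks a word that fell outside the vocabulary.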
The Boston Housing dataset contains information about housing in the Boston area. It consists of 506 instances (404 training and 102 test instances), with 13 attributes for each instance and the median home value as the target.
The attributes have a mix of quantitative and categorical variables, such as the average number of rooms per dwelling, per capita crime rate and the proportion of non-retail business acres per town.
This dataset can be loaded with:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.boston_housing.load_data(
    path="boston_housing.npz", test_split=0.2, seed=113
)
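Because the attributes are on very different scales, regression models on this data usually standardize each column using statistics from the training split only. The sketch below illustrates that step and also reproduces the 404/102 split arithmetic, assuming the split point is obtained by truncating 506 × (1 − test_split). The helper name `standardize` and the random stand-in arrays are hypothetical.

```python
import numpy as np


def standardize(train, test):
    """Scale columns to zero mean and unit variance using training statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train - mean) / std, (test - mean) / std


# Split sizes implied by test_split=0.2 on 506 rows.
n_total, test_split = 506, 0.2
n_train = int(n_total * (1 - test_split))
print(n_train, n_total - n_train)

# Random stand-ins with the dataset's shapes (assumption: not the real values).
rng = np.random.default_rng(113)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(n_train, 13))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(n_total - n_train, 13))
train_scaled, test_scaled = standardize(toy_train, toy_test)
```

Reusing the training mean and standard deviation for the test split keeps information from the test data out of preprocessing.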
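The Wine Quality dataset contains information on red and white wine samples. The usual task is to predict the quality of the wine from chemical properties like pH, density, alcohol content and citric acid content.
The variables in this dataset include fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and the quality score.
This dataset is not part of keras.datasets. It is distributed as semicolon-separated CSV files by the UCI Machine Learning Repository and, once downloaded, can be loaded with pandas (the local file path below is an assumption):
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("winequality-red.csv", sep=";")
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns="quality"), df["quality"], test_size=0.2, random_state=113)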
The Reuters Newswire dataset is a preprocessed version of the original Reuters dataset, with the text encoded as sequences of integers. It consists of 11,228 news articles with a vocabulary of 30,979 words.
Each article is classified into one of 46 different topics like “corn”, “crude”, “earnings” and “acquisitions”.
This dataset can be loaded with:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.reuters.load_data(
    path="reuters.npz", num_words=None, skip_top=0, maxlen=None,
    test_split=0.2, seed=113, start_char=1, oov_char=2, index_from=3,
)
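Since the articles arrive as variable-length integer sequences, a simple way to feed them to a dense network is to convert each one into a fixed-length multi-hot vector. The following sketch shows one way to do this with NumPy; the function name `multi_hot` and the toy sequences are illustrative, and `num_words` here is simply the vector length chosen for the encoding.

```python
import numpy as np


def multi_hot(sequences, num_words):
    """Encode each integer sequence as a fixed-length 0/1 vector."""
    out = np.zeros((len(sequences), num_words), dtype="float32")
    for row, seq in enumerate(sequences):
        # Keep only ids that fit in the vector; repeated ids set the same slot.
        ids = np.array([i for i in seq if 0 <= i < num_words], dtype=int)
        out[row, ids] = 1.0
    return out


# Toy encoded articles (assumption: stand-ins for the real sequences).
toy_sequences = [[1, 4, 4, 7], [1, 2, 9]]
vectors = multi_hot(toy_sequences, num_words=10)
print(vectors.shape)
```

This representation discards word order and counts, which is often acceptable for topic classification but not for tasks that depend on phrasing.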
This dataset consists of medical data about Pima Indian women, such as age, number of pregnancies, glucose levels, blood pressure, skin thickness, BMI and insulin level. It contains 768 samples with 8 input variables and 1 output variable.
The Pima Indians Diabetes dataset is not part of keras.datasets; it is usually distributed as a CSV file. Assuming a local, header-less copy named pima-indians-diabetes.csv, it can be loaded with NumPy:
import numpy as np
data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")  # file name is an assumption
x, y = data[:, :8], data[:, 8]  # 8 input variables, 1 output variable
The Dogs vs Cats dataset consists of 25,000 labelled images of dogs and cats, with 12,500 images of each class. These images were collected from various sources with varying sizes and quality.
The dataset is available from the Kaggle Dogs vs. Cats competition page rather than from keras.datasets. Once the images are sorted into one sub-directory per class, they can be loaded from disk as follows:
# Import the necessary Keras libraries:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Set the paths to the training and validation directories:
train_dir = 'path/to/train'
validation_dir = 'path/to/validation'
# Define an ImageDataGenerator object to perform data augmentation and normalization:
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)
# The validation images are only rescaled, not augmented:
validation_datagen = ImageDataGenerator(rescale=1./255)
# Use flow_from_directory to load directory data in Keras:
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary',
)
validation_generator = validation_datagen.flow_from_directory(
    validation_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary',
)
# flow_from_directory returns a DirectoryIterator that yields (images, labels) batches.
Note that in the above code, we are using data augmentation to create variations of the training images to help prevent overfitting. The validation data is not augmented.
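To make the idea of augmentation concrete, the sketch below applies two of the transformations named in the code above, rescaling by 1/255 and random horizontal flipping, to a batch of images using plain NumPy. It is an illustration of the concept under simple assumptions (channels-last `uint8` batches), not how ImageDataGenerator is implemented; the function name `rescale_and_flip` and the `flip_probability` parameter are hypothetical.

```python
import numpy as np


def rescale_and_flip(batch, rng, flip_probability=0.5):
    """Rescale uint8 images to [0, 1] and randomly mirror each one left-to-right."""
    out = batch.astype("float32") / 255.0           # same effect as rescale=1./255
    flip = rng.random(len(out)) < flip_probability  # one decision per image
    out[flip] = out[flip, :, ::-1, :]               # reverse the width axis
    return out, flip


# Random stand-in batch shaped (batch, height, width, channels).
rng = np.random.default_rng(0)
toy_batch = rng.integers(0, 256, size=(4, 150, 150, 3), dtype=np.uint8)
augmented, flipped = rescale_and_flip(toy_batch, rng)
print(augmented.shape, flipped)
```

Each pass over the data draws fresh flip decisions, so the model rarely sees exactly the same input twice.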
Keras datasets are a valuable resource for machine learning practitioners and researchers: they save time and effort in data collection and preprocessing, leaving more room for model development and experimentation.
These Keras datasets are also available for anyone to download and use freely.