Transformers have revolutionized natural language processing (NLP) tasks by providing superior performance in language translation, text classification, and sequence modeling.
The transformer architecture is built around a self-attention mechanism, which lets every element of a sequence attend to every other element, together with stacks of encoder and decoder layers that process the sequence.
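To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention, the operation at the core of every attention layer. The toy tensors below are invented purely for illustration and are not part of the cocktail model.

import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v)

# Toy example: a batch of one sequence with three positions and model dimension 4.
x = tf.random.normal((1, 3, 4))
print(scaled_dot_product_attention(x, x, x).shape)  # (1, 3, 4)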
This article will demonstrate how to build a transformer model to generate new cocktail recipes. We will use the Cocktail DB dataset, which contains information about thousands of cocktails, including their ingredients and recipes.
First, we need to download and preprocess the Cocktail DB dataset. We will use the Pandas library to accomplish this.
import pandas as pd
import requests
import string

# TheCocktailDB's search endpoint can list drinks by first letter via the f= parameter,
# so we make one request per letter and collect the results into a single DataFrame.
url = 'https://www.thecocktaildb.com/api/json/v1/1/search.php?f='
cocktail_df = pd.DataFrame()
for letter in string.ascii_lowercase:
    drinks = requests.get(url + letter).json().get('drinks') or []
    cocktail_df = pd.concat([cocktail_df, pd.DataFrame(drinks)], ignore_index=True)
# Keep only rows that have instructions and the columns we need; blank out missing ingredients.
cocktail_df = cocktail_df.dropna(subset=['strInstructions'])
cocktail_df = cocktail_df[['strDrink', 'strInstructions', 'strIngredient1', 'strIngredient2',
                           'strIngredient3', 'strIngredient4', 'strIngredient5', 'strIngredient6']]
cocktail_df = cocktail_df.fillna('')
Next, we need to tokenize and encode the cocktail instructions with a subword tokenizer built from the corpus.
import tensorflow_datasets as tfds

# Build a subword vocabulary from the recipe instructions
# (in recent TFDS releases this encoder lives under tfds.deprecated.text).
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (text for text in cocktail_df['strInstructions']), target_vocab_size=2**13)

def encode(text):
    encoded_text = tokenizer.encode(text)
    return encoded_text
cocktail_df['encoded_recipe'] = cocktail_df['strInstructions'].apply(encode)
MAX_LEN = max([len(recipe) for recipe in cocktail_df['encoded_recipe']])
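A quick sanity check of the tokenizer is to round-trip a sample instruction; the string below is an invented example rather than a row from the dataset.

sample = 'Shake with ice and strain into a chilled glass.'
ids = encode(sample)
print(ids)                    # a short list of subword ids
print(tokenizer.decode(ids))  # recovers the original string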
With the tokenized cocktail recipes, we can define the transformer decoder layer. The decoder layer consists of three sub-layers: a masked multi-head self-attention layer, a second multi-head attention layer over the encoder output, and a point-wise feed-forward network, each wrapped in a residual connection and layer normalization.
import tensorflow as tf
from tensorflow.keras.layers import LayerNormalization, MultiHeadAttention, Dense
class TransformerDecoderLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads, d_model, dff, rate=0.1):
        super(TransformerDecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(num_heads, d_model)  # masked self-attention
        self.mha2 = MultiHeadAttention(num_heads, d_model)  # attention over the encoder output
        self.ffn = tf.keras.Sequential([
            Dense(dff, activation='relu'),
            Dense(d_model)
        ])

        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.layernorm3 = LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask):
        # Masked self-attention: each position may only attend to itself and earlier positions.
        attn1 = self.mha1(x, x, x, attention_mask=look_ahead_mask)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(x + attn1)

        # Attention over the encoder output, with the self-attention output as the query.
        attn2 = self.mha2(out1, enc_output, enc_output)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(out1 + attn2)

        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(out2 + ffn_output)
        return out3
In the code above, the TransformerDecoderLayer class takes four arguments: the number of heads for the masked multi-head attention layer, the dimension of the model, the number of units in the point-wise feed-forward layer, and the dropout rate.
The call method defines the forward pass of the decoder layer, where x is the input sequence, enc_output is the output of the encoder, training is a Boolean flag that indicates whether the model is in training or inference mode, and look_ahead_mask is a mask that prevents the decoder from attending to future tokens.
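To see what the look-ahead mask looks like, here is a small illustrative example for a sequence of length 4; a value of 1 means the query position may attend to that key position, so each row only allows the current and earlier positions.

mask = tf.linalg.band_part(tf.ones((4, 4)), -1, 0)
print(mask.numpy())
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]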
We can now define the transformer model, which consists of multiple stacked transformer decoder layers followed by a Dense layer that maps the decoder output to the vocabulary size.
from tensorflow.keras.layers import Input
input_layer = Input(shape=(MAX_LEN,))
NUM_LAYERS = 4
NUM_HEADS = 8
D_MODEL = 256
DFF = 1024
DROPOUT_RATE = 0.1
decoder_layers = [TransformerDecoderLayer(NUM_HEADS, D_MODEL, DFF, DROPOUT_RATE) for _ in range(NUM_LAYERS)]
output_layer = Dense(tokenizer.vocab_size)
# Token ids must be embedded into D_MODEL-dimensional vectors before the attention layers.
embedding = tf.keras.layers.Embedding(tokenizer.vocab_size, D_MODEL)

x = embedding(input_layer)
look_ahead_mask = tf.linalg.band_part(tf.ones((MAX_LEN, MAX_LEN)), -1, 0)
for decoder_layer in decoder_layers:
    # With no separate encoder, the decoder attends over its own representation.
    x = decoder_layer(x, x, True, look_ahead_mask)
output = output_layer(x)
model = tf.keras.models.Model(inputs=input_layer, outputs=output)
In the code above, we define the input layer to accept the padded sequences of length MAX_LEN, embed the token ids into D_MODEL-dimensional vectors, and stack the TransformerDecoderLayer objects to process the sequence under the look-ahead mask.
The output of the last transformer decoder layer is passed through a Dense layer whose number of units equals the tokenizer's subword vocabulary size. We can train the model with the Adam optimizer and evaluate its performance after a number of epochs.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
def loss_function(real, pred):
    # Mask out padding positions (token id 0) so they do not contribute to the loss.
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    # Average only over the non-padding positions.
    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()

        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        # The rate rises linearly for warmup_steps, then decays as 1/sqrt(step).
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
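As a quick, purely illustrative sanity check, we can print the learning rate at a few steps to confirm the warmup-then-decay shape of the schedule.

schedule = CustomSchedule(D_MODEL)
for step in [1, 1000, 4000, 20000]:
    print(step, float(schedule(tf.constant(step, dtype=tf.float32))))
# The rate grows until roughly step 4000 (the warmup), then slowly decays.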
LR = CustomSchedule(D_MODEL)
optimizer = tf.keras.optimizers.Adam(LR, beta_1=0.9, beta_2=0.98, epsilon=1e-9)
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
@tf.function
def train_step(inp, tar):
    # Shift the targets by one position: the model sees tokens 0..t and
    # should predict token t + 1.
    tar_real = tar[:, 1:]

    with tf.GradientTape() as tape:
        # The look-ahead mask is already built into the model graph above.
        predictions = model(inp, training=True)
        # Drop the prediction for the last position so the shapes match tar_real.
        loss = loss_function(tar_real, predictions[:, :-1, :])

    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_accuracy.update_state(tar_real, predictions[:, :-1, :])
    return loss

Now we can train the model.

EPOCHS = 50
BATCH_SIZE = 64
NUM_EXAMPLES = len(cocktail_df)
for epoch in range(EPOCHS):
    print('Epoch', epoch + 1)
    total_loss = 0

    for i in range(0, NUM_EXAMPLES, BATCH_SIZE):
        batch = cocktail_df[i:i + BATCH_SIZE]
        input_batch = tf.keras.preprocessing.sequence.pad_sequences(
            batch['encoded_recipe'], padding='post', maxlen=MAX_LEN)
        # The model learns to predict each recipe token from the preceding ones,
        # so the target sequence is the input sequence itself.
        target_batch = input_batch

        loss = train_step(input_batch, target_batch)
        total_loss += loss

    print('Loss:', float(total_loss))
    print('Accuracy:', train_accuracy.result().numpy())
    train_accuracy.reset_states()
Once the model is trained, we can generate new cocktail recipes by feeding the model a seed sequence and iteratively sampling the next token until an end-of-sequence token is produced or a maximum number of tokens has been generated.
def generate_recipe(seed, max_len):
    encoded_seed = encode(seed)
    for i in range(max_len):
        input_sequence = tf.keras.preprocessing.sequence.pad_sequences(
            [encoded_seed], padding='post', maxlen=MAX_LEN)
        predictions = model(input_sequence, training=False)

        # Sample the next token from the logits at the last real (non-padding) position.
        next_logits = predictions[:, len(encoded_seed) - 1, :]
        predicted_id = int(tf.random.categorical(next_logits, num_samples=1)[0, 0])

        # Treat an id equal to the vocabulary size as an end-of-sequence marker.
        if predicted_id == tokenizer.vocab_size:
            break
        encoded_seed.append(predicted_id)
        if len(encoded_seed) >= MAX_LEN:
            break

    recipe = tokenizer.decode(encoded_seed)
    return recipe
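As a usage example (the seed string and token budget below are arbitrary choices, not values from the dataset):

print(generate_recipe('Shake the vodka and lime juice with ice', max_len=60))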
In summary, transformers are a powerful tool for sequence modeling that can be used in a wide range of applications beyond NLP.
By following the steps outlined in this article, it is possible to build a transformer model to generate new cocktail recipes, demonstrating the transformer architecture's flexibility and versatility.