Time-series or sequential data can be defined as any data that has a time dependency. Cool, huh? But where can I find sequential data? Well, a bit everywhere: from credit card transactions and my everyday routine and whereabouts to medical records such as ECGs and EEGs.
Although sequential data is pretty common and highly useful, there are many reasons that lead to not leveraging it, from privacy regulations to its sheer scarcity.
In one of my previous posts, I covered the ability of Generative Adversarial Networks (GANs) to learn and generate new synthetic data that preserves the utility and fidelity of a real dataset. Nevertheless, generating tabular data is far simpler than generating datasets that must preserve temporal dynamics.
To model time-series data successfully, a model must not only capture the feature distributions within each time point, but also the complex dynamics of those features across time. We must not forget, either, that each time sequence has a variable length associated with it.
But being a challenging task does not mean it is impossible! Jinsung Yoon, Daniel Jarrett and Mihaela van der Schaar proposed, in 2019, a novel GAN architecture to model sequential data, TimeGAN, which I'll be covering with a practical example throughout this blog post.
TimeGAN, or Time-series Generative Adversarial Networks, was proposed in 2019 as a GAN-based framework able to generate realistic time-series data in a variety of different domains, meaning sequential data with different observed behaviors.
Different from other GAN architectures (e.g., WGAN), which implement an unsupervised adversarial loss on both real and synthetic data, the TimeGAN architecture introduces the concept of a supervised loss: the model is encouraged to capture the conditional distribution over time within the data by using the original data as supervision.
It also introduces an embedding network that is responsible for reducing the dimensionality of the adversarial learning space.
TimeGAN block diagram and training scheme as depicted in the paper
As mentioned above, TimeGAN is a framework to synthesize sequential data composed of four networks that play distinct roles in modelling the data: the expected generator and discriminator, but also a recovery and an embedder model.
What’s new about TimeGAN?
Different from other GAN architectures for sequential data, the proposed framework is able to handle a mixed-data setting, where both static data (attributes) and sequential data (features) can be generated at the same time.
It is less sensitive to hyperparameter changes.
It has a more stable training process when compared to other architectures.
TimeGAN framework instantiated with RNNs
For the purpose of this example, I've decided to keep it simple, with a very similar architecture for all four elements: a 3-layer GRU network. But it's possible to change these architectures to more or fewer layers, and also to choose between GRUs and LSTMs.
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Dense, GRU, Input, LSTM

def make_net(model, n_layers, hidden_units, output_units, net_type='GRU'):
    if net_type == 'GRU':
        for i in range(n_layers):
            model.add(GRU(units=hidden_units,
                          return_sequences=True,
                          name=f'GRU_{i + 1}'))
    else:
        for i in range(n_layers):
            model.add(LSTM(units=hidden_units,
                           return_sequences=True,
                           name=f'LSTM_{i + 1}'))
    model.add(Dense(units=output_units,
                    activation='sigmoid',
                    name='OUT'))
    return model
Which results in the definition of each of the networks elements as per the code snippet below.
class Supervisor(Model):
    def __init__(self, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim

    def build(self, input_shape):
        model = Sequential(name='Supervisor')
        model.add(Input(shape=input_shape))
        model = make_net(model,
                         n_layers=2,
                         hidden_units=self.hidden_dim,
                         output_units=self.hidden_dim)
        return model

class Generator(Model):
    def __init__(self, hidden_dim, net_type='GRU'):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.net_type = net_type

    def build(self, input_shape):
        model = Sequential(name='Generator')
        model.add(Input(shape=input_shape))
        model = make_net(model,
                         n_layers=3,
                         hidden_units=self.hidden_dim,
                         output_units=self.hidden_dim,
                         net_type=self.net_type)
        return model

class Discriminator(Model):
    def __init__(self, hidden_dim, net_type='GRU'):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.net_type = net_type

    def build(self, input_shape):
        model = Sequential(name='Discriminator')
        model.add(Input(shape=input_shape))
        model = make_net(model,
                         n_layers=3,
                         hidden_units=self.hidden_dim,
                         output_units=1,
                         net_type=self.net_type)
        return model

class Recovery(Model):
    def __init__(self, hidden_dim, n_seq):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.n_seq = n_seq

    def build(self, input_shape):
        recovery = Sequential(name='Recovery')
        recovery.add(Input(shape=input_shape, name='EmbeddedData'))
        recovery = make_net(recovery,
                            n_layers=3,
                            hidden_units=self.hidden_dim,
                            output_units=self.n_seq)
        return recovery

class Embedder(Model):
    def __init__(self, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim

    def build(self, input_shape):
        embedder = Sequential(name='Embedder')
        embedder.add(Input(shape=input_shape, name='Data'))
        embedder = make_net(embedder,
                            n_layers=3,
                            hidden_units=self.hidden_dim,
                            output_units=self.hidden_dim)
        return embedder
As for the losses, TimeGAN combines three of them:
The reconstruction loss refers to the auto-encoder (embedder & recovery) and, in a nutshell, measures how well the encoded data was reconstructed when compared to the original.
The supervised loss is, in a nutshell, responsible for capturing how well the generator approximates the next time step in the latent space.
The unsupervised loss is already familiar to us, as it reflects the relation between the generator and discriminator networks (the min-max game).
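To make the three objectives concrete, here is a minimal sketch of how they can be written with Keras losses. The function names and the one-step latent-shift convention are my own reading of the paper, not the exact ydata-synthetic implementation:

```python
import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()
bce = tf.keras.losses.BinaryCrossentropy()

def reconstruction_loss(x, x_tilde):
    # embedder -> recovery round trip should reproduce the original input
    return mse(x, x_tilde)

def supervised_loss(h, h_hat_supervised):
    # the supervisor should predict the latent state at t from the state
    # at t-1, so compare the sequences shifted by one time step
    return mse(h[:, 1:, :], h_hat_supervised[:, :-1, :])

def unsupervised_loss(y_real, y_fake):
    # the usual min-max game: the discriminator scores real sequences
    # as 1 and synthetic ones as 0
    return bce(tf.ones_like(y_real), y_real) + bce(tf.zeros_like(y_fake), y_fake)
```

In joint training the generator's objective combines the supervised and adversarial terms, which is what pushes it to respect temporal dynamics rather than just fooling the discriminator.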
Given the architecture choice and the defined losses, we have three training phases: first, the autoencoder (embedder and recovery) is trained on the reconstruction loss; second, the supervisor is trained on the supervised loss using the real sequences; and finally, all the networks are trained jointly, combining the adversarial and supervised objectives.
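The skeleton of that training schedule looks roughly like this. The step functions here are placeholders standing in for the actual ydata-synthetic train steps, which apply the gradients of each loss:

```python
def train_timegan(train_steps, batches, autoencoder_step, supervisor_step, joint_step):
    # Phase 1: train embedder + recovery with the reconstruction loss
    for _ in range(train_steps):
        autoencoder_step(next(batches))
    # Phase 2: train the supervisor with the supervised loss on real sequences
    for _ in range(train_steps):
        supervisor_step(next(batches))
    # Phase 3: joint adversarial training of generator, discriminator,
    # embedder and supervisor
    for _ in range(train_steps):
        joint_step(next(batches))
```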
The full code detailing the training phases can be found at ydata-synthetic.
The original implementation of TimeGAN can be found here using TensorFlow 1.
The data used to evaluate the synthetic data generated by the TimeGAN framework is Google stock data. It has six time-dependent variables: Open, High, Low, Close, Adj Close, and Volume.
Prior to synthesizing the data, we must first do some preprocessing:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Method implemented here: https://github.com/jsyoon0823/TimeGAN/blob/master/data_loading.py
# Originally used in TimeGAN research
def real_data_loading(data: np.array, seq_len):
    """Load and preprocess real-world datasets.
    Args:
      - data: Numpy array with the values from a dataset
      - seq_len: sequence length
    Returns:
      - data: preprocessed data.
    """
    # Flip the data to make chronological data
    ori_data = data[::-1]
    # Normalize the data
    scaler = MinMaxScaler().fit(ori_data)
    ori_data = scaler.transform(ori_data)
    # Preprocess the dataset
    temp_data = []
    # Cut data by sequence length
    for i in range(0, len(ori_data) - seq_len):
        _x = ori_data[i:i + seq_len]
        temp_data.append(_x)
    # Mix the datasets (to make it similar to i.i.d)
    idx = np.random.permutation(len(temp_data))
    data = []
    for i in range(len(temp_data)):
        data.append(temp_data[idx[i]])
    return data
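As a quick sanity check of the windowing logic above: on toy data with 1,000 rows and a sequence length of 24, we should get 976 overlapping sequences of shape (24, 6). The array below is a stand-in for the stock data, and the slicing mirrors `real_data_loading` without the shuffle:

```python
import numpy as np

seq_len = 24
ori = np.arange(1000 * 6, dtype=float).reshape(1000, 6)  # toy stand-in for the stock data
# same slicing as in real_data_loading, without normalization or shuffling
windows = [ori[i:i + seq_len] for i in range(len(ori) - seq_len)]
```

Note that consecutive windows overlap by seq_len − 1 rows, which is why shuffling them afterwards makes the training batches closer to i.i.d.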
Following the recommendations from the original paper, I've decided to train the synthesizer for 10,000 iterations. Nevertheless, bear in mind that these values must be optimized for each dataset in order to return optimal results.
You can find the full flow, including the data download, processing, and synthesizing, in this notebook.
Now that we've been able to synthesize our data, it's time to check whether the new data properly reproduces the behavior observed in the original stock data.
Visual comparison
One of my favorite ways to compare real with synthetic data is through visualization. Of course, in terms of automation it is not ideal for validating the quality of the new synthetic data, but it already gives us a pretty good idea. To ensure a 2D visualization of the results, both t-SNE and PCA were applied with 2 components.
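A sketch of that projection, assuming the real and synthetic samples are arrays of shape (n_samples, seq_len, n_features). The helper name and the per-sequence averaging over the feature axis are my choices for illustration, not the exact notebook code:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_2d(real_seq, synth_seq, method='pca'):
    # collapse each sequence to a vector by averaging over the feature axis
    real_flat = real_seq.mean(axis=2)
    synth_flat = synth_seq.mean(axis=2)
    data = np.vstack([real_flat, synth_flat])
    if method == 'pca':
        emb = PCA(n_components=2).fit_transform(data)
    else:
        # fit t-SNE on real and synthetic points together so both
        # clouds live in the same 2D space
        emb = TSNE(n_components=2, perplexity=30).fit_transform(data)
    n = len(real_flat)
    return emb[:n], emb[n:]
```

Fitting the projection on both sets jointly matters: projecting them separately would make the overlap in the scatter plot meaningless.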
The results are pretty promising, as we see an almost perfect overlap between the synthetic and the real data points.
Synthetic data utility — Train synthetic test real
There are plenty of different metrics that can be used to measure the utility of the data, such as Synthetic Ranking Agreement (SRA) and Train on Synthetic, Test on Real (TSTR).
In this example, we've decided to go with the TSTR method to validate the utility of our synthetic data. As the regressor model, I've chosen a simple architecture: a single-layer GRU with 12 units.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, GRU
from tensorflow.keras.losses import MeanAbsoluteError
from tensorflow.keras.optimizers import Adam

def RNN_regression(units):
    opt = Adam(name='AdamOpt')
    loss = MeanAbsoluteError(name='MAE')
    model = Sequential()
    model.add(GRU(units=units,
                  name='RNN_1'))
    model.add(Dense(units=6,
                    activation='sigmoid',
                    name='OUT'))
    model.compile(optimizer=opt, loss=loss)
    return model
Prior to training the model, we had to prepare the inputs:
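One plausible way to frame those inputs for the regressor above (my convention for illustration, not necessarily the notebook's exact split) is to use the first seq_len − 1 steps of each sequence as features and the last step as the target:

```python
import numpy as np

def prepare_tstr_inputs(sequences):
    # sequences: array of shape (n_samples, seq_len, n_features)
    X = sequences[:, :-1, :]  # all but the last time step
    y = sequences[:, -1, :]   # the step the regressor must predict
    return X, y
```

The same split is applied to the real and the synthetic sequences, so the two regressors are evaluated on an identical real test set.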
For both the model trained on synthetic data and the one trained on real data, I used the validation set loss as the early-stopping criterion. Below are the summarized results obtained for the test set.
The results obtained with this experiment are very promising and exciting regarding the generation of synthetic sequential data.
Nevertheless, there are a few caveats that need to be pointed out: the data under study can be considered quite simple, as it was relatively small, with a daily time frequency; there were no surprises in terms of missing values; and the dimensionality (column-wise) was relatively low. Not to mention the very intensive training time.
But without a doubt, Generative Adversarial Networks are impressive frameworks that can be leveraged for much more than synthetic data generation.
For those of you looking to keep exploring synthetic data generation, have a look at our GitHub repository. We'll be updating it with new generative algorithms as well as new data, and we invite you to collaborate!
Fabiana Clemente is CDO at YData.
Improved and synthetic data for AI.
YData provides the first dataset experimentation platform for Data Scientists.