Generate synthetic sequential data with TimeGAN

Time-series or sequential data can be defined as any data that has a time dependency. Cool, but where can I find sequential data? Well, a bit everywhere: from credit card transactions and my everyday routine and whereabouts, to medical records such as ECGs and EEGs.

Although sequential data is pretty common and highly useful, there are many reasons that lead to not leveraging it, from privacy regulations to the scarcity of its existence. In one of my previous posts, I've covered the ability of Generative Adversarial Networks (GANs) to learn and generate new synthetic data that preserves the utility and fidelity of a real dataset. Nevertheless, generating tabular data is far simpler than generating datasets that must preserve temporal dynamics. To model time-series data successfully, a model must not only capture the distributions of the features within each time point, but also the complex dynamics of those features across time. We must also not forget that each time sequence has a variable length associated with it. But being a challenging task does not mean it is impossible! Jinsung Yoon and Daniel Jarrett proposed, in 2019, a novel GAN architecture to model sequential data, TimeGAN, which I'll be covering with a practical example throughout this blog post.

Time-series Generative Adversarial Networks

TimeGAN, or Time-series Generative Adversarial Networks, was proposed in 2019 as a GAN-based framework able to generate realistic time-series data in a variety of different domains, meaning sequential data with different observed behaviors. Different from other GAN architectures (e.g., WGAN), where an unsupervised adversarial loss is applied to both real and synthetic data, the TimeGAN architecture introduces the concept of a supervised loss: the model is encouraged to capture the time-conditional distribution within the data by using the original data as supervision. We can also observe the introduction of an embedding network, which is responsible for reducing the dimensionality of the adversarial learning space.

[Figure: TimeGAN block diagram and training scheme, as depicted in the paper]

As mentioned above, TimeGAN is a framework to synthesize sequential data composed of 4 networks that play distinct roles in the process of modelling the data: the expected generator and discriminator, but also a recovery and an embedder model.

Implementation with TensorFlow 2

What's new about TimeGAN? Different from other GAN architectures for sequential data, the proposed framework is able to handle a mixed-data setting during training, where both static data (attributes) and sequential data (features) can be generated at the same time. It is also less sensitive to hyperparameter changes, and has a more stable training process when compared to other architectures.
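Before diving into the networks themselves, it helps to make the supervised loss mentioned above concrete. Below is a minimal sketch following the pattern used in the original TimeGAN implementation; it assumes `h` holds the latent sequences produced by the embedder and `h_hat_supervised` the supervisor's outputs for those same sequences:

```python
import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()

def supervised_loss(h, h_hat_supervised):
    # h: latent sequences from the embedder, shape (batch, seq_len, hidden_dim)
    # h_hat_supervised: supervisor predictions on those same latents
    # Each latent step is compared with the supervisor's prediction made from
    # the previous step, pushing the model to learn the stepwise
    # (time-conditional) dynamics of the data.
    return mse(h[:, 1:, :], h_hat_supervised[:, :-1, :])
```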
For the purpose of this example, I've decided to keep it simple, with a very similar architecture for all 4 TimeGAN elements: a 3-layer network instantiated with RNNs. But it is possible to change these architectures to have more or fewer layers, and also to choose between GRUs and LSTMs.

```python
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import GRU, LSTM, Dense, Input


def make_net(model, n_layers, hidden_units, output_units, net_type='GRU'):
    # Stack recurrent layers of the chosen type, followed by a Dense output
    if net_type == 'GRU':
        for i in range(n_layers):
            model.add(GRU(units=hidden_units, return_sequences=True, name=f'GRU_{i + 1}'))
    else:
        for i in range(n_layers):
            model.add(LSTM(units=hidden_units, return_sequences=True, name=f'LSTM_{i + 1}'))
    model.add(Dense(units=output_units, activation='sigmoid', name='OUT'))
    return model
```

Which results in the definition of each of the network elements as per the code snippet below.

```python
class Generator(Model):
    def __init__(self, hidden_dim, net_type='GRU'):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.net_type = net_type

    def build(self, input_shape):
        model = Sequential(name='Generator')
        model.add(Input(shape=input_shape))
        model = make_net(model, n_layers=3, hidden_units=self.hidden_dim,
                         output_units=self.hidden_dim, net_type=self.net_type)
        return model


class Discriminator(Model):
    def __init__(self, hidden_dim, net_type='GRU'):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.net_type = net_type

    def build(self, input_shape):
        model = Sequential(name='Discriminator')
        model = make_net(model, n_layers=3, hidden_units=self.hidden_dim,
                         output_units=1, net_type=self.net_type)
        return model


class Recovery(Model):
    def __init__(self, hidden_dim, n_seq):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.n_seq = n_seq

    def build(self, input_shape):
        recovery = Sequential(name='Recovery')
        recovery.add(Input(shape=input_shape, name='EmbeddedData'))
        recovery = make_net(recovery, n_layers=3, hidden_units=self.hidden_dim,
                            output_units=self.n_seq)
        return recovery


class Embedder(Model):
    def __init__(self, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim

    def build(self, input_shape):
        embedder = Sequential(name='Embedder')
        embedder.add(Input(shape=input_shape, name='Data'))
        embedder = make_net(embedder, n_layers=3, hidden_units=self.hidden_dim,
                            output_units=self.hidden_dim)
        return embedder


class Supervisor(Model):
    def __init__(self, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim

    def build(self, input_shape):
        model = Sequential(name='Supervisor')
        model.add(Input(shape=input_shape))
        model = make_net(model, n_layers=2, hidden_units=self.hidden_dim,
                         output_units=self.hidden_dim)
        return model
```

In what concerns the losses, TimeGAN is composed of three:

The reconstruction loss, which refers to the autoencoder (embedder & recovery) and, in a nutshell, compares how well the encoded data was reconstructed against the original.

The supervised loss which, in a nutshell, is responsible for capturing how well the generator approximates the next time step in the latent space.

The unsupervised loss, which is already familiar to us, as it reflects the relation between the generator and discriminator networks (the min-max game).

Given the architecture choice and the defined losses, we have three training phases:

Training the autoencoder on the provided sequential data for optimal reconstruction;

Training the supervisor using the real sequence data to capture the temporal behavior of the historical information; and finally,

The combined training of the four components while minimizing all three loss functions mentioned previously.
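To make that training schedule more tangible, here is a hypothetical skeleton of the three phases. Every helper named below (`real_batches`, `sample_noise`, and the `*_step` functions) is a placeholder standing in for the per-phase optimization steps of an actual implementation, not an API of any package:

```python
# Phase 1: train the autoencoder (embedder + recovery) for reconstruction
for _ in range(train_steps):
    X = next(real_batches)
    autoencoder_step(X)        # minimizes the reconstruction loss

# Phase 2: train the supervisor on real sequences
for _ in range(train_steps):
    X = next(real_batches)
    supervisor_step(X)         # minimizes the supervised loss

# Phase 3: joint training of generator, discriminator, embedder and supervisor
for _ in range(train_steps):
    X, Z = next(real_batches), sample_noise()
    joint_step(X, Z)           # minimizes all three losses together
```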
The full code detailing the training phases can be found at ydata-synthetic. The original implementation of TimeGAN, using TensorFlow 1, can be found here.

Synthetic stock data

The data used to evaluate the synthetic data generated by TimeGAN refers to Google stock data. The data has 6 time-dependent variables: Open, High, Low, Close, Adj Close and Volume.

Prior to synthesizing the data we must first ensure some preprocessing:

Scale the series to the range [0, 1]. For convenience, I've decided to leverage scikit-learn's MinMaxScaler;

Create rolling windows. Following the original paper's recommendations, I've created rolling windows with overlapping sequences of 24 data points.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


# Method implemented here: https://github.com/jsyoon0823/TimeGAN/blob/master/data_loading.py
# Originally used in TimeGAN research
def real_data_loading(data: np.array, seq_len):
    """Load and preprocess real-world datasets.

    Args:
      - data: Numpy array with the values from a dataset
      - seq_len: sequence length

    Returns:
      - data: preprocessed data.
    """
    # Flip the data to make it chronological
    ori_data = data[::-1]

    # Normalize the data
    scaler = MinMaxScaler().fit(ori_data)
    ori_data = scaler.transform(ori_data)

    # Preprocess the dataset: cut the data into overlapping sequences
    temp_data = []
    for i in range(0, len(ori_data) - seq_len):
        _x = ori_data[i:i + seq_len]
        temp_data.append(_x)

    # Mix the windows (to make the dataset similar to i.i.d.)
    idx = np.random.permutation(len(temp_data))
    data = []
    for i in range(len(temp_data)):
        data.append(temp_data[idx[i]])
    return data
```

Following the recommendations from the original paper, I've decided to train the synthesizer for 10,000 iterations; nevertheless, bear in mind that these values must be optimized for each dataset in order to return optimal results. You can find the full flow, including the data download, processing and synthesizing, in this notebook.

Measuring synthetic data fidelity and utility

Now that we were able to synthesize our data, it's time to check whether the new data properly reproduces the behavior observed in the original stock data.

Visual comparison

One of my favorite ways to compare real with synthetic data is through visualization. Of course, in terms of automation it is not ideal for validating the quality of the new synthetic data, but it already gives us a pretty good idea. To ensure a 2D visualization of the results, both a TSNE and a PCA with 2 components were applied. The results are pretty promising, as we see an almost perfect overlap between the synthetic and the real data points.

Synthetic data utility: Train Synthetic, Test Real

There are plenty of different metrics that can be used to measure the utility of the data, such as SRA and TSTR. In this example we've decided to go for the TSTR method to validate the utility of our synthetic data. As the regressor model, I've decided on a simple architecture: a single-layer GRU with 12 units.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import GRU, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import MeanAbsoluteError


def RNN_regression(units):
    opt = Adam(name='AdamOpt')
    loss = MeanAbsoluteError(name='MAE')
    model = Sequential()
    model.add(GRU(units=units, name='RNN_1'))
    model.add(Dense(units=6, activation='sigmoid', name='OUT'))
    model.compile(optimizer=opt, loss=loss)
    return model
```

Prior to training the model, we had to prepare the inputs: 75% of the data was used as training data (for both the synthetic and real datasets), and 25% of the real data was used as the test set. A sketch of this setup is shown below.
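As a rough illustration of the Train-Synthetic-Test-Real split just described, here is a minimal sketch. It assumes `real_data` and `synth_data` are the rolling windows produced earlier, convertible to arrays of shape (n_windows, 24, 6), and treats the first 23 steps of each window as the input and the last step as the regression target; the variable names are mine, not from the original notebook:

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

# Assumed inputs: the outputs of real_data_loading and of the trained
# synthesizer, converted to arrays of shape (n_windows, 24, 6).
real_data = np.asarray(real_data)
synth_data = np.asarray(synth_data)

split = int(0.75 * len(real_data))

# The first 23 steps of each window are the input, the 24th is the target.
X_real, y_real = real_data[:split, :-1, :], real_data[:split, -1, :]
X_synth, y_synth = synth_data[:split, :-1, :], synth_data[:split, -1, :]
X_test, y_test = real_data[split:, :-1, :], real_data[split:, -1, :]

early_stopping = EarlyStopping(monitor='val_loss', patience=10)

# Baseline: train on real, test on real
real_model = RNN_regression(12)
real_model.fit(X_real, y_real, validation_split=0.2,
               epochs=200, callbacks=[early_stopping])

# TSTR: train on synthetic, test on real
synth_model = RNN_regression(12)
synth_model.fit(X_synth, y_synth, validation_split=0.2,
                epochs=200, callbacks=[early_stopping])

print('Trained on real, test MAE:', real_model.evaluate(X_test, y_test, verbose=0))
print('Trained on synthetic (TSTR), test MAE:', synth_model.evaluate(X_test, y_test, verbose=0))
```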
For both the model trained on the synthetic data and the one trained on the real data, I've used the validation set loss as the early stopping criterion. Below are the summarized results obtained for the test set.

Conclusion

The results obtained with this experiment are very promising and exciting in what concerns the generation of synthetic sequential data. Nevertheless, there are a few caveats that need to be pointed out: the data under study can be considered quite simple, as it was relatively small, had a daily time frequency, held no surprises in terms of missing values, and its dimensionality (column-wise) was relatively low. Not to mention the very intensive training time. But without a doubt, Generative Adversarial Networks are impressive frameworks that can be leveraged for much more than synthetic data generation.

For those of you looking to keep exploring synthetic data generation, have a look into our GitHub repository. We'll be updating it with new generative algorithms as well as new data, and we invite you to collaborate!

Fabiana Clemente is CDO at YData. Improved and synthetic data for AI. YData provides the first dataset experimentation platform for Data Scientists.

Also published here.