Comparative Study of Different Adversarial Text to Image Methods

Introduction

Automatic synthesis of realistic images from text has become popular with deep convolutional and recurrent neural network architectures that aid in learning discriminative text feature representations. Even though the discriminative power and strong generalization properties of attribute representations are attractive, building them is a complex process that requires domain-specific knowledge. Over the years the techniques have evolved, as adversarial networks in the space of machine learning algorithms continue to evolve. In comparison, natural language offers an easy, general, and flexible plugin that can be used to identify and describe objects across multiple domains by means of visual categories. The best approach is to combine the generality of text descriptions with the discriminative power of attributes. This blog addresses different text-to-image synthesis algorithms that use GANs (Generative Adversarial Networks), natural language representation, and image synthesis techniques, and that aim to map words and characters directly to image pixels. The featured algorithms learn a text feature representation that captures the important visual details and then use these features to synthesize a compelling image that a human might mistake for real.

1. Generative Adversarial Text to Image Synthesis

This image synthesis mechanism uses deep convolutional and recurrent text encoders to learn a correspondence function between text and images, conditioning the model on text descriptions instead of class labels. It is an effective approach that enables text-based image synthesis using a character-level text encoder and a class-conditional GAN. The objective of the GAN is to view (text, image) pairs as joint observations and train the discriminator to judge pairs as real or fake. The generator is also equipped with a manifold interpolation regularizer (a regularization procedure which encourages interpolated outputs to appear more realistic), which significantly improves the quality of generated samples.

Both the generator network G and the discriminator network D perform feed-forward learning and inference and are trained to condition tightly on the textual features.

Source, LICENSE: Apache 2.0

The discriminator D has several layers of stride-2 convolution with spatial batch normalization followed by leaky ReLU. The GAN is trained with mini-batch SGD (Stochastic Gradient Descent). In addition to the real/fake inputs during training, the discriminator is also fed a third type of input consisting of real images with mismatched text, which it learns to score as fake (a minimal sketch of this matching-aware objective appears after the usage commands below).

The figure below illustrates text-to-image generation samples for different types of birds. Source — (Open Source Apache 2.0 License)

Library and Usage

git clone https://github.com/zsdonghao/text-to-image.git [TensorFlow 1.0+, TensorLayer 1.4+, NLTK: for tokenizer]
python downloads.py [download the Oxford-102 flower dataset and caption files (run this first)]
python data_loader.py [load data for further processing]
python train_txt2im.py [train a text to image model]
python utils.py [helper functions]
python models.py [models]
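As a rough illustration of how such a matching-aware discriminator can be trained, here is a minimal, hypothetical PyTorch sketch (not the TensorLayer code from the repository above); the 64x64 image size, the 128-dimensional placeholder text encodings, and the module name TextConditionalDiscriminator are all assumptions made for the example.

# Hypothetical sketch of the matching-aware objective: real image + matching text -> real,
# real image + mismatched text -> fake, generated image + matching text -> fake.
import torch
import torch.nn as nn

class TextConditionalDiscriminator(nn.Module):
    def __init__(self, txt_dim=128):
        super().__init__()
        # stride-2 convolutions with batch norm and leaky ReLU, as described above
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
        )
        # the text encoding is spatially replicated and fused with the image features
        self.joint = nn.Sequential(
            nn.Conv2d(256 + txt_dim, 256, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 1, 8),  # 64x64 input -> 8x8 feature map -> single score
        )

    def forward(self, img, txt):
        h = self.conv(img)                                     # (B, 256, 8, 8)
        t = txt[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))
        return self.joint(torch.cat([h, t], dim=1)).view(-1)   # raw logits

if __name__ == "__main__":
    bce = nn.BCEWithLogitsLoss()
    D = TextConditionalDiscriminator()
    real_img, fake_img = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
    txt, wrong_txt = torch.randn(4, 128), torch.randn(4, 128)  # placeholder text encodings
    ones, zeros = torch.ones(4), torch.zeros(4)
    d_loss = (bce(D(real_img, txt), ones)          # real image, matching text
              + bce(D(real_img, wrong_txt), zeros) # real image, mismatched text
              + bce(D(fake_img, txt), zeros)) / 3  # generated image, matching text
    print(float(d_loss))

The key point is the second loss term: a real image paired with the wrong caption is still labeled fake, which forces the discriminator to check text-image correspondence rather than image realism alone.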
2. Multi-Scale Gradient GAN for Stable Image Synthesis

The Multi-Scale Gradient Generative Adversarial Network (MSG-GAN) is responsible for handling instability in the gradients passing from the discriminator to the generator, which become uninformative due to a learning imbalance during training. It uses an effective technique that allows the flow of gradients from the discriminator to the generator at multiple scales, helping to generate synchronized multi-scale images. The discriminator not only looks at the final output (highest resolution) of the generator but also at the outputs of the intermediate layers, as illustrated in the figure below. As a result, the discriminator becomes a function of multiple scale outputs of the generator (by using concatenation operations) and, importantly, passes gradients to all the scales simultaneously.

The architecture of MSG-GAN for generating synchronized multi-scale images. Source — (Open Source MIT License)

MSG-GAN is robust to changes in the learning rate and shows a more consistent increase in image quality when compared to progressive growing (Pro-GAN). MSG-GAN shows the same trait and consistency for all resolutions, and images generated at higher resolution maintain the symmetry of certain features, such as the same color for both eyes or earrings in both ears. Moreover, the training phase allows a better understanding of image properties (e.g., quality and diversity).

Library and Usage

git clone https://github.com/akanimax/BMSG-GAN.git [PyTorch]
python train.py --depth=7 \
  --latent_size=512 \
  --images_dir=<path to images> \
  --sample_dir=samples/exp_1 \
  --model_dir=models/exp_1

3. T2F: text-to-face generation using deep learning (StackGAN++ and ProGAN)

T2F uses a combined architecture of ProGAN and StackGAN. ProGAN is known for the synthesis of facial images, while StackGAN is known for text encoding, where conditioning augmentation is the principal working methodology.

In the architecture, ProGAN works on the principle of adding new layers that model increasingly fine details as training progresses. Here both the generator and the discriminator start by creating images of low resolution and add in-depth details to the images in subsequent steps. This helps in a more stable and faster training process.

The StackGAN architecture consists of multiple generators and discriminators in a tree-like structure, where the different branches of the tree represent images of varying scales, all belonging to the same scene. StackGAN has been known for yielding different types of approximate distributions; these multiple related distributions include multi-scale image distributions and joint conditional and unconditional image distributions.

The textual description is encoded into a summary vector using an LSTM network. The summary vector, i.e. the embedding illustrated in the diagram below, is passed through the Conditioning Augmentation block (a single linear layer) to obtain the textual part of the latent vector for the GAN (this uses a VAE-like parameterization technique). The second part of the latent vector is random Gaussian noise. The latent vector yielded is then fed to the generator part of the GAN, while the embedding thus formed is also fed to the final layer of the discriminator for conditional distribution matching. The training of the GAN proceeds layer by layer, with every next layer adding spatial resolution at an increasing level. The fade-in technique is used to introduce any new layer; this step helps to remember and restore previously learned information. A minimal sketch of the latent-vector construction follows.
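As referenced above, here is a small, hypothetical PyTorch sketch of that latent-vector construction: an LSTM summary vector passed through a Conditioning Augmentation layer with VAE-style reparameterization, then concatenated with Gaussian noise. All dimensions and module names are illustrative assumptions; this is not the actual T2F code.

# Hypothetical sketch of T2F-style latent construction.
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    def __init__(self, embed_dim=256, latent_dim=128):
        super().__init__()
        # a single linear layer predicting mean and log-variance
        self.fc = nn.Linear(embed_dim, latent_dim * 2)

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=1)
        eps = torch.randn_like(mu)               # reparameterization trick
        return mu + eps * torch.exp(0.5 * logvar)

if __name__ == "__main__":
    captions = torch.randn(4, 20, 300)           # (batch, words, word-vector dim), placeholder
    lstm = nn.LSTM(300, 256, batch_first=True)
    _, (h, _) = lstm(captions)                   # final hidden state as the summary vector
    text_part = ConditioningAugmentation()(h[-1])       # (4, 128) textual part of the latent
    noise_part = torch.randn(4, 128)                     # second part: random Gaussian noise
    latent = torch.cat([text_part, noise_part], dim=1)   # (4, 256) fed to the generator
    print(latent.shape)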
T2F architecture for generating faces from textual descriptions. Source, LICENSE: MIT

The figure below illustrates the mechanism of facial image generation, with the textual caption shown for each generated face. Source — https://github.com/akanimax/T2F.git, LICENSE: MIT

Library and Usage

git clone https://github.com/akanimax/T2F.git
pip install -r requirements.txt
mkdir training_runs
mkdir training_runs/generated_samples training_runs/losses training_runs/saved_models
train_network.py --config=configs/11.conf

4. Object-driven Text-to-Image Synthesis via Adversarial Training

AttnGAN. Source, LICENSE: MIT

The Object-driven Attentive GAN (Obj-GAN) performs fine-grained text-to-image synthesis. Such in-depth granular image synthesis occurs in two steps: at first, a semantic layout (class labels, bounding boxes, shapes of salient objects) is generated, and then the images are synthesized by a de-convolutional image generator.

Semantic layout generation, however, is accomplished with the sentence served as input to Obj-GAN. This facilitates Obj-GAN to generate a sequence of objects specified by their bounding boxes (with class labels) and shapes. The box generator is trained as an attentive seq2seq model to generate a sequence of bounding boxes, followed by a shape generator to predict and generate the shape of each object in its bounding box.

In the image generation step, the object-driven attentive generator and the object-wise discriminator are designed to enable image generation conditioned on the semantic layout generated in the first step. The generator concentrates on synthesizing the image region within a bounding box by focusing on the words that are most relevant to the object in that box. Attention-driven context vectors serve as an important tool to encode information from the words that are most relevant to that image region; this is accomplished with the help of both patch-wise and object-wise context vectors for defined image regions.

A Fast R-CNN based object-wise discriminator is also used. It is able to offer rich object-wise discrimination signals, which help to determine whether the synthesized object matches the text description and the pre-generated layout.

Object-driven attention (paying attention to the most relevant words and pre-generated class labels) performs better than traditional grid attention and is capable of generating complex scenes in high quality. The open-source code for Obj-GAN from Microsoft is not available yet. Source — (License: Open Source)

5. MirrorGAN

MirrorGAN is built to emphasize global-local attentive features. It helps in the semantic-preserving text-to-image-to-text framework. MirrorGAN is equipped to learn text-to-image generation by re-description. It is composed of three modules: "a semantic text embedding module (STEM), a global-local collaborative attentive module for cascaded image generation (GLAM), and a semantic text regeneration and alignment module (STREAM)".

STEM generates word- and sentence-level embeddings, using a recurrent neural network (RNN) to embed the given text description into local word-level features and global sentence-level features.

GLAM has a multi-stage cascaded generator. It is designed by stacking three image generation networks sequentially to generate target images from coarse to fine scales. During target image generation, it leverages both local word attention and global sentence attention, which helps to progressively enhance the diversity and semantic consistency of the generated images (see the sketch below).
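The word-level attention that GLAM (like AttnGAN) relies on can be pictured with the following hypothetical PyTorch sketch, in which a perception (linear) layer projects word embeddings into the visual feature space, attention scores are computed against image region features, and a word-context feature is formed per region. The dimensions and names are made up for illustration; this is not the MirrorGAN implementation.

# Hypothetical sketch of GLAM/AttnGAN-style word-level attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    def __init__(self, word_dim=256, visual_dim=128):
        super().__init__()
        self.perception = nn.Linear(word_dim, visual_dim)   # common semantic space

    def forward(self, words, regions):
        # words:   (B, T, word_dim)   word embeddings of the caption
        # regions: (B, N, visual_dim) visual features of N image regions
        w = self.perception(words)                       # (B, T, visual_dim)
        scores = torch.bmm(regions, w.transpose(1, 2))   # (B, N, T) region-word scores
        attn = F.softmax(scores, dim=-1)                 # attention over words per region
        context = torch.bmm(attn, w)                     # (B, N, visual_dim) word-context features
        return context, attn

if __name__ == "__main__":
    att = WordAttention()
    words = torch.randn(2, 12, 256)      # 12-word caption, placeholder embeddings
    regions = torch.randn(2, 64, 128)    # 8x8 = 64 image regions
    context, attn = att(words, regions)
    print(context.shape, attn.shape)     # torch.Size([2, 64, 128]) torch.Size([2, 64, 12])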
STREAM's purpose is to regenerate the text description from the generated image, so that the image semantically aligns with the given text description.

The word-level attention model takes in related words, along with neighboring contextual information, to generate an attentive word-context feature. The word embedding and the visual feature are taken as input in each stage. The word embedding is first converted into an underlying common semantic space of visual features by a perception layer and multiplied with the visual feature to obtain the attention score. Finally, the attentive word-context feature is obtained by calculating the inner product between the attention score and the perception-layer word embedding.

MirrorGAN's two most important components, the semantic text regeneration and alignment modules, maintain overall sync between the input text and the output image. These two modules help to regenerate the text description from the generated image, and the output finally semantically aligns with the given text description. In addition, an encoder-decoder based image caption framework is used to generate captions in the architecture: the encoder is a convolutional neural network (CNN) and the decoder is an RNN.

MirrorGAN performs better than AttnGAN at all settings by a large margin, demonstrating the superiority of the proposed text-to-image-to-text framework and the global-local collaborative attentive module, since MirrorGAN generates high-quality images with semantics consistent with the input text descriptions.

The following figure illustrates how MirrorGAN generates images when some words of the text descriptions are modified, along with the corresponding top-2 attention maps in the last stage, preserving the semantic similarity. Source

Library and Usage

git clone git@github.com:komiya-m/MirrorGAN.git [python 3.6.8, keras 2.2.4, tensorflow 1.12.0]
Dependencies: easydict, pandas, tqdm
cd MirrorGAN
python main_clevr.py
python pretrain_STREAM.py
python train.py

6. StoryGAN

Story visualization takes as input a multi-sentence paragraph and generates at its output a sequence of images, one for each sentence. The story visualization task is a sequential conditional generation problem that jointly considers the current input sentence with the contextual information. StoryGAN gives less focus to the continuity of the generated images (frames) and more to the global consistency across dynamic scenes and characters.

It relies on the Text2Gist component in the Context Encoder, where the Context Encoder dynamically tracks the story flow in addition to providing the image generator with both local and global conditional information. The two-level discriminator and the recurrent structure on the inputs help to enhance the image quality and ensure consistency across the generated images and the story to be visualized.

The figure below illustrates the StoryGAN architecture. The variables represented in gray solid circles serve as the inputs: the story S and the individual sentences s1, . . . , sT, each paired with random noise for time steps 1, . . . , T. The generator network is built using specific customized components: the Story Encoder, the Context Encoder and the Image Generator. On top, there are two discriminators whose primary task is to discriminate whether each image-sentence pair and each image-sequence-story pair is real or fake (a rough sketch of this two-level discrimination follows).
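The two-level discrimination mentioned above can be pictured with the following hypothetical PyTorch sketch: an image discriminator scores individual (image, sentence) pairs, while a story discriminator fuses features for the whole sequence and scores the (image-sequence, story) pair. Feature extractors are reduced to placeholder linear layers, temporal concatenation is simplified to mean pooling, and the fusion differs slightly from the product-then-embed scheme described just below, so treat it only as a sketch of the idea, not the StoryGAN code.

# Hypothetical sketch of StoryGAN-style two-level discrimination.
import torch
import torch.nn as nn

class ImageDiscriminator(nn.Module):
    def __init__(self, img_feat=256, txt_feat=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(img_feat + txt_feat, 128),
                                   nn.LeakyReLU(0.2), nn.Linear(128, 1))

    def forward(self, img_feature, sent_feature):            # one frame, one sentence
        return self.score(torch.cat([img_feature, sent_feature], dim=-1))

class StoryDiscriminator(nn.Module):
    def __init__(self, img_feat=256, txt_feat=128, embed=128):
        super().__init__()
        self.img_embed = nn.Linear(img_feat, embed)
        self.txt_embed = nn.Linear(txt_feat, embed)
        self.fc = nn.Linear(embed, 1)                         # fully connected layer + sigmoid

    def forward(self, img_features, sent_features):
        # pool per-frame features over time, embed, and fuse by element-wise product
        img = self.img_embed(img_features.mean(dim=1))        # (B, embed) pooled image sequence
        txt = self.txt_embed(sent_features.mean(dim=1))       # (B, embed) pooled story text
        return torch.sigmoid(self.fc(img * txt))              # real/fake story probability

if __name__ == "__main__":
    B, T = 2, 5                                               # 5 frames / sentences per story
    img_feats, sent_feats = torch.randn(B, T, 256), torch.randn(B, T, 128)
    frame_scores = ImageDiscriminator()(img_feats, sent_feats)   # (B, T, 1) per-frame logits
    story_scores = StoryDiscriminator()(img_feats, sent_feats)   # (B, 1) story-level probabilities
    print(frame_scores.shape, story_scores.shape)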
The framework of StoryGAN. Source - LICENSE: MIT

The StoryGAN architecture is capable of distinguishing real/fake stories using the feature vectors of the images/sentences in the story when they are concatenated. The product of the image and text features is embedded to obtain a compact feature representation that serves as input to a fully connected layer. The fully connected layer is employed with a sigmoid non-linearity to predict whether it is a fake or a real story pair.

Library and Usage

git clone https://github.com/yitong91/StoryGAN.git [Python 2.7, PyTorch, cv2]
python main_clevr.py

7. Keras-text-to-image

In Keras, text-to-image translation is achieved using a GAN and Word2Vec as well as recurrent neural networks. It uses a DCGAN (Deep Convolutional Generative Adversarial Network), which has been a breakthrough in GAN research as it introduces major architectural changes to tackle problems like training instability, mode collapse, and internal covariate shift.

Sample DCGAN architecture to generate 64x64 RGB pixel images from the LSUN dataset. Source, License: MIT

Library and Usage

git clone https://github.com/chen0040/keras-text-to-image.git

import os
import sys
import numpy as np
from random import shuffle


def train_DCGan_text_image():
    seed = 42
    np.random.seed(seed)
    current_dir = os.path.dirname(__file__)
    # add the keras_text_to_image module to the system path
    sys.path.append(os.path.join(current_dir, '..'))
    current_dir = current_dir if current_dir != '' else '.'

    img_dir_path = current_dir + '/data/pokemon/img'
    txt_dir_path = current_dir + '/data/pokemon/txt'
    model_dir_path = current_dir + '/models'

    img_width = 32
    img_height = 32
    img_channels = 3

    from keras_text_to_image.library.dcgan import DCGan
    from keras_text_to_image.library.utility.img_cap_loader import load_normalized_img_and_its_text

    # load (image, caption) pairs and shuffle them before training
    image_label_pairs = load_normalized_img_and_its_text(img_dir_path, txt_dir_path,
                                                         img_width=img_width, img_height=img_height)
    shuffle(image_label_pairs)

    gan = DCGan()
    gan.img_width = img_width
    gan.img_height = img_height
    gan.img_channels = img_channels
    gan.random_input_dim = 200
    gan.glove_source_dir_path = './very_large_data'

    batch_size = 16
    epochs = 1000
    gan.fit(model_dir_path=model_dir_path, image_label_pairs=image_label_pairs,
            snapshot_dir_path=current_dir + '/data/snapshots',
            snapshot_interval=100,
            batch_size=batch_size,
            epochs=epochs)


def load_generate_image_DCGaN():
    seed = 42
    np.random.seed(seed)
    current_dir = os.path.dirname(__file__)
    sys.path.append(os.path.join(current_dir, '..'))
    current_dir = current_dir if current_dir != '' else '.'
    img_dir_path = current_dir + '/data/pokemon/img'
    txt_dir_path = current_dir + '/data/pokemon/txt'
    model_dir_path = current_dir + '/models'

    img_width = 32
    img_height = 32

    from keras_text_to_image.library.dcgan import DCGan
    from keras_text_to_image.library.utility.image_utils import img_from_normalized_img
    from keras_text_to_image.library.utility.img_cap_loader import load_normalized_img_and_its_text

    image_label_pairs = load_normalized_img_and_its_text(img_dir_path, txt_dir_path,
                                                         img_width=img_width, img_height=img_height)
    shuffle(image_label_pairs)

    gan = DCGan()
    gan.load_model(model_dir_path)

    # for a few (image, caption) pairs, save the original image and three images
    # generated from the same caption
    for i in range(3):
        image_label_pair = image_label_pairs[i]
        normalized_image = image_label_pair[0]
        text = image_label_pair[1]

        image = img_from_normalized_img(normalized_image)
        image.save(current_dir + '/data/outputs/' + DCGan.model_name + '-generated-' + str(i) + '-0.png')
        for j in range(3):
            generated_image = gan.generate_image_from_text(text)
            generated_image.save(current_dir + '/data/outputs/' + DCGan.model_name + '-generated-' + str(i) + '-' + str(j) + '.png')

Conclusion

Here I have presented some of the popular techniques for generating images from text. You can explore more techniques at https://github.com/topics/text-to-image. Happy Coding!!