Introduction
My name is Ankush Thakur, and as a Computer Science student fascinated by generative models, I’ve recently been diving deep into one of the most influential architectures in modern AI: StyleGAN. While exploring how neural networks synthesize photorealistic images, I kept returning to a system that fundamentally changed how we think about generative models — not just in terms of output quality, but in how the underlying latent space can be controlled, shaped, and disentangled. This article is my attempt to break down that architecture in the clearest way possible.
StyleGAN, introduced by NVIDIA’s research team in the groundbreaking paper “A Style-Based Generator Architecture for Generative Adversarial Networks” (Karras et al., 2019), represents a major shift in GAN design philosophy. Rather than treating the latent vector as a single source of all variation, StyleGAN reorganizes the generator into a series of style-controlled layers, giving us fine-grained control over features like pose, texture, color, and high-frequency details. The architecture separates “what to generate” from “how to render it”, allowing for unprecedented manipulation of generated faces, objects, and artistic styles.
In this article, I will break down its inner mechanics — from the learned constant input, to the Z-to-W mapping network, to the style modulation that gives StyleGAN its name — and explain how these innovations collectively solve the long-standing “entangled black-box” problem of traditional GANs.
Along with the theoretical structure of StyleGAN, this article also explores its practical side by implementing the core ideas using PyTorch. To make the discussion concrete, we will train and experiment with a face dataset that balances quality and accessibility. The most practical choice for this purpose is the CelebA-HQ dataset, which is widely used in generative model research due to its high resolution and clean facial alignment. Its controlled variations in expression, lighting, and pose make it ideal for observing how StyleGAN’s style layers influence image synthesis. By recreating the mapping network, constant input, AdaIN-based modulation, and noise injection in PyTorch, we can directly see how each component contributes to disentangling the latent space and improving image controllability. This hands-on section of the article allows readers not only to understand the architecture conceptually but also to experience how these abstract ideas become real, trainable code.
Overview of ProGAN
Before diving into StyleGAN, it’s essential to understand the foundation it’s built upon: the Progressive Growing of GANs (ProGAN). ProGAN was the breakthrough architecture that first solved the problem of stably generating convincing, high-resolution images (like 1024x1024 faces). Its core idea is brilliantly simple: instead of forcing a single, massive network to learn everything from coarse structure to fine detail all at once — a notoriously unstable process — ProGAN “grows” the network over time.
Training begins at a very low resolution, like 4x4, allowing the network to first master the most basic, large-scale features (e.g., the general pose of a face). Once stable, new convolutional layers are added to both the generator and discriminator to double the resolution (e.g., to 8x8, then 16x16, and so on). A “fade-in” mechanism smoothly transitions the training to the new, higher-resolution layers, allowing the network to focus on learning progressively finer details at each stage. This “curriculum learning” approach provides incredible stability and quality, creating the powerful synthesis backbone that StyleGAN adopts and dramatically enhances.
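The fade-in step itself is simple enough to sketch. The snippet below is only a schematic illustration of the blending; the function name, shapes, and alpha value are my own and not ProGAN's actual code:
import torch
import torch.nn.functional as F

def fade_in(old_rgb, new_rgb, alpha):
    # Blend an upsampled copy of the old low-res output with the new high-res branch.
    old_up = F.interpolate(old_rgb, scale_factor=2, mode='nearest')
    return alpha * new_rgb + (1 - alpha) * old_up

old = torch.randn(1, 3, 8, 8)               # output of the already-trained 8x8 stage
new = torch.randn(1, 3, 16, 16)             # output of the freshly added 16x16 stage
print(fade_in(old, new, alpha=0.3).shape)   # torch.Size([1, 3, 16, 16])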
The Black Box Problem in ProGAN
While ProGAN was a major breakthrough for stabilizing GAN training, it still suffered from what researchers call the black box problem. In ProGAN, the latent vector z is fed directly into the generator and becomes the single source of every visual attribute the model produces. Because this one vector controls pose, identity, lighting, texture, and micro-details all at once, the generator learns to internalize every feature in an extremely entangled way. There is no clear separation between “what” the model should generate and “how” it should render it; everything flows through the same route. As ProGAN grows layer by layer, this entanglement only becomes worse: early layers influence global structure while later layers add fine details, but all of them are driven by the same unstructured input. This means that even a tiny change in z can unpredictably alter multiple aspects of the output — adjusting hair color might unintentionally change the jawline, or modifying pose might affect texture. Because these relationships are hidden inside the network, the generator acts like a closed black box: powerful, but impossible to interpret, control, or manipulate in a meaningful way. This is the core limitation StyleGAN was designed to overcome.
How StyleGAN Fixes the Black Box Problem
StyleGAN solves ProGAN’s black box problem by changing how information flows inside the generator. Instead of giving the model one random vector z and letting it control everything at once, StyleGAN first passes z through a small network that turns it into a cleaner and more organized vector called w. This new vector is easier for the generator to understand, so each part of the image can be controlled separately. The important change is that w is not used only once; it is used at every layer of the generator. Because of this, early layers can shape the big things in the image like face position and head size, middle layers can handle features like eyes or hair, and the final layers can add tiny details like skin texture. StyleGAN also replaces the old “starting point” with a learned constant block, which removes hidden confusion from the very first step.
When you look at the original architecture diagram in the paper, you can clearly see how StyleGAN divides the work into stages, making the whole process easier to understand and more controllable. In simple words, StyleGAN takes one big, messy instruction and breaks it into smaller, clearer instructions — turning the generator from a black box into a system we can actually guide and modify.
The Mapping Network (f)
The Mapping Network (f) is the first major innovation in the StyleGAN generator. It is a separate, deep neural network whose sole purpose is to transform the input latent code z into a new, intermediate latent code w. In the official paper, this network is an 8-layer Multi-Layer Perceptron (MLP).
Here’s the process:
- Input (z): We start with the same 512-dimensional latent vector z that ProGAN would have used, drawn from a simple Gaussian (normal) distribution. This is the “entangled” space.
- Transformation: This z vector is passed not into the image generator, but into the 8-layer mapping network.
- Output (w): The network outputs a new 512-dimensional vector, w. This w vector exists in a new, learned latent space, W.
The entire point of this is disentanglement. The Z space is simple (a Gaussian), but the space of real faces is incredibly complex. The mapping network’s job is to learn the complex “warping” or “un-tangling” required to map from the simple Z space to a new W space that more cleanly and linearly represents the different, high-level features of a face (like pose, identity, or hairstyle). This w vector is a much cleaner set of instructions, and it is this vector, not z, that will be used to control the image synthesis.
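To make the shapes concrete, here is a minimal sketch of this idea (the full MappingNetwork class is built properly in the PyTorch section later; the variable names here are just for illustration):
import torch
import torch.nn as nn

z_dim, w_dim = 512, 512
layers = []
for i in range(8):                       # 8-layer MLP, as in the paper
    layers += [nn.Linear(z_dim if i == 0 else w_dim, w_dim), nn.ReLU()]
mapping = nn.Sequential(*layers[:-1])    # drop the final ReLU so w is unconstrained

z = torch.randn(16, z_dim)               # entangled codes sampled from a Gaussian
w = mapping(z)                           # intermediate codes in the learned W space
print(w.shape)                           # torch.Size([16, 512])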
AdaIN
Now that we have a clean w vector from the Mapping Network, we need a mechanism to inject this “style” information into the Synthesis Network. ProGAN’s method — feeding the latent code only at the 4x4 input — is what we’re avoiding, as it creates entanglement. StyleGAN’s solution is to apply w at every resolution block (4x4, 8x8, 16x16, etc.) using AdaIN (Adaptive Instance Normalization).
This process happens within each synthesis block and consists of two distinct steps:
- Normalization (Instance Norm): First, the output of the convolutional layer (a 3D feature map) is normalized. Specifically, Instance Normalization is used. This operation computes the mean and variance for each channel and for each sample independently. It then subtracts this mean and divides by the standard deviation, effectively “wiping out” the feature map’s current statistical properties. This step is critical because it separates the content (the spatial structure of the features) from the style (the current mean and variance) of that feature map.
- Modulation (The “Adaptive” Part): Immediately after normalization, the “style” from w is applied. This is done through a “scale and shift” operation. The normalized feature map is multiplied by a new scale (ys) and has a new bias (yb) added to it. These ys and yb values are generated directly from our w vector. For each synthesis block, the w vector is passed through a unique, learned Affine transformation (a simple linear layer) that produces the precise scale and bias values needed for that specific resolution.
In short, AdaIN first resets the style of the content at a given layer and then applies a new style dictated by w.
This design is powerful because it is hierarchical. The w vector is used to generate different scale and bias parameters for each resolution level. This allows w to control the image features at different scales: the w applied at coarse layers (4x4, 8x8) controls high-level, structural styles like pose and face shape, while the w applied at fine layers (512x512, 1024x1024) controls fine-grained styles like skin texture and hair color.
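Putting the two steps together, a minimal AdaIN module might look like the sketch below. The class and variable names are my own illustration of the paper's description, not the official implementation; the simplified synthesis blocks we build later fold the modulation step into a StyleMod module.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    # Instance-normalize the feature map, then re-style it with a scale and bias derived from w.
    def __init__(self, w_dim, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)        # step 1: wipe out the current per-channel statistics
        self.affine = nn.Linear(w_dim, channels * 2)   # learned affine map: w -> (ys, yb)
    def forward(self, x, w):
        ys, yb = self.affine(w).chunk(2, dim=1)        # per-channel scale and bias
        ys = ys[:, :, None, None]                      # reshape to [B, C, 1, 1] for broadcasting
        yb = yb[:, :, None, None]
        return ys * self.norm(x) + yb                  # step 2: apply the new style

x = torch.randn(4, 256, 8, 8)          # a feature map at the 8x8 stage
w = torch.randn(4, 512)                # an intermediate latent code
print(AdaIN(512, 256)(x, w).shape)     # torch.Size([4, 256, 8, 8])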
Style Modulation
At the heart of AdaIN is the operation that gives StyleGAN its power: style modulation. This is the “adaptive” part of the process, where the abstract “style” from the w vector is used to directly manipulate the feature maps in the Synthesis Network.
While “AdaIN” describes the entire block (Normalization + Modulation), “Style Modulation” refers specifically to the two-step process of applying the style:
- Generating the Style Parameters: First, the network needs to translate the master w vector (our 512-dim “theme”) into specific instructions for the current layer. It does this by passing w through a unique, learned Affine transformation (a simple, fully-connected linear layer). The only job of this layer is to produce two new vectors: a scale vector (ys) and a bias vector (yb). These vectors have the same number of elements as the number of channels in the current feature map. This is how the single w vector creates a unique “style” for this specific resolution.
- Applying the Style (The “Scale and Shift”): This is the core “modulation” step. The network takes the normalized (style-less) feature map from the Instance Norm step and applies these new style parameters. For each channel in the feature map, it performs a simple, powerful operation:
- It multiplies the entire channel by its corresponding scale value (ys).
- It adds its corresponding bias value (yb).
That’s it. This “scale and shift” is the entire mechanism.
By scaling the activations, the w vector gains direct control over the variance (or “contrast”) of each feature. By shifting the activations (adding the bias), it gains direct control over the mean (or “brightness”) of each feature.
This is why it’s called modulation. Just like modulating a radio signal, we are taking our “content” signal (the normalized feature map) and “modulating” it with a new “style” signal (the scale and bias from w) to create the final output. This elegant operation is what allows w to define the texture, color, and overall style of the image, channel by channel, and layer by layer.
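A quick numerical check makes this concrete: after normalization, the channel’s mean and standard deviation are set almost directly by the bias and scale. The values below are arbitrary and purely illustrative:
import torch

x = torch.randn(1, 1, 64, 64) * 3.0 + 7.0    # one channel with some arbitrary statistics
normed = (x - x.mean()) / x.std()             # instance-norm style: zero mean, unit std
ys, yb = 2.5, -1.0                            # scale and bias that would come from w

styled = ys * normed + yb
print(round(styled.mean().item(), 2))         # ~ -1.0 -> the bias controls the mean ("brightness")
print(round(styled.std().item(), 2))          # ~  2.5 -> the scale controls the std ("contrast")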
Noise Injection
While the w vector gives us incredible control over the style of an image, there’s a second class of features StyleGAN needs to model: stochastic details.
These are the elements in an image that are inherently random and don’t depend on the person’s identity or the overall style. Think about the precise placement of individual freckles, the exact curl of a single strand of hair, or the fine-grained texture of skin pores. These are details that should not be controlled by w. If w had to manage every single hair, it would waste its capacity and couldn’t focus on high-level features like hairstyle.
To solve this, StyleGAN introduces a new, explicit mechanism: direct noise injection.
Here’s how it works:
- Create Noise: At each resolution level (4x4, 8x8, etc.), the network creates a new, single-channel tensor of random noise, which is then broadcast to match the number of channels in the feature map.
- Scale the Noise: This noise is multiplied by a learned, per-channel scaling factor. This allows the network to learn how much randomness to apply at each layer (e.g., more noise for hair texture, less for the chin).
- Add to Features: This scaled noise is then simply added to the feature map, right after the AdaIN style modulation.
This mechanism is brilliant because it gives the network a separate, dedicated “side-channel” just for handling randomness. The network learns to rely on the w vector (via AdaIN) for the “global” stylistic choices (like hair color, skin tone, pose) and to rely on this injected noise for the “local,” fine-grained, random details (like the exact placement of that hair).
This further improves disentanglement. By “outsourcing” the job of random variation to the noise, it frees up the w vector to be a pure, clean controller of style, which is exactly what we wanted to achieve.
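The arithmetic behind this side-channel is essentially a one-liner, sketched below with illustrative shapes and values: one random map is shared across channels, and a learned per-channel weight decides how strongly each channel uses it.
import torch

features = torch.randn(8, 256, 16, 16)   # feature map at the 16x16 stage
weight = torch.zeros(1, 256, 1, 1)       # learned per-channel noise strength (typically initialized to 0)
weight[:, :128] = 0.1                    # pretend the network learned to use noise in half the channels
noise = torch.randn(8, 1, 16, 16)        # ONE random map per image, shared by all channels

out = features + weight * noise          # broadcast: same pattern, different strength per channel
print(out.shape)                         # torch.Size([8, 256, 16, 16])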
Learned Constant Input
In previous architectures like ProGAN, the synthesis network’s input was a 4x4x512 feature map derived directly from the z latent vector. This meant the initial input carried entangled information. StyleGAN replaces this entirely. The synthesis network instead begins with a single, 4x4x512 tensor of learnable parameters. This tensor is constant in the sense that it is identical for every image generated; it is not derived from z or w. It is simply a set of free parameters, optimized via backpropagation like the network’s convolutional weights.
The function of this constant input is to serve as a fixed starting point. The generator learns this optimal “base” representation. From this point forward, the network relies exclusively on the AdaIN style modulation (from w) and the injected noise at each subsequent layer to introduce all variation, style, and content. This design choice enforces a clear separation: the w vector controls all global stylistic and structural features via AdaIN, while the constant input simply provides a static foundation for these modulations to act upon.
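In code, this boils down to a single learnable tensor repeated across the batch. A small sketch (illustrative only; the real ConstantInput module appears in the implementation section):
import torch
import torch.nn as nn

const = nn.Parameter(torch.randn(1, 512, 4, 4))   # free parameters, trained by backprop like any weight
batch = const.repeat(8, 1, 1, 1)                   # the same "canvas" for all 8 images in the batch

print(batch.shape)                                 # torch.Size([8, 512, 4, 4])
print(torch.equal(batch[0], batch[7]))             # True: every image starts from an identical input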
Style Mixing Regularization
Style Mixing is a powerful regularization technique used during the training of StyleGAN to further enforce the disentanglement of the W space. The primary goal is to prevent the network from learning to correlate styles at adjacent resolution levels (e.g., assuming a specific pose always co-occurs with a specific hairstyle).
The mechanism is implemented as follows:
- Generate Two W Vectors: During a training step, two separate latent codes, z1 and z2, are sampled. Both are passed through the Mapping Network (f) to produce two intermediate latent vectors, w1 and w2.
- Select Crossover Point: A random “crossover” point is chosen among the synthesis network’s style layers. (Note: StyleGAN has 18 style layers in total — two for each resolution block from 4x4 to 1024x1024).
- Apply Mixed Styles: The generator then creates an image using both w vectors.
- Before Crossover: All AdaIN layers before the crossover point (e.g., the coarse layers from 4x4 to 16x16) are modulated using w1.
- After Crossover: All AdaIN layers after the crossover point (e.g., the fine layers from 32x32 to 1024x1024) are modulated using w2.
By feeding the discriminator images created with this “mixed style,” the network is forced to learn that each style layer is independent and cannot rely on learned correlations from adjacent layers.
This regularization technique has a significant benefit at inference time: it allows for the deliberate mixing of styles from two different source images. We can take the w1 from “Image A” and apply it to the coarse layers to borrow its pose and face shape, while taking the w2 from “Image B” and applying it to the fine layers to borrow its hair color and skin texture. This creates a new, plausible hybrid image and serves as a powerful visual proof of the generator’s disentanglement.
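The crossover logic itself is tiny; the sketch below only builds the per-layer style schedule (the 18-layer count follows the original 1024x1024 configuration described above, and the variable names are my own):
import torch

num_style_layers = 18                              # two style inputs per resolution, 4x4 ... 1024x1024
w1 = torch.randn(1, 512)                           # w from z1 (stand-in values)
w2 = torch.randn(1, 512)                           # w from z2
crossover = torch.randint(1, num_style_layers, (1,)).item()

# Coarse layers (before the crossover) take w1, finer layers take w2.
w_per_layer = [w1 if i < crossover else w2 for i in range(num_style_layers)]
print(crossover, len(w_per_layer))                 # e.g. 7 18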
Summary of the Architecture
So far, we have broken down the core architecture of the StyleGAN generator, which is designed to solve the “black box” entanglement problem inherent in its predecessor, ProGAN. We’ve seen that instead of a single latent input, StyleGAN’s design is based on a new, disentangled generator. It starts with a Learned Constant Input, a “blank canvas” that provides a stable foundation. The “style” of the image is then dictated by an intermediate vector, w, which is produced by the Mapping Network (f) — a “translator” that un-tangles the initial z vector. This w vector is then injected at every resolution level of the generator using AdaIN (Adaptive Instance Normalization). This mechanism’s core operation is Style Modulation, a “scale and shift” that gives w precise, hierarchical control over the image’s features. To handle fine, random details like freckles and hair, the network uses a separate Noise Injection pathway. Finally, these components are trained using Style Mixing Regularization, a technique that forces the network to keep the coarse, medium, and fine styles independent, ensuring true disentanglement.
Implementing StyleGAN in PyTorch
Now it’s time to go deeper and translate these powerful concepts into code. We’ve explored the “what” and “why” of the StyleGAN architecture; this section will focus on the “how.” We will build the key components one by one, starting, as all projects do, with importing the essential libraries. To build a model as complex as StyleGAN, we’ll need a robust deep learning framework like PyTorch, along with several utilities for data handling and visualization.
This first cell sets up our complete environment for the project. We are importing torch and its core neural network modules, torch.nn, torch.optim, and torch.nn.functional, which provide the foundational tools for building our layers, defining our models, and setting up the Adam optimizer. For handling our CelebA-HQ dataset, we import Dataset and DataLoader to create an efficient data pipeline, along with torchvision.transforms and PIL.Image to load and preprocess our images into the correct tensor format. To visualize and save our results, we'll use torchvision.utils for creating image grids and matplotlib.pyplot for plotting training progress. Finally, we import standard utilities like os for file path management, numpy for numerical operations, and datetime to timestamp our saved models. Crucially, we set the random seeds for both PyTorch and NumPy to ensure our experiments are reproducible.
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from torchvision.utils import make_grid, save_image
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
torch.manual_seed(42)
np.random.seed(42)
Next, we define our CelebADataset class, which is a standard PyTorch Dataset tailored to load our CelebA-HQ images. This class is essential for connecting our model to the data on disk. In the __init__ method, it scans the specified root_dir to find all image files and stores their names. It also defines our critical image transformation pipeline. If no custom transform is provided, it creates a default one: it first resizes every image to the desired img_size, then converts it to a PyTorch tensor, and finally normalizes the pixel values to the [-1, 1] range, which is the standard practice for training GANs as it centers our data around 0. The __getitem__ method is responsible for loading a single image: it constructs the full file path, opens the image using PIL, ensures it's in 'RGB' format, and applies our transformation pipeline. It returns the processed image tensor and a dummy label of 0, as our GAN is trained in an unsupervised manner and the DataLoader expects a tuple.
class CelebADataset(Dataset):
    def __init__(self, root_dir='/kaggle/input/celeba-hq', img_size=64, transform=None):
        self.root = root_dir
        self.files = [f for f in os.listdir(root_dir) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
        self.files.sort()
        if len(self.files) == 0:
            raise ValueError(f"No images found in {root_dir}. Upload CelebA-HQ to /kaggle/input/celeba-hq/")
        print(f"Loaded {len(self.files)} images from {root_dir}")
        self.transform = transform or transforms.Compose([
            transforms.Resize((img_size, img_size)),
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])
    def __len__(self):
        return len(self.files)
    def __getitem__(self, idx):
        img_path = os.path.join(self.root, self.files[idx])
        image = Image.open(img_path).convert('RGB')
        if self.transform:
            image = self.transform(image)
        return image, 0
This cell defines a simple helper function, get_dataset. It acts as a convenient "factory" for our CelebADataset class. It takes the dataset root_dir and the img_size as arguments and simply returns a fully initialized CelebADataset object with the correct transformations, ready to be used by our DataLoader. This small abstraction simplifies our main training script by providing a single, clear function call to instantiate the dataset.
def get_dataset(root_dir='/kaggle/input/celeba-hq', img_size=64):
    return CelebADataset(root_dir, img_size=img_size)
Now we define the MappingNetwork, which is the first and one of the most critical components of the StyleGAN architecture. This class implements the 8-layer MLP we discussed, acting as the "translator" that maps the initial, entangled latent code z (from a simple Gaussian distribution) into the disentangled intermediate latent space W. In the __init__ method, we programmatically create a list of layers. A loop correctly sets the input dimension of the first layer to z_dim while all subsequent num_layers - 1 layers take w_dim as input. Each nn.Linear layer is followed by a nn.ReLU activation function to introduce non-linearity. Finally, nn.Sequential(*layers[:-1]) is used to build the network; the [:-1] is a deliberate choice to remove the final activation function, ensuring that the output w vector is not constrained and can represent any real-valued latent space, which is exactly what we want for our "style" code.
class MappingNetwork(nn.Module):
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers = []
        for i in range(num_layers):
            in_features = z_dim if i == 0 else w_dim
            layers += [nn.Linear(in_features, w_dim), nn.ReLU()]
        self.mapping = nn.Sequential(*layers[:-1])
    def forward(self, z):
        return self.mapping(z)
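As a quick sanity check (assuming the import cell above has been run), a batch of z vectors should map to w vectors of the same shape:
mapper = MappingNetwork()
z = torch.randn(4, 512)
print(mapper(z).shape)   # torch.Size([4, 512])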
This module, ConstantInput, implements the "Learned Constant Input" we discussed, which serves as the "blank canvas" for our generator. Instead of taking the latent code as input, the generator will start with this. In the __init__ method, we create a single tensor, self.input, with the shape [1, channels, size, size] (e.g., [1, 512, 4, 4]). By wrapping it in nn.Parameter, we register this tensor as a learnable parameter of the model, meaning it will be optimized and updated during training. The forward method's sole purpose is to take the desired batch_size and use the .repeat() function to "copy" this single, learned tensor along the batch dimension. This ensures that every single image generated in a batch begins from the exact same learned starting point, ready to be "sculpted" by the style and noise inputs.
class ConstantInput(nn.Module):
    def __init__(self, channels, size=4):
        super().__init__()
        self.input = nn.Parameter(torch.randn(1, channels, size, size))
    def forward(self, batch_size):
        return self.input.repeat(batch_size, 1, 1, 1)
This cell defines the NoiseInjection module, which is StyleGAN's dedicated mechanism for adding stochastic (random) details like freckles or hair strands. In its __init__ method, it creates a single learnable nn.Parameter, self.weight, initialized to zeros with a shape of [1, channels, 1, 1]. This parameter allows the network to learn a per-channel scaling factor for the noise, controlling its intensity. The forward method contains the main logic: if no external noise is provided, it generates a new, single-channel random noise tensor ([batch, 1, h, w]) that matches the input image's spatial dimensions. The core operation is image + self.weight * noise. Here, the per-channel self.weight is broadcast-multiplied with the single-channel noise, scaling the same random pattern differently for each channel. This scaled noise is then added directly to the feature map, allowing the network to learn to add just the right amount of randomness at each resolution level.
class NoiseInjection(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(1, channels, 1, 1))
    def forward(self, image, noise=None):
        if noise is None:
            batch, _, h, w = image.shape
            noise = torch.randn(batch, 1, h, w, device=image.device)
        return image + self.weight * noise
This cell defines the StyleMod class, which implements the core style modulation mechanism—this is the "adaptive" part of the AdaIN operation. This module's job is to take the disentangled w vector and use it to control the "style" of a feature map x. The __init__ method sets up the "Affine transformation" (a simple nn.Linear layer) that learns to map the w_dim style vector to a new vector of size num_features * 2. This output will be split to create the "scale" (gamma) and "bias" (beta) parameters for each channel. The forward method executes this: it generates the style vector from w, then uses a view and chunk operation (followed by a squeeze) to reshape and split it into gamma and beta tensors shaped [B, C, 1, 1], allowing them to be broadcast across the spatial dimensions of the feature map x. The final line, return x * (1 + gamma) + beta, applies the modulation. Using (1 + gamma) instead of just gamma is a common and important detail, as it centers the initial scaling factor around 1, which helps stabilize training by starting with an identity-like mapping.
class StyleMod(nn.Module):
    def __init__(self, w_dim, num_features):
        super().__init__()
        self.linear = nn.Linear(w_dim, num_features * 2)
    def forward(self, x, w):
        style = self.linear(w)                              # [B, 2*C]
        shape = [-1, 2, x.size(1)] + [1] * (x.dim() - 2)
        style = style.view(*shape)                          # [B, 2, C, 1, 1]
        gamma, beta = style.chunk(2, 1)                     # each [B, 1, C, 1, 1]
        gamma, beta = gamma.squeeze(1), beta.squeeze(1)     # each [B, C, 1, 1], broadcastable over H and W
        return x * (1 + gamma) + beta
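A quick shape check (using the modules defined above) confirms that the style parameters broadcast cleanly over the spatial dimensions:
style_mod = StyleMod(w_dim=512, num_features=256)
x = torch.randn(2, 256, 8, 8)
w = torch.randn(2, 512)
print(style_mod(x, w).shape)   # torch.Size([2, 256, 8, 8])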
Here, we define the StyleBlock, which is the core repeating unit of our generator's Synthesis Network. This class encapsulates all the key StyleGAN operations for a single resolution level. In the __init__ method, it sets up an nn.Conv2d layer to process spatial features, our NoiseInjection module for adding stochastic details, a LeakyReLU activation, our StyleMod module for applying the w vector, and an nn.Upsample layer to double the spatial resolution. The forward method defines the critical order of these operations: the input x is first passed through the conv layer, then the noise is added. After the activation function, the style modulation is applied, "styling" the feature map according to w. Finally, the block is upsampled, making it ready to be fed into the next StyleBlock at a higher resolution.
class StyleBlock(nn.Module):
    def __init__(self, in_channels, out_channels, w_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.noise = NoiseInjection(out_channels)
        self.activate = nn.LeakyReLU(0.2, inplace=True)
        self.style = StyleMod(w_dim, out_channels)
        self.upsample = nn.Upsample(scale_factor=2, mode='nearest')
    def forward(self, x, w, noise_modes=None):
        x = self.conv(x)
        x = self.noise(x, noise_modes)
        x = self.activate(x)
        x = self.style(x, w)
        x = self.upsample(x)
        return x
This SynthesisNetwork class assembles all our generator components into the final "image factory." In the __init__ method, it first calculates the number of channels for each resolution stage, starting with a high start_channels (e.g., 512) and progressively halving it at each level. It then initializes the ConstantInput to serve as the 4x4 "blank canvas." The core of the network is self.blocks, an nn.ModuleList that iteratively appends our StyleBlock modules, with each new block handling a higher resolution. Finally, a to_rgb 1x1 convolutional layer is defined to map the final feature map to the 3 output image channels. The forward method clearly illustrates the entire process: it starts by getting the constant_input, then passes it and the w vector through each StyleBlock in sequence. This progressively builds the image from 4x4 up to the target size. At the very end, it applies the to_rgb layer and a torch.tanh activation function to scale the output pixels to the [-1, 1] range, matching our dataset's normalization.
class SynthesisNetwork(nn.Module):
    def __init__(self, w_dim=512, img_channels=3, num_layers=4, start_channels=512):
        super().__init__()
        self.num_layers = num_layers
        self.w_dim = w_dim
        self.channels = {2**i: max(start_channels // 2**i, 8) for i in range(num_layers + 1)}
        self.constant_input = ConstantInput(self.channels[1])
        self.blocks = nn.ModuleList()
        for layer in range(num_layers):
            in_ch = self.channels[2**layer]
            out_ch = self.channels[2**(layer + 1)]
            self.blocks.append(StyleBlock(in_ch, out_ch, w_dim))
        self.to_rgb = nn.Conv2d(self.channels[2**num_layers], img_channels, 1)
    def forward(self, w):
        batch = w.shape[0]
        x = self.constant_input(batch)
        for i in range(self.num_layers):
            x = self.blocks[i](x, w)
        x = self.to_rgb(x)
        x = torch.tanh(x)
        return x
This Generator class ties our two main components together into the final, complete model. Its __init__ method is very simple: it just instantiates our MappingNetwork and SynthesisNetwork and holds them as sub-modules. The forward method then defines the generator's end-to-end data flow. It takes the initial latent vector z as input, passes it through the mapping network to get the disentangled intermediate vector w, and then passes w through the synthesis network to produce the final image. This clean wrapper encapsulates the entire generator architecture, from the initial random noise to the final generated image.
class Generator(nn.Module):
    def __init__(self, z_dim=512, w_dim=512, img_channels=3, num_layers=4):
        super().__init__()
        self.mapping = MappingNetwork(z_dim, w_dim)
        self.synthesis = SynthesisNetwork(w_dim, img_channels, num_layers)
    def forward(self, z):
        w = self.mapping(z)
        img = self.synthesis(w)
        return img
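To tie this back to the Style Mixing section, here is a hedged sketch of how coarse/fine mixing could be done at inference time with the simplified generator above. It applies one style per block rather than the paper's 18 style layers, and the helper name and crossover point are my own choices, not part of the original design:
def mix_styles(G, z1, z2, crossover=2):
    # Blocks before `crossover` are styled by w1 (coarse structure), the rest by w2 (finer details).
    G.eval()
    with torch.no_grad():
        w1, w2 = G.mapping(z1), G.mapping(z2)
        x = G.synthesis.constant_input(z1.size(0))
        for i, block in enumerate(G.synthesis.blocks):
            x = block(x, w1 if i < crossover else w2)
        return torch.tanh(G.synthesis.to_rgb(x))

G_demo = Generator(num_layers=4)
mixed = mix_styles(G_demo, torch.randn(2, 512), torch.randn(2, 512))
print(mixed.shape)   # torch.Size([2, 3, 64, 64])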
This cell defines the Discriminator, our "adversary" model. It's built as a classic downsampling convolutional network, similar in structure to a DCGAN discriminator. The __init__ method constructs the model by looping num_layers times, corresponding to the generator's upsampling stages. In each loop, it adds a nn.Conv2d with a 4x4 kernel and stride 2, effectively halving the spatial dimensions while doubling the channel count (up to a maximum of 512). The code includes two specific design choices: first, every strided convolution uses a padding of 1 so that each layer halves the resolution exactly (bringing a 64x64 input down to a 4x4 map); second, nn.BatchNorm2d is applied to all layers except the first, where nn.Identity is used as a placeholder. After the main downsampling loop, a final nn.Conv2d with a 4x4 kernel maps the resulting 4x4 feature map to a single output channel. The forward method simply passes the image through this Sequential model and flattens the output to produce the "realness" score for each image in the batch.
class Discriminator(nn.Module):
    def __init__(self, img_channels=3, num_layers=4):
        super().__init__()
        layers = []
        ch = img_channels
        for i in range(num_layers):
            next_ch = min(512, ch * 2)
            layers += [
                nn.Conv2d(ch, next_ch, 4, 2, 1),                      # stride-2 conv halves the resolution
                nn.LeakyReLU(0.2, inplace=True),
                nn.BatchNorm2d(next_ch) if i > 0 else nn.Identity()   # no BatchNorm on the first layer
            ]
            ch = next_ch
        layers += [nn.Conv2d(ch, 1, 4, 1, 0)]                         # 4x4 map -> single realness logit
        self.model = nn.Sequential(*layers)
    def forward(self, img):
        return self.model(img).view(img.size(0), -1)
This function brings all our components together to prepare for the training loop. First, it determines the correct device (preferring "cuda" if a GPU is available). It then initializes our CelebADataset and wraps it in a DataLoader to handle batching and shuffling. The most important step is the dynamic initialization of our models: it calculates the num_layers for both the Generator and Discriminator based on the target img_size using a logarithmic formula. This allows our code to be flexible and train on different resolutions (e.g., 32x32, 64x64, 128x128) without changing the model definitions. Finally, it sets up the Adam optimizers for both models, using the betas=(0.0, 0.99) recommended in the StyleGAN paper, and initializes our BCEWithLogitsLoss as the criterion. The function then returns all these objects, neatly packaged and ready to be used in our training loop.
def setup_training(img_size=64, batch_size=8, num_workers=2):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")
    dataset = get_dataset(img_size=img_size)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    z_dim = 512
    # Number of doublings from the 4x4 constant up to img_size (e.g. 4 for 64x64),
    # and the same number of stride-2 layers to bring img_size back down to 4x4.
    num_layers = int(np.log2(img_size)) - 2
    G = Generator(z_dim=z_dim, num_layers=num_layers).to(device)
    D = Discriminator(num_layers=num_layers).to(device)
    optG = optim.Adam(G.parameters(), lr=0.0002, betas=(0.0, 0.99))
    optD = optim.Adam(D.parameters(), lr=0.0002, betas=(0.0, 0.99))
    criterion = nn.BCEWithLogitsLoss()
    print(f"Setup complete: {len(dataset)} samples, batch_size={batch_size}, img_size={img_size}x{img_size}")
    return device, dataloader, G, D, optG, optD, criterion, z_dim
This function encapsulates the core training loop for a single epoch, implementing the adversarial “minimax” game. It iterates through the dataloader, and for each batch, it performs the two essential updates. First, the Discriminator is trained: its gradients are zeroed (optD.zero_grad()), and it computes its loss on the batch of real_imgs, comparing its predictions to "real" labels (all ones). A batch of fake_imgs is then generated, and .detach() is called. This is a critical step, as it prevents gradients from flowing back into the Generator during the discriminator's update. These "detached" fakes are passed to the Discriminator, and the resulting loss is calculated against "fake" labels (all zeros). The real and fake losses are summed, backpropagation is performed, and the discriminator's weights are updated (optD.step()).
Second, the Generator is trained: its gradients are zeroed (optG.zero_grad()), and the same fake_imgs from the previous step are passed through the Discriminator again, this time without detaching them so that gradients can flow back into the Generator. The Generator's loss is calculated by checking how well its fakes fooled the Discriminator (i.e., how close the predictions were to the real_label of all ones). This g_loss is then backpropagated, and the Generator's weights are updated (optG.step()). The function also handles logging progress, saving image samples, and returning the average losses for the epoch.
def train_epoch(dataloader, G, D, optG, optD, criterion, device, z_dim, epoch, save_dir='/kaggle/working/', visualize_every=1):
    G.train()
    D.train()
    d_losses, g_losses = [], []
    for i, (real_imgs, _) in enumerate(dataloader):
        batch = real_imgs.size(0)
        real_imgs = real_imgs.to(device)
        real_label = torch.ones(batch, 1, device=device)
        fake_label = torch.zeros(batch, 1, device=device)
        # Discriminator update
        optD.zero_grad()
        real_pred = D(real_imgs)
        d_real_loss = criterion(real_pred, real_label)
        z = torch.randn(batch, z_dim, device=device)
        fake_imgs = G(z)
        fake_pred = D(fake_imgs.detach())          # detach so no gradients reach G here
        d_fake_loss = criterion(fake_pred, fake_label)
        d_loss = d_real_loss + d_fake_loss
        d_loss.backward()
        optD.step()
        d_losses.append(d_loss.item())
        # Generator update
        optG.zero_grad()
        fake_pred = D(fake_imgs)                   # no detach: gradients must flow back into G
        g_loss = criterion(fake_pred, real_label)
        g_loss.backward()
        optG.step()
        g_losses.append(g_loss.item())
        if i % 10 == 0:
            print(f'Epoch {epoch+1} | Batch {i}/{len(dataloader)} | D: {d_loss.item():.4f} | G: {g_loss.item():.4f}')
    avg_d_loss = np.mean(d_losses)
    avg_g_loss = np.mean(g_losses)
    print(f'Epoch {epoch+1} Avg | D: {avg_d_loss:.4f} | G: {avg_g_loss:.4f}')
    if (epoch + 1) % visualize_every == 0:
        visualize_training(G, device, z_dim, epoch, save_dir)
    return avg_d_loss, avg_g_loss
This train_stylegan function is the main driver that orchestrates the entire training process. First, it ensures the save_dir exists and then calls our setup_training function to initialize the device, dataloader, models, optimizers, and loss function. It then enters the main training loop, which runs for the specified num_epochs. Inside this loop, it calls train_epoch to perform one full pass over the dataset, capturing and storing the average generator and discriminator losses for that epoch. At the end of each epoch, it saves the current state of both the Generator and Discriminator to .pth files. After the loop completes, it uses matplotlib to generate and save a plot of the 'D' and 'G' losses over time, allowing us to visualize the training dynamics, before returning the fully trained models and the complete loss history.
def train_stylegan(num_epochs=10, img_size=64, batch_size=8, save_dir='/kaggle/working/', visualize_every=1):
    os.makedirs(save_dir, exist_ok=True)
    device, dataloader, G, D, optG, optD, criterion, z_dim = setup_training(img_size, batch_size)
    losses = {'D': [], 'G': []}
    for epoch in range(num_epochs):
        d_loss, g_loss = train_epoch(dataloader, G, D, optG, optD, criterion, device, z_dim, epoch, save_dir, visualize_every)
        losses['D'].append(d_loss)
        losses['G'].append(g_loss)
        torch.save(G.state_dict(), os.path.join(save_dir, f'G_epoch_{epoch+1}.pth'))
        torch.save(D.state_dict(), os.path.join(save_dir, f'D_epoch_{epoch+1}.pth'))
    plt.figure(figsize=(10, 5))
    plt.plot(losses['D'], label='Discriminator')
    plt.plot(losses['G'], label='Generator')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.title('Training Losses')
    plt.savefig(os.path.join(save_dir, 'training_losses.png'))
    plt.show()
    print(f'Training complete! Check {save_dir} for models, images, and plots.')
    return G, D, losses
This function is our primary tool for visualizing the Generator's capabilities at inference time. It first sets the generator to evaluation mode (G.eval()) and disables gradient calculations with torch.no_grad(), both of which are crucial for saving memory and ensuring consistent output. It then generates a batch of random z vectors and passes them through the Generator to create samples, which are moved back to the CPU with .cpu(). A critical step follows: the images are denormalized using (samples + 1) / 2, converting them from the network's [-1, 1] range to the [0, 1] range required for saving and viewing. The make_grid utility is then used to arrange the batch of images into a single grid. Finally, the function provides options to either save the grid to a file (save_path) and/or display it directly in the notebook (show_plot), offering a flexible way to inspect the model's output.
def generate_samples(G, device, z_dim=512, num_samples=16, save_path=None, show_plot=True):
    G.eval()
    with torch.no_grad():
        z = torch.randn(num_samples, z_dim, device=device)
        samples = G(z).cpu()
    samples = (samples + 1) / 2
    grid = make_grid(samples, nrow=int(np.sqrt(num_samples)), normalize=True)
    if save_path:
        save_image(grid, save_path)
        print(f'Saved samples to {save_path}')
    if show_plot:
        plt.figure(figsize=(8, 8))
        plt.imshow(grid.permute(1, 2, 0).numpy())
        plt.axis('off')
        plt.title('Generated Samples')
        plt.show()
    return samples
This is a simple helper function designed to be called from within our train_epoch loop. Its sole purpose is to orchestrate the saving of sample images at the end of an epoch. It constructs a unique file name for the output image based on the current epoch number and the save_dir. It then calls our main generate_samples function, passing in the generator and the newly created sample_path. Critically, it sets show_plot=False to prevent matplotlib plots from being generated every single epoch, which would clutter the training log. Instead, it just saves the image grid quietly in the background and prints a confirmation of where the file was saved.
def visualize_training(G, device, z_dim, epoch, save_dir):
    sample_path = os.path.join(save_dir, f'samples_epoch_{epoch+1}.png')
    generate_samples(G, device, z_dim, num_samples=16, save_path=sample_path, show_plot=False)
    print(f'Visualized samples for epoch {epoch+1} at {sample_path}')
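With every piece defined, a typical end-to-end run might look like the snippet below. The epoch count, resolution, batch size, and output path are just example values, and the dataset is assumed to live at the /kaggle/input/celeba-hq path used by CelebADataset:
G, D, losses = train_stylegan(num_epochs=10, img_size=64, batch_size=8)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
generate_samples(G, device, z_dim=512, num_samples=16,
                 save_path='/kaggle/working/final_samples.png')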
Thank you for following along on this deep dive into the StyleGAN architecture. We’ve journeyed from the foundational concepts of ProGAN and its “black box” limitations, through the brilliant series of innovations that define StyleGAN — from the Z-to-W Mapping Network and Learned Constant Input to the powerful AdaIN and Noise Injection mechanisms. We saw how these components come together to create a generator that doesn’t just produce images, but offers unprecedented, disentangled control over the style of those images.
By translating that theory into a hands-on PyTorch implementation, we’ve demystified the architecture and built a tangible, working model from scratch. The code we’ve written is a starting point. I encourage you to take it and experiment: train it for more epochs, adapt it to higher resolutions, or implement the Style Mixing technique we discussed to create your own unique hybrid images.
I hope this article has not only helped you understand how StyleGAN works but also inspired you to appreciate the elegance of its design. Happy generating!
