AI System Uses Teamwork to Create Picture-Perfect Lighting by @autoencoder



Too Long; Didn't Read

Researchers at Beeble AI have developed a method for improving how light and shadows can be applied to human portraits in digital images.

Authors:

(1) Hoon Kim, Beeble AI (contributed equally to this work);

(2) Minje Jang, Beeble AI (contributed equally to this work);

(3) Wonjun Yoon, Beeble AI (contributed equally to this work);

(4) Jisoo Lee, Beeble AI (contributed equally to this work);

(5) Donghyun Na, Beeble AI (contributed equally to this work);

(6) Sanghyun Woo, New York University (contributed equally to this work).

Editor's Note: This is Part 11 of 14 of a study introducing a method for improving how light and shadows can be applied to human portraits in digital images. Read the rest below.


Appendix

A. Implementation Details


We pre-train a single U-Net architecture during this process. In the subsequent fine-tuning stage, the weights from this pre-trained model are transferred to multiple U-Nets: NormalNet, DiffuseNet, SpecularNet, and RenderNet. In contrast, IllumNet, which does not follow the U-Net architecture, is initialized with random weights. To ensure compatibility with the varying input channels of each network, we modify the weights as necessary. For example, weights pre-trained for RGB (3-channel) inputs are copied and adapted to fit networks with 6- or 9-channel inputs.
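The paper only states that pre-trained RGB weights are "copied and adapted" to wider inputs. One common heuristic, sketched below with NumPy, is to tile the 3-channel kernel across the new input channels and rescale so activation magnitudes are roughly preserved; the function name and the rescaling choice are assumptions for illustration, not the paper's stated procedure.

```python
import numpy as np

def adapt_input_weights(w_rgb, new_in_ch):
    """Adapt a conv kernel pretrained on 3 input channels to new_in_ch.

    w_rgb: array of shape (out_ch, 3, kh, kw).
    The RGB weights are tiled across the new input channels and divided
    by the repeat count so the expected output magnitude stays similar
    (a common heuristic; the exact adaptation used in the paper is not
    specified).
    """
    out_ch, in_ch, kh, kw = w_rgb.shape
    assert in_ch == 3 and new_in_ch % 3 == 0
    repeats = new_in_ch // 3
    return np.tile(w_rgb, (1, repeats, 1, 1)) / repeats

# Example: adapt a 64-filter RGB kernel to a 6-channel input
# (e.g., an image concatenated with another 3-channel map).
w = np.random.randn(64, 3, 3, 3)
w6 = adapt_input_weights(w, 6)
print(w6.shape)  # (64, 6, 3, 3)
```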


Data To generate relighting training pairs, we randomly select images from the OLAT dataset. Two randomly chosen HDRI lighting environment maps are then projected onto each image to form a training pair. All images are processed in linear space. To manage the dynamic range effectively, we apply logarithmic normalization using the log(1 + x) function.


Architecture SwitchLight employs a U-Net-based architecture, consistently applied across its NormalNet, DiffuseNet, SpecularNet, and RenderNet. This design is inspired by recent advancements in diffusion-based models [12]. Unlike standard diffusion methods, we omit the temporal embedding layer. The architecture is characterized by several hyperparameters: the number of input channels, a base channel count, and channel multipliers that determine the channel count at each stage. Each downsampling stage features two residual blocks, with attention mechanisms integrated at certain resolutions. The key hyperparameters and their corresponding values are summarized in Table 4.
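To make the roles of these hyperparameters concrete, here is how a channel-multiplier scheme determines per-stage widths. The numeric values below are placeholders typical of diffusion-style U-Nets, not the actual entries of Table 4:

```python
# Illustrative hyperparameters in the style of Table 4; these values are
# placeholders, NOT the paper's actual settings (see Table 4 for those).
config = {
    "in_ch": 3,                # input channels (varies per network)
    "base_ch": 64,             # base channel count
    "ch_mults": (1, 2, 4, 8),  # per-stage channel multipliers
    "num_res": 2,              # residual blocks per downsampling stage
    "head_ch": 64,             # channels per attention head
    "att_res": (16, 8),        # feature resolutions where attention is applied
    "out_ch": 3,
}

def stage_channels(base_ch, ch_mults):
    """Channel count at each U-Net stage: base_ch scaled by its multiplier."""
    return [base_ch * m for m in ch_mults]

print(stage_channels(config["base_ch"], config["ch_mults"]))  # [64, 128, 256, 512]
```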


IllumNet is composed of two projection layers, one for transforming the Phong lobe features and another for image features, with the latter using normal bottleneck features as a compact form of image representation. Following this, a cross-attention layer is employed, wherein the Phong lobe serves as the query and the image features function as both key and value. Finally, an output layer generates the final convolved source HDRI.
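The cross-attention step can be sketched as follows: projected Phong-lobe features act as queries and projected image features serve as both keys and values. This is a minimal single-head NumPy illustration with random placeholder projections, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, d):
    """Single-head cross-attention: `query` attends over `context`.

    Here `query` stands in for projected Phong-lobe features and
    `context` for projected image (normal bottleneck) features; the
    projection matrices are random placeholders for illustration.
    """
    wq = np.random.randn(query.shape[-1], d) / np.sqrt(query.shape[-1])
    wk = np.random.randn(context.shape[-1], d) / np.sqrt(context.shape[-1])
    wv = np.random.randn(context.shape[-1], d) / np.sqrt(context.shape[-1])
    q, k, v = query @ wq, context @ wk, context @ wv
    attn = softmax(q @ k.T / np.sqrt(d))   # (n_query, n_context) weights
    return attn @ v                        # each query gets a context mixture

lobe = np.random.randn(16, 32)    # 16 Phong-lobe tokens (hypothetical sizes)
feats = np.random.randn(64, 128)  # 64 image-feature tokens
out = cross_attention(lobe, feats, d=32)
print(out.shape)  # (16, 32)
```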


Table 4. Network Architecture Parameters. This table outlines the key hyperparameters and their corresponding values: initial input channels (In ch), base channels (Base ch), and channel multipliers (Ch mults) that set the stage-specific channel counts. It also indicates the number of residual blocks per stage (Num res), the number of channels per head (Head ch), the stages where attention mechanisms are applied based on feature resolution (Att res), and the final output channels (Out ch).


The Discriminator network is utilized during both the pre-training and fine-tuning stages, maintaining the same architectural design, although the weights are not shared between these stages. The network is composed of a series of residual blocks, each containing two 3×3 convolution layers interspersed with Leaky ReLU activations. The number of filters progressively increases across these blocks: 64, 128, 256, and 512. As the filter count increases, the spatial resolution of the features decreases; finally, the network compresses its output with a 3×3 convolution into a single channel, yielding a probability value.
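The trade-off described above (channels double while resolution shrinks) can be tabulated with a small helper. The assumption that each stage halves the spatial resolution is ours; the paper only states that resolution decreases as the filter count grows:

```python
def discriminator_schedule(in_res, filters=(64, 128, 256, 512)):
    """Per-stage (filters, resolution) of the discriminator, assuming each
    residual block halves the spatial resolution (an assumption; the paper
    states only that resolution decreases as filters increase)."""
    res = in_res
    stages = []
    for f in filters:
        res //= 2               # stride-2 downsampling per stage (assumed)
        stages.append((f, res))
    return stages

# For a hypothetical 256x256 input:
print(discriminator_schedule(256))  # [(64, 128), (128, 64), (256, 32), (512, 16)]
```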


Regarding the activation functions across the networks: NormalNet processes its outputs through ℓ2 normalization, ensuring they are unit normal vectors. IllumNet, DiffuseNet, and RenderNet utilize a softplus activation (with β = 20) to generate non-negative pixel values. SpecularNet employs a sigmoid activation function, ensuring that both the roughness parameter and Fresnel reflectance values fall within the range of 0 to 1.
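The three output activations can be written compactly; this is a NumPy sketch of the standard definitions (ℓ2 normalization, softplus with β, sigmoid), with an epsilon added to the normalization for stability as an implementation detail of ours:

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    """NormalNet output: rescale vectors to unit length (unit normals)."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def softplus(x, beta=20.0):
    """IllumNet/DiffuseNet/RenderNet output: smooth, non-negative values.
    softplus(x) = log(1 + exp(beta*x)) / beta, computed via logaddexp
    to avoid overflow for large beta*x."""
    return np.logaddexp(0.0, beta * x) / beta

def sigmoid(x):
    """SpecularNet output: squashes roughness and Fresnel reflectance
    into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# A (3, 0, 4) vector normalizes to (0.6, 0, 0.8).
n = l2_normalize(np.array([[3.0, 0.0, 4.0]]))
```

With β = 20 the softplus closely tracks ReLU while staying smooth near zero, which matches its use here for strictly non-negative pixel intensities.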


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.