AI Learns to Perfect Lighting in Photos Using Smart Masks and Creative Training

by @autoencoder


Too Long; Didn't Read

Researchers at Beeble AI have developed a method for improving how light and shadows can be applied to human portraits in digital images.

Authors:

(1) Hoon Kim, Beeble AI (contributed equally to this work);

(2) Minje Jang, Beeble AI (contributed equally to this work);

(3) Wonjun Yoon, Beeble AI (contributed equally to this work);

(4) Jisoo Lee, Beeble AI (contributed equally to this work);

(5) Donghyun Na, Beeble AI (contributed equally to this work);

(6) Sanghyun Woo, New York University (contributed equally to this work).

Editor's Note: This is Part 7 of 14 of a study introducing a method for improving how light and shadows can be applied to human portraits in digital images. Read the rest below.



4. Multi-Masked Autoencoder Pre-training

We introduce the Multi-Masked Autoencoder (MMAE), a self-supervised pre-training framework designed to enhance feature representations in relighting models. It aims to improve output quality without relying on additional, costly light stage data. Building upon the MAE framework [19], MMAE capitalizes on the inherent learning of crucial image features such as structure, color, and texture, which are essential for relighting tasks. However, adapting MAE to our specific needs poses several non-trivial challenges. First, MAE is designed primarily for vision transformers [15], whereas our focus is on a UNet, a convolution-based architecture. This convolutional structure, with its hierarchical nature and aggressive pooling, is known to simplify the MAE task, necessitating careful adaptation [50]. Further, the hyperparameters of MAE, particularly the fixed mask size and ratio, are also specific to vision transformers. These factors could introduce biases during training and hinder the model's ability to recognize image features at various scales. Moreover, MAE relies solely on a masked-region reconstruction loss, limiting the model's ability to understand the global coherence of the reconstructed region in relation to its visible context.


To address these challenges effectively, we have developed two key strategies within the MMAE framework:


Dynamic Masking. MMAE eliminates two key hyperparameters, mask size and ratio, by introducing a variety of mask types that generalize the MAE. These types, which include overlapping patches of various sizes, outpainting masks [46], and free-form masks [29] (see Fig. 5), each contribute to the model's versatility. By handling such challenging masked regions, MMAE gains a more comprehensive understanding of image properties. A sketch of such a dynamic mask sampler is given below.
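
The following is a minimal NumPy sketch of what a dynamic mask sampler of this kind could look like. The specific patch sizes, masking-ratio range, stroke counts, and brush width are illustrative assumptions of this sketch, not the paper's settings.

```python
import numpy as np

def random_patch_mask(h, w, rng):
    """Overlapping square patches of a randomly drawn size and overall ratio."""
    mask = np.zeros((h, w), dtype=np.float32)
    patch = int(rng.choice([8, 16, 32, 64]))          # assumed size set
    ratio = rng.uniform(0.4, 0.8)                     # assumed masking-ratio range
    n_patches = int(ratio * h * w / patch ** 2)
    for _ in range(n_patches):
        y = rng.integers(0, max(1, h - patch))
        x = rng.integers(0, max(1, w - patch))
        mask[y:y + patch, x:x + patch] = 1.0          # patches may overlap
    return mask

def outpainting_mask(h, w, rng):
    """Keep a random central rectangle visible; mask everything outside it."""
    mask = np.ones((h, w), dtype=np.float32)
    kh = int(h * rng.uniform(0.3, 0.6))
    kw = int(w * rng.uniform(0.3, 0.6))
    y = rng.integers(0, h - kh)
    x = rng.integers(0, w - kw)
    mask[y:y + kh, x:x + kw] = 0.0
    return mask

def free_form_mask(h, w, rng, n_strokes=8, brush=12):
    """Thick random-walk strokes, in the spirit of free-form inpainting masks."""
    mask = np.zeros((h, w), dtype=np.float32)
    ys, xs = np.ogrid[:h, :w]
    for _ in range(n_strokes):
        y, x = int(rng.integers(0, h)), int(rng.integers(0, w))
        for _ in range(int(rng.integers(10, 30))):    # steps per stroke
            mask[(ys - y) ** 2 + (xs - x) ** 2 <= brush ** 2] = 1.0
            y = int(np.clip(y + rng.integers(-brush, brush + 1), 0, h - 1))
            x = int(np.clip(x + rng.integers(-brush, brush + 1), 0, w - 1))
    return mask

def sample_mask(h, w, rng=None):
    """Dynamic masking: draw one mask type uniformly at random each training step."""
    rng = rng or np.random.default_rng()
    makers = [random_patch_mask, outpainting_mask, free_form_mask]
    return makers[int(rng.integers(len(makers)))](h, w, rng)
```

Because a fresh mask family, size, and ratio are drawn every step, the model never overfits to a single fixed masking configuration.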


Generative Target. In addition to its structural advancements, MMAE incorporates a new loss function strategy. We adopt perceptual [24] and adversarial [22] losses alongside the original reconstruction loss. As a result, MMAE learns not only to reconstruct missing image parts but also to synthesize content that integrates seamlessly with the surrounding context. In practice, the weights of the three losses are set equally. A sketch of this combined objective is given below.
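
Below is a minimal PyTorch sketch of a combined objective in this spirit. The paper specifies reconstruction, perceptual [24], and adversarial [22] terms with equal weights; the choice of L1 distances, the VGG-16 feature depth, and the non-saturating GAN formulation here are assumptions of the sketch, and `discriminator` stands in for an unspecified image discriminator.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

class VGGPerceptual(torch.nn.Module):
    """Feature-space L1 distance on early VGG-16 layers (a stand-in perceptual loss)."""
    def __init__(self, n_layers=16):
        super().__init__()
        self.features = vgg16(weights="DEFAULT").features[:n_layers].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, pred, target):
        return F.l1_loss(self.features(pred), self.features(target))

def mmae_loss(pred, target, mask, discriminator, perceptual,
              w_rec=1.0, w_per=1.0, w_adv=1.0):
    """Masked reconstruction + perceptual + adversarial terms, equally weighted by default.

    pred, target: (B, 3, H, W) images; mask: (B, 1, H, W) with 1 = masked pixel.
    discriminator: any image discriminator returning per-sample or per-patch logits.
    """
    # L1 computed over masked pixels only, as in MAE-style reconstruction
    rec = (torch.abs(pred - target) * mask).sum() / mask.sum().clamp(min=1.0)
    # Perceptual term encourages global coherence of the synthesized content
    per = perceptual(pred, target)
    # Non-saturating generator loss: -log sigmoid(D(pred))
    adv = F.softplus(-discriminator(pred)).mean()
    return w_rec * rec + w_per * per + w_adv * adv
```

The discriminator would be trained with the usual opposing objective on real versus reconstructed images; only the generator-side term is shown here.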


We pre-train the entire UNet architecture using MMAE, and, unlike MAE, we retain the decoder and fine-tune the entire model on relighting ground truths.
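
As a rough illustration of this protocol (not the authors' released code), the sketch below loads MMAE-pre-trained weights into the same UNet, keeps the decoder, and unfreezes all parameters for supervised fine-tuning on relighting ground truths; the checkpoint path and helper name are hypothetical.

```python
import torch
import torch.nn as nn

def finetune_from_mmae(unet: nn.Module, ckpt_path: str = "mmae_pretrain.pt") -> nn.Module:
    """Carry over MMAE-pre-trained weights and fine-tune the whole network.

    Unlike MAE, which discards its decoder after pre-training, both the encoder
    and decoder weights are retained before supervised fine-tuning.
    """
    state = torch.load(ckpt_path, map_location="cpu")
    unet.load_state_dict(state)                # encoder AND decoder carried over
    for p in unet.parameters():
        p.requires_grad_(True)                 # nothing frozen: full fine-tuning
    return unet
```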


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.