Authors:
(1) Hyosun Park, Department of Astronomy, Yonsei University, Seoul, Republic of Korea;
(2) Yongsik Jo, Artificial Intelligence Graduate School, UNIST, Ulsan, Republic of Korea;
(3) Seokun Kang, Artificial Intelligence Graduate School, UNIST, Ulsan, Republic of Korea;
(4) Taehwan Kim, Artificial Intelligence Graduate School, UNIST, Ulsan, Republic of Korea;
(5) M. James Jee, Department of Astronomy, Yonsei University, Seoul, Republic of Korea and Department of Physics and Astronomy, University of California, Davis, CA, USA.
Table of Links
2 Method
2.1. Overview and 2.2. Encoder-Decoder Architecture
2.3. Transformers for Image Restoration
4 JWST Test Dataset Results and 4.1. PSNR and SSIM
4.3. Restoration of Morphological Parameters
4.4. Restoration of Photometric Parameters
5.2. Restoration of Multi-epoch HST Images and Comparison with Multi-epoch JWST Images
6 Limitations
6.1. Degradation in Restoration Quality Due to High Noise Level
6.2. Point Source Recovery Test
6.3. Artifacts Due to Pixel Correlation
7 Conclusions and Acknowledgements
Appendix: A. Image Restoration Test with Blank Noise-Only Images
2.3. Transformers for Image Restoration
In the Transformer, the encoder consists of multiple layers of self-attention mechanisms, each followed by a position-wise feed-forward neural network. “Attention” refers to a mechanism that allows a model to focus on specific parts of the input data while processing it: the model selectively weighs different parts of the input, assigning more importance to relevant information and down-weighting irrelevant or less important parts. The key idea behind attention is to dynamically compute weights for different parts of the input data, such as words in a sentence or pixels in an image, based on their relevance to the current task. In self-attention, each element (e.g., word or pixel) in the input sequence is compared with every other element to compute attention weights, which represent the importance of each element with respect to the others. These attention weights are then used to compute a weighted sum of the input elements, producing an attention-based representation that highlights the relevant information.
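To make this concrete, the following is a minimal sketch of scaled dot-product self-attention for a generic input sequence. It is illustrative only, not the formulation used in Restormer; the function name, array names, and toy dimensions are ours.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Minimal scaled dot-product self-attention.

    x          : (n, d) array of n input elements (e.g., words or pixels).
    Wq, Wk, Wv : (d, d) learned projection matrices (random here).
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv           # queries, keys, values
    scores = q @ k.T / np.sqrt(x.shape[1])     # (n, n) pairwise relevance
    # Row-wise softmax -> attention weights that sum to 1 for each element.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v                         # weighted sum of the inputs

rng = np.random.default_rng(0)
n, d = 6, 4                                    # toy sequence length and width
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)            # (6, 4) attention-based output
```

Every element attends to every other element, which gives the Transformer its global receptive field but also makes the cost grow quadratically with sequence length, a point we return to below.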
The Transformer decoder likewise consists of multiple layers of self-attention mechanisms, along with additional (cross-)attention over the encoder’s output. The decoder predicts one element of the output sequence at a time, conditioned on the previously generated elements and the encoded representation of the input sequence.
The Transformer architecture was initially proposed and applied to machine translation, the task of translating text from one language to another. Its success there demonstrated its effectiveness at capturing long-range dependencies in sequences and at handling sequential data more efficiently than traditional architectures. This breakthrough sparked widespread interest in the Transformer, leading to its adoption and adaptation for various image processing tasks. Transformers have since shown promising results in tasks traditionally dominated by CNNs, such as image classification, object detection, semantic segmentation, and image generation, in part because Transformer models capture long-range pixel correlations more effectively than CNN-based models.
However, applying the Transformer to large images is challenging with its original implementation, which applies self-attention layers to pixels, because the computational complexity escalates quadratically with the pixel count. Zamir et al. (2022) overcame this obstacle by substituting the original self-attention block with the MDTA (Multi-Dconv Head Transposed Attention) block, which implements self-attention in the feature domain so that the complexity grows only linearly with the number of pixels. We propose to use Zamir et al. (2022)’s efficient Transformer, Restormer, to apply deconvolution and denoising to astronomical images. We briefly describe the two core components of Restormer in §2.3.1 and §2.3.2 and refer readers to Zamir et al. (2022) for further technical details.
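To make the scaling argument concrete (in our notation, not the paper’s): for a feature map with H × W pixels and C channels, pixel-wise self-attention forms an (HW) × (HW) attention map, whereas channel-wise attention forms only a C × C map:

```latex
\underbrace{\mathcal{O}\big((HW)^2\,C\big)}_{\text{pixel-wise self-attention}}
\qquad\text{vs.}\qquad
\underbrace{\mathcal{O}\big(HW\,C^2\big)}_{\text{channel-wise (MDTA) attention}}
```

Since C is fixed by the architecture while HW grows with image size, the MDTA cost is linear in the number of pixels.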
2.3.1. MDTA block
MDTA is a core module of Restormer. Rather than attending over pixels, MDTA performs self-attention along the channel dimension: query-key interactions are computed between the channels of the input feature map. By modeling these inter-channel interactions, MDTA learns the global context necessary for image restoration tasks.
MDTA also employs depth-wise convolution to accentuate local context. The block thus models both the global and the local context of the input image.
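The sketch below condenses this computation into a single-head module. It is our simplification of the design described in Zamir et al. (2022), not their exact implementation; the class and variable names are ours, and details such as multi-head splitting and bias handling are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDTASketch(nn.Module):
    """Simplified Multi-Dconv Head Transposed Attention (single head)."""

    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        # Depth-wise 3x3 convolution accentuates local context before attention.
        self.qkv_dw = nn.Conv2d(channels * 3, channels * 3, kernel_size=3,
                                padding=1, groups=channels * 3)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)
        self.temperature = nn.Parameter(torch.ones(1))  # learned softmax scale

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        # Flatten the spatial dimensions: attention acts across channels,
        # so the attention map is (c x c) rather than (hw x hw).
        q = F.normalize(q.reshape(b, c, h * w), dim=-1)
        k = F.normalize(k.reshape(b, c, h * w), dim=-1)
        v = v.reshape(b, c, h * w)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.temperature, dim=-1)
        return self.out((attn @ v).reshape(b, c, h, w)) + x  # residual connection

y = MDTASketch(48)(torch.randn(1, 48, 64, 64))  # e.g., a 64x64 cutout, 48 features
```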
2.3.2. GDFN block
GDFN, short for Gated-Dconv Feed-Forward Network, is the other core module of Restormer. It augments the conventional feed-forward network with a gating mechanism that improves the information flow, yielding high-quality outcomes for image restoration tasks.
GDFN controls the information flow through gating layers, formed by the element-wise multiplication of two linear projection layers, one of which is activated by the Gaussian Error Linear Unit (GELU) non-linearity. This allows GDFN to suppress less informative features and to propagate only the valuable information through the hierarchy. Like MDTA, GDFN employs local content mixing via depth-wise convolution, emphasizing the local context of the input image and providing a more robust information flow for image restoration.
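A corresponding sketch of the GDFN path, again our simplification rather than the authors’ exact implementation (the names and the expansion factor are ours): two linear (1 × 1 convolution) projections are mixed depth-wise, one branch is passed through GELU, and their element-wise product gates the information flow.

```python
import torch.nn as nn
import torch.nn.functional as F

class GDFNSketch(nn.Module):
    """Simplified Gated-Dconv Feed-Forward Network."""

    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        # Two linear projections computed in one 1x1 convolution.
        self.proj_in = nn.Conv2d(channels, hidden * 2, kernel_size=1)
        # Depth-wise 3x3 convolution provides the local content mixing.
        self.dw = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                            padding=1, groups=hidden * 2)
        self.proj_out = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        x1, x2 = self.dw(self.proj_in(x)).chunk(2, dim=1)
        # Gating: the GELU-activated branch suppresses less informative
        # features in the other branch via element-wise multiplication.
        return self.proj_out(F.gelu(x1) * x2) + x  # residual connection
```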
This paper is available on arxiv under CC BY 4.0 Deed license.