This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Mattia Atzeni, EPFL, Switzerland and [email protected];
(2) Mrinmaya Sachan, ETH Zurich, Switzerland;
(3) Andreas Loukas, Prescient Design, Switzerland.
This section provides additional details on the experimental setup of all our experiments, including further information on the generation of the synthetic tasks and the data annotation process for ARC.
We considered four categories of tasks, namely translation, rotation, reflection, and scaling. Each task is defined in terms of input-output pairs, which are sampled from the set of all ARC grids and padded to a size of 30 × 30 cells. A synthetic transformation is applied to each input grid to obtain the corresponding output grid. For each task in each category, we generated 2048 training pairs and 100 test pairs.
For translation tasks, there are 900 possible translations in a 30 × 30 grid. However, generating data and training models on 900 tasks is computationally expensive, so we randomly sampled 5 translations in the interval [1, 29] × [1, 29], obtaining a total of 100 translation tasks. Rotation tasks include all 4-fold rotations except the identity. Similarly, reflection tasks involve horizontal, vertical, and diagonal reflections. Scaling tasks include all possible up- and down-scaling transformations of the input grid by factors in [2, 5] × [2, 5], for a total of 32 scaling tasks.
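As an illustration, the sketch below shows how such pairs can be generated with NumPy. The function names, the zero padding, and the use of cyclic shifts for translations are our assumptions, not the exact implementation used in the paper.

```python
import numpy as np

def pad_to_30(grid: np.ndarray) -> np.ndarray:
    """Zero-pad an ARC grid to the common 30 x 30 size."""
    out = np.zeros((30, 30), dtype=grid.dtype)
    out[: grid.shape[0], : grid.shape[1]] = grid
    return out

def make_pair(grid: np.ndarray, category: str, **params):
    """Build one input-output pair by applying a lattice transformation."""
    x = pad_to_30(grid)
    if category == "translation":        # cyclic shift by (dy, dx)
        y = np.roll(x, shift=(params["dy"], params["dx"]), axis=(0, 1))
    elif category == "rotation":         # 4-fold rotations, identity excluded
        y = np.rot90(x, k=params["k"])   # k in {1, 2, 3}
    elif category == "reflection":       # horizontal, vertical, or diagonal
        y = {"h": np.fliplr, "v": np.flipud, "d": np.transpose}[params["axis"]](x)
    elif category == "scaling":          # up-scaling by integer factors (sy, sx);
        # down-scaling (subsampling) would be handled analogously
        y = np.kron(x, np.ones((params["sy"], params["sx"]), dtype=x.dtype))[:30, :30]
    else:
        raise ValueError(f"unknown category: {category}")
    return x, y
```

For example, `make_pair(grid, "rotation", k=1)` yields a pair of grids related by a 90° rotation.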
The models are evaluated based on the mean accuracy over the tasks in each category. For each task, we compute the accuracy on the test set as the fraction of predicted grids that exactly match the ground truth.
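A minimal sketch of this exact-match metric (the function name is ours):

```python
import numpy as np

def task_accuracy(predictions, targets) -> float:
    """Fraction of predicted grids that exactly match the ground truth."""
    matches = [np.array_equal(p, t) for p, t in zip(predictions, targets)]
    return sum(matches) / len(matches)
```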
In order to experiment with ARC, we first performed an annotation of the dataset to identify the underlying knowledge priors for each task. To this end, we built a user interface where the annotator could browse the tasks and label them by selecting any combination of the available knowledge priors. Figure 7b shows the user interface provided to the annotator, whereas Figure 7a shows the distribution of knowledge priors across the ARC tasks. Most tasks fall into more than one of the categories represented in Figure 7a.
ARC can be regarded as a meta-learning benchmark, as it provides a set of training tasks and a set of unseen tasks to evaluate the performance of a model learned on the meta-training data. It is important to emphasize that we do not target this use case: we instead use the same setup as for the synthetic tasks and learn each task from scratch using only its own training set.
Table 4: Results of the experiment on robustness to noise
Though simple and elegant, the supervised-learning formulation prevents our models from reusing knowledge that could be shared between different tasks. To mitigate this issue, we rely on a data-augmentation strategy: at training time, for each model and at every iteration, we augment each grid 10 times by remapping each color to a different color (using the same mapping across training examples). The rationale behind this strategy is twofold: (1) we assume that tasks involving only geometric knowledge priors are unaffected by color remapping, and (2) all models (including LATFORMER) need to learn a function from d-dimensional color representations to categorical variables, so it is beneficial if all colors are represented in the training set.
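A minimal sketch of this augmentation, assuming grids are NumPy integer arrays with color values in {0, ..., n_colors - 1} (the function name and signature are ours):

```python
import numpy as np

def augment_colors(train_pairs, n_augment=10, n_colors=10, seed=0):
    """Augment a task by remapping colors with random permutations.

    The same permutation is applied to every grid of a given augmentation
    pass, so the geometric transformation relating inputs and outputs is
    preserved. Assumes color values in {0, ..., n_colors - 1}.
    """
    rng = np.random.default_rng(seed)
    augmented = list(train_pairs)
    for _ in range(n_augment):
        perm = rng.permutation(n_colors)              # one color mapping per pass
        augmented += [(perm[x], perm[y]) for x, y in train_pairs]
    return augmented
```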
All models are evaluated based on the ratio of solved tasks; a task is considered solved if the model predicts the correct output grid for all examples in its test set.
All baselines relying on program synthesis for the experiment on LARC are taken from the work of Acquaviva et al. (2021). They share an underlying formulation based on the generate-and-check strategy. The program synthesizer generates a program prog given a natural program natprog (defined either by the input-output pairs alone or by the input-output pairs together with the corresponding natural language description), sampling from the conditional distribution p(prog | natprog).
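In schematic form, the generate-and-check strategy can be summarized as follows; the `synthesizer.sample` interface and the sampling budget are hypothetical, not the authors' actual API.

```python
def generate_and_check(natprog, examples, synthesizer, budget=1000):
    """Generate-and-check (schematic): sample candidate programs from
    p(prog | natprog) and return the first one that is consistent with
    all input-output examples."""
    for _ in range(budget):
        prog = synthesizer.sample(natprog)      # draw prog ~ p(prog | natprog)
        if all(prog(x) == y for x, y in examples):
            return prog                         # program passes the check
    return None                                 # no consistent program found
```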
In order to assess the robustness of our method, we performed an additional experiment on the synthetic tasks, where we introduced noise into the inputs. Input-output grids in our tasks contain categorical values from 1 to 10. In the experiments of Section 5, we represented each categorical value using an embedding layer, which essentially applies a linear transformation to a one-hot encoding of the categorical value. Since we use one-hot vectors to represent categorical values, we apply the noise directly to the one-hot vectors.
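The exact noise equation from the paper is not reproduced here; the sketch below shows one plausible instance of the process, with additive Gaussian noise on the one-hot vectors (the noise distribution and the `sigma` parameter are our assumptions).

```python
import numpy as np

def noisy_one_hot(grid, n_colors=10, sigma=0.1, seed=0):
    """One-hot encode a categorical grid and perturb it with noise.

    ASSUMPTION: additive Gaussian noise with standard deviation `sigma`;
    the original experiment may use a different noise model.
    """
    rng = np.random.default_rng(seed)
    one_hot = np.eye(n_colors)[grid - 1]   # values 1..10 -> indices 0..9
    return one_hot + rng.normal(0.0, sigma, size=one_hot.shape)
```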
Table 5: Results of the experiment on image registration. The rows represent different models trained to translate images from modality A to B (A → B) or vice versa (B → A).
As an additional experiment, to assess the applicability of LATFORMER to natural images, we performed experiments on multimodal image registration, namely the problem of spatially aligning images from different modalities. Image registration is a well-studied problem in computer vision, and we do not aim to establish state-of-the-art performance; the main purpose of this experiment is to give a sense of the applicability of our method to natural images beyond ARC. We refer the reader to SuperGlue (Sarlin et al., 2020) and COTR (Jiang et al., 2021) for approaches specifically designed for this task.
Popular approaches to multimodal image registration work in two stages: first, they learn a model that translates one modality into the other (or maps both modalities into a shared representation, as proposed by Pielawski et al. (2020)); then, they align the two images using traditional techniques. We follow the experimental setup of Lu et al. (2021) and experiment with two datasets, one containing aerial views of an urban neighborhood and one containing cytological images. The images are views of the same scene, but they are acquired with different modalities and are translated with respect to one another. We use the authors' code to generate data involving only translations. Lu et al. (2021) additionally consider small rotations, but these transformations are not actions in the symmetry group of a lattice, so we do not attempt to resolve them.
We employ several state-of-the-art methods for modality translation and compare our method to α-AMD (Lindblad & Sladoje, 2014) and SIFT (Lowe, 1999) using the success-rate metric defined by Lu et al. (2021). A registration is considered successful if the relative registration error (i.e., the residual distance between the reference patch and the transformed patch after registration, normalized by the height and width of the patch) is below 2%. Table 5 reports our results on the image registration tasks and shows that our approach performs well on both datasets when coupled with different methods for modality translation. We use the same models as Lu et al. (2021) for the modality-translation stage. Then, to solve the image registration task with LATFORMER, we divide each image into 30 × 30 patches and run our model to predict the translation from a patch in one image to its counterpart in the corresponding image.
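For reference, a minimal sketch of this success-rate computation; the normalization by the patch diagonal and the function signature are our assumptions.

```python
import numpy as np

def success_rate(residuals_px, height, width, threshold=0.02):
    """Fraction of registrations whose relative error is below the threshold.

    `residuals_px` holds the residual distance (in pixels) between the
    reference patch and the transformed patch after registration. Errors
    are normalized by the patch size, here taken as the diagonal (our
    assumption, since the exact normalization is not fully specified).
    """
    rel = np.asarray(residuals_px) / np.hypot(height, width)
    return float(np.mean(rel < threshold))
```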