(1) Vojtěch Čermák, Czech Technical University in Prague (Email: [email protected]);
(2) Lukas Picek, University of West Bohemia & INRIA (Email: [email protected]/[email protected]);
(3) Lukáš Adam, Czech Technical University in Prague (Email: [email protected]);
(4) Kostas Papafitsoros, Queen Mary University of London (Email: [email protected]).
As in other fields, the development of methods and datasets for automated animal re-identification has been influenced by progress in machine learning. Many studies now exist, but differences in their approaches, prediction outputs, and evaluation methodologies result in several drawbacks.
First, methods are usually inspired by trends in machine learning rather than being motivated by real-world re-identification scenarios. A prominent example is performing classification on a closed set, which is typical for benchmarking in deep learning but is, in general, not realistic in ecology, as new individuals are constantly being recruited to populations.
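The difference between the closed-set and open-set settings can be illustrated with a minimal sketch. The function names, the similarity scores, and the threshold value below are hypothetical and chosen only for illustration; they are not taken from any specific method discussed here.

```python
def predict_closed_set(scores):
    """Closed set: always return the best-scoring known identity."""
    return max(scores, key=scores.get)


def predict_open_set(scores, threshold=0.5, unknown_label="new_individual"):
    """Open set: return the best known identity only if its score is
    high enough; otherwise flag the sample as a new individual.

    The threshold is an assumed tuning parameter, not a recommended value.
    """
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else unknown_label


# Hypothetical similarity scores of one query image to two known individuals.
scores = {"turtle_a": 0.31, "turtle_b": 0.27}

closed = predict_closed_set(scores)  # forced to pick a known identity
opened = predict_open_set(scores)    # may report a new individual
```

With these toy scores, the closed-set rule assigns the query to a known individual even though neither score is convincing, whereas the open-set rule can report a previously unseen individual.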
Second, many studies focus on a single dataset and develop species-specific methods evaluated on the given dataset rather than on a family of datasets [6, 10, 20, 25, 31, 52], making reproducibility, transferability, and generalization challenging.
Third, datasets are poorly curated and usually include unwanted training-to-test data leakage, which leads to inflated performance expectations.
All this leads to the repetition of poor practices both in dataset curation and method design. As such, much of the current research suffers from a lack of unification, which, we argue, constitutes an obstacle to further development, evaluation, and applications to real-world situations.
There are three primary approaches commonly used for wildlife re-identification: (i) local descriptors [9, 21, 43], (ii) deep descriptors [12, 16, 31, 34, 49], and (iii) species-specific methods [6, 10, 25, 29, 52].
Local-feature-based methods find unique keypoints and extract their local descriptors for matching. The matching is usually done on a database of known identities, i.e., for each given image sample, an identity with the highest number of descriptor matches is retrieved. The most significant benefit of these methods is their plug-and-play nature, without any need for fine-tuning, which makes them comparable in a zero-shot setting to large foundation models, such as CLIP [42] or DINOv2 [37], etc.
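The retrieval rule described above (return the identity with the highest number of descriptor matches) can be sketched as follows. This is a simplified illustration on toy 2-D descriptors: the function names, the distance threshold `max_dist`, and the data are assumptions, and real systems would first extract SIFT-like descriptors and typically apply more careful match filtering.

```python
import math


def count_matches(query_descs, db_descs, max_dist=0.5):
    """Count query descriptors whose nearest database descriptor
    lies within max_dist (an assumed matching threshold)."""
    if not db_descs:
        return 0
    matches = 0
    for q in query_descs:
        nearest = min(math.dist(q, d) for d in db_descs)
        if nearest <= max_dist:
            matches += 1
    return matches


def identify(query_descs, database, max_dist=0.5):
    """Return the identity with the most descriptor matches.

    database maps each known identity to its pooled list of descriptors.
    """
    counts = {
        identity: count_matches(query_descs, descs, max_dist)
        for identity, descs in database.items()
    }
    best = max(counts, key=counts.get)
    return best, counts


# Toy database: two known individuals with three descriptors each.
database = {
    "zebra_01": [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)],
    "zebra_02": [(5.0, 5.0), (6.0, 5.0), (5.0, 7.0)],
}
query = [(0.1, 0.0), (1.0, 0.9), (5.1, 5.0)]

best_identity, match_counts = identify(query, database)
```

The brute-force nearest-neighbour search in this sketch compares every query descriptor with every database descriptor, so its cost grows with the size of the database.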
Even though approaches based on SIFT, SURF, or ORB descriptors are limited both in their performance and in how efficiently they scale to larger datasets, all available software products, e.g., WildID [11], HotSpotter [15], and I³S, are local-feature-based. Despite these limitations, those systems are popular among ecological researchers without a comprehensive technical background and find a wide range of applications, most likely due to their intuitive graphical user interfaces (GUIs).
Deep-feature-based approaches rely on a vector representation of the image learned by optimizing a deep neural network. As in local-feature-based methods, the resulting deep embedding vector (usually 1024- or 2048-dimensional) is matched against an identity database.
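Matching an embedding against an identity database can be sketched as a nearest-neighbour search. The example below assumes cosine similarity as the matching score, which is a common choice but is not specified in the text; the 4-dimensional vectors stand in for real 1024- or 2048-dimensional embeddings, and all names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_identity(query_embedding, database):
    """Return the identity of the most similar database embedding.

    database is a list of (identity, embedding) pairs.
    """
    best_identity, best_score = None, -math.inf
    for identity, embedding in database:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_identity, best_score = identity, score
    return best_identity, best_score


# Toy embeddings; a real model would produce these from images.
database = [
    ("lynx_a", [1.0, 0.0, 0.0, 0.0]),
    ("lynx_b", [0.0, 1.0, 0.0, 0.0]),
]
query = [0.9, 0.1, 0.0, 0.0]

identity, score = retrieve_identity(query, database)
```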
Applying deep learning to wildlife re-identification bears similarities to human or vehicle re-identification, so similar methods can be easily repurposed. However, deep learning requires fine-tuning models on the specific target domain, i.e., species, which makes the model's performance dependent on the species it was fine-tuned for. Another approach is to use publicly available foundation models pre-trained on large datasets (e.g., CLIP [42] and DINOv2 [37]). These models are primarily designed for general computer vision tasks; therefore, they are neither adapted to nor tested on the nuances of wildlife re-identification, which relies heavily on fine-grained features.
Species-specific methods are tailored to an individual species or groups of closely related species, particularly those with visually distinct patterns. These methods typically focus on visual characteristics unique to the target species, restricting their applicability beyond the species they were developed for. Moreover, they often entail substantial manual preprocessing steps, such as extracting patches from regions of interest or accurately aligning compared images. For instance, one such approach involves employing Chamfer distance to measure the distance between greyscale patterns in polar bear whiskers [6]. Other examples include computing correlation between aligned patches derived from cheetah spots [29] or similarity between two images based on the count of matching pixels within newt patterns [20].
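To make the Chamfer distance concrete, the following sketch computes a symmetric, averaged variant between two 2-D point sets, such as the coordinates of pattern points extracted from two aligned images. Several definitions of Chamfer distance are in use, and the exact variant and preprocessing in [6] may differ; the function names and toy points are assumptions made for illustration.

```python
import math


def directed_chamfer(source, target):
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: average of the two directed terms."""
    return 0.5 * (directed_chamfer(a, b) + directed_chamfer(b, a))


# Toy point sets standing in for pattern points from two images.
pattern_a = [(0.0, 0.0), (1.0, 0.0)]
pattern_b = [(0.0, 0.0), (1.0, 1.0)]

distance = chamfer_distance(pattern_a, pattern_b)
```

A smaller distance indicates more similar patterns, and identical point sets yield a distance of zero. The measure is only meaningful if the two patterns are aligned beforehand, which is one reason such methods require manual preprocessing.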
This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.