This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Matúš Medo, Department for BioMedical Research, Inselspital, Bern University Hospital, University of Bern, Department of Radiation Oncology, Inselspital, Bern University Hospital, University of Bern and [email protected];
(2) Michaela Medova, Department for BioMedical Research, Inselspital, Bern University Hospital, University of Bern, and Department of Radiation Oncology, Inselspital, Bern University Hospital, University of Bern.
Mutational signatures connect characteristic mutational patterns in the genome with biological processes that take place in the tumor tissues. Analysis of mutational signatures can help elucidate tumor evolution, prognosis, and therapeutic strategies.
Although tools for extracting mutational signatures de novo have been extensively benchmarked, a similar effort is lacking for tools that fit known mutational signatures to a given catalog of mutations. We fill this gap by comprehensively evaluating eleven signature fitting tools (well-established as well as recent) on synthetic input data.
To create realistic input data, we use empirical signature weights in tumor tissue samples from the COSMIC database. The study design allows us to assess the effects of the number of mutations, type of cancer, and the catalog of reference signatures on the results obtained with various fitting tools. We find substantial performance differences between the evaluated tools.
Averaged over 120,000 simulated mutational catalogs corresponding to eight different cancer types, SigProfilerSingleSample and SigProfilerAssignment perform best for small and large numbers of mutations per sample, respectively.
We further show that ad hoc constraining the list of reference signatures is likely to produce inferior results and that noisy estimates of signature weights in samples with as few as 100 mutations can still be useful in downstream analysis.
Keywords: Cancer genomics, statistical methods, mutational signatures, computational models
Since their introduction a decade ago by Nik-Zainal et al. (2012) and Alexandrov et al. (2013), mutational signatures have become a widely used tool in genomics (Maura et al., 2019, Koh et al., 2021).
They allow researchers to move from individual mutations in the genome to biological processes that take place in living tissues (Kim et al., 2021). For example, Cannataro et al. (2022) used signature activities to attribute mutations to endogenous, exogenous, and preventable mutational processes.
The activity of various mutational signatures can also serve as a prognostic or therapeutic biomarker (Van Hoeck et al., 2019, Brady et al., 2022). For example, homologous recombination deficiency leads to the accumulation of DNA damage and manifests itself in a specific mutational signature (signature SBS3 from the COSMIC catalog) (Nik-Zainal et al., 2016, Gulhan et al., 2019).
Mutational signatures can be introduced for single base substitutions (SBS), doublet base substitutions (DBS), small insertions and deletions (ID), as well as for copy number variations (CN).
We focus here on SBS-based mutational signatures which are most commonly used in the literature. Current SBS signatures are defined using 6 possible classes of substitutions (C>A, C>G, C>T, T>A, T>C, and T>G) together with their two immediate neighboring bases, thus giving rise to 6 × 4 × 4 = 96 different nucleotide contexts into which all SBS mutations in a given sample are classified.
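The 6 × 4 × 4 = 96 classification can be made concrete with a short enumeration. The following is a minimal sketch, not code from the paper: the label format `A[C>A]A` follows common COSMIC-style notation, and the ordering of the contexts and the function name `sbs96_contexts` are assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"
# The six substitution classes, written from the pyrimidine (C or T) strand.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def sbs96_contexts():
    """Return the 96 SBS context labels, e.g. 'A[C>A]A'.

    Each label combines one substitution class with the base
    immediately before it and the base immediately after it.
    The ordering used here is an illustrative assumption.
    """
    return [
        f"{left}[{sub}]{right}"
        for sub, left, right in product(SUBSTITUTIONS, BASES, BASES)
    ]


contexts = sbs96_contexts()
print(len(contexts))  # 96
print(contexts[:3])   # ['A[C>A]A', 'A[C>A]C', 'A[C>A]G']
```

A mutational catalog for one sample is then simply a vector of 96 counts, one per context.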
De novo extraction of signatures from somatic mutations in sequenced samples has been used to gradually map the landscape of mutational signatures in cancer tissues. Over time, the initial catalogue of 22 SBS-based mutational signatures in the first version of the Catalogue Of Somatic Mutations In Cancer (COSMIC) released in August 2013 has expanded to 80 signatures in the COSMICv3.3 version released in June 2022.
This expansion was possible owing to the increased availability of whole-exome (WES) and whole-genome (WGS) sequencing data as well as improved tools for signature extraction. Extensive benchmarking of extraction tools on synthetic data has recently shown that SigProfilerExtractor outperforms other methods in terms of sensitivity, precision, and false discovery rate, particularly in cohorts with more than 20 active signatures (Islam et al., 2022).
A two-step process consisting of first extracting common signatures and then rare signatures was recently recommended by Degasperi et al. (2022).
Nevertheless, with the existing signatures extracted from cohorts including thousands of sequenced samples (Alexandrov et al., 2020, Islam et al., 2022, Degasperi et al., 2022), the authors of smaller studies are unlikely to discover new signatures.
For example, the analysis of more than 23,000 WGS and WES cancers by Islam et al. (2022) discovered only four new signatures. The much more relevant task is thus that of fitting existing signatures to given sequenced samples. In this process, the catalogs of somatic mutations are used to determine the signature contributions to each individual sample.
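To illustrate what fitting means in practice, the sketch below estimates relative signature weights for one sample with a plain expectation-maximization update for a multinomial mixture. This is a generic textbook technique chosen for brevity; it is not the algorithm of any of the evaluated tools. The two four-context toy signatures and the function name `fit_signatures` are hypothetical; real SBS signatures have 96 contexts.

```python
def fit_signatures(catalog, signatures, n_iter=200):
    """Estimate relative signature weights by EM for a multinomial mixture.

    catalog:    list of mutation counts, one per context.
    signatures: list of signatures; each is a list of context
                probabilities that sums to 1.
    Returns a list of non-negative weights that sums to 1.
    """
    total = sum(catalog)
    n_contexts = len(catalog)
    k = len(signatures)
    weights = [1.0 / k] * k  # start from equal weights
    for _ in range(n_iter):
        # Expected mutation profile under the current weights.
        mix = [
            sum(w * sig[c] for w, sig in zip(weights, signatures))
            for c in range(n_contexts)
        ]
        # Reassign each observed mutation to signatures in proportion
        # to their current contribution to its context.
        weights = [
            w * sum(
                catalog[c] * sig[c] / mix[c]
                for c in range(n_contexts)
                if mix[c] > 0
            ) / total
            for w, sig in zip(weights, signatures)
        ]
    return weights


# Hypothetical toy signatures over four contexts (not COSMIC signatures).
sig_a = [0.7, 0.1, 0.1, 0.1]
sig_b = [0.1, 0.2, 0.3, 0.4]

# Noise-free catalog of 1,000 mutations built as 25% sig_a + 75% sig_b.
catalog = [250, 175, 250, 325]

weights = fit_signatures(catalog, [sig_a, sig_b])
print([round(w, 3) for w in weights])  # approximately [0.25, 0.75]
```

With real data the catalog contains sampling noise and the reference catalog contains dozens of partly similar signatures, which is where the fitting tools start to differ.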
Many different tools have been developed for this task (Koh et al., 2021) (see Methods for their description and classification). However, while tools for extracting mutational signatures have been recently extensively benchmarked on synthetic data by various studies (Omichessan et al., 2019, Alexandrov et al., 2020, Islam et al., 2022, Wu et al., 2022), a similar quantitative comparison is lacking for tools for fitting mutational signatures.
This need is made more pressing by substantial variations between results obtained by different methods (Pandey et al., 2022). Furthermore, newly introduced tools for fitting mutational signatures are commonly compared with only a few existing tools, rarely the most recent ones, and a standardized comparison methodology is lacking.
In this study, our aim is to fill this gap and provide a comprehensive evaluation of a broad range of fitting tools on synthetic data motivated by various types of cancer.
We restrict ourselves to fitting SBS signatures as they are the most widely used signature type. Many of the included tools can nevertheless be used for other types of signatures, as their fitting is mathematically no different from the fitting of SBS signatures.
We generate two classes of synthetic mutational catalogs. In the first one, only one mutational signature is responsible for all mutations (single-signature cohorts).
In the second one, signature activities in each sample are modeled after the activities found in real tumor samples from various cancers (heterogeneous cohorts). We have collected eleven tools for fitting mutational signatures, from earlier tools such as deconstructSigs to very recent ones such as SigProfilerAssignment.
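The two classes of synthetic catalogs can be sketched as follows. This is an illustration under stated assumptions, not the paper's generation code: each mutation is drawn independently from the weighted mixture of signatures, the four-context signatures are toy stand-ins for the 96-context COSMIC signatures, and the heterogeneous weights are made-up numbers rather than empirical weights from COSMIC.

```python
import random
from collections import Counter


def sample_catalog(signatures, weights, n_mutations, rng):
    """Draw one synthetic mutational catalog (counts per context).

    signatures:  list of signatures, each a list of context
                 probabilities that sums to 1.
    weights:     relative signature activities, summing to 1.
    n_mutations: total number of mutations in the sample.
    rng:         a random.Random instance, for reproducibility.
    """
    n_contexts = len(signatures[0])
    # Context probabilities of the weighted signature mixture.
    mix = [
        sum(w * sig[c] for w, sig in zip(weights, signatures))
        for c in range(n_contexts)
    ]
    draws = rng.choices(range(n_contexts), weights=mix, k=n_mutations)
    counts = Counter(draws)
    return [counts.get(c, 0) for c in range(n_contexts)]


# Hypothetical toy signatures over four contexts (not COSMIC signatures).
sig_a = [0.7, 0.1, 0.1, 0.1]
sig_b = [0.1, 0.2, 0.3, 0.4]
rng = random.Random(0)

# Single-signature sample: all mutations come from sig_a.
single = sample_catalog([sig_a, sig_b], [1.0, 0.0], 100, rng)

# Heterogeneous sample: made-up activities standing in for
# empirical weights observed in real tumor samples.
mixed = sample_catalog([sig_a, sig_b], [0.25, 0.75], 100, rng)

print(sum(single), sum(mixed))  # 100 100
```

Because the true weights used for generation are known, any fitting tool run on such catalogs can be scored against them.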
We assess their performance by comparing the known true signature activities in the created catalogs with results obtained by various fitting tools. We find that there is no single fitting tool that performs best regardless of how many mutations are in the samples and which cancer type is chosen to model signature activities.
Averaged across all cancer types, two related tools (SigProfilerSingleSample and SigProfilerAssignment) perform best when the number of mutations per sample is small (below 1000, roughly) and large (above 1000), respectively.
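The comparison between known true activities and fitted activities can be summarized by a single number per sample. The sketch below uses half the L1 distance between the two relative-weight vectors, which is one plausible choice and an assumption here; the metric actually used in this study is the one defined in Methods. The function name `fitting_error` and the example weights are hypothetical.

```python
def fitting_error(true_weights, estimated_weights):
    """Half the L1 distance between two relative-weight vectors.

    When both vectors sum to 1, the result lies between 0 (perfect
    agreement) and 1 (no mutations attributed to a correct signature).
    """
    return 0.5 * sum(
        abs(t - e) for t, e in zip(true_weights, estimated_weights)
    )


# Hypothetical example: the third signature is absent from the sample,
# but the fit wrongly assigns 10% of the mutations to it.
true_w = [0.25, 0.75, 0.0]
fitted_w = [0.20, 0.70, 0.10]

error = fitting_error(true_w, fitted_w)
print(round(error, 3))  # 0.1
```

Averaging such per-sample errors over a cohort gives one score per tool, cancer type, and mutation count.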
We also compare how prone the tools are to overfitting as the size of the reference signature catalog increases, and we evaluate whether it is beneficial to constrain the reference catalog based on which signatures seem to be absent or only weakly active in the analyzed samples. We close with a discussion of open problems in fitting mutational signatures.
Figure 1: The effect of the number of mutations in single-signature cohorts. (a, b) The bars show the 95th percentile range of the observed fraction of mutations in synthetic data for SBS mutational contexts (horizontal axis) and different total mutation counts (100, 1,000 and 10,000 represented with red, blue and gray bars, respectively).
Panels a and b show signature SBS1 with four distinct C>T peaks and signature SBS5 with a flat mutational spectrum, respectively. (c) The fitting errors of the evaluated fitting tools for synthetic data corresponding to 100 samples with 100% contributions of a single signature; the catalog of all 67 COSMICv3 signatures was used for fitting.
We show results for ten common signatures (horizontal axis) and various mutation counts (vertical axis). The tools are ordered by the error averaged over the displayed signatures achieved for 51,200 mutations.