This paper is available on arXiv under CC 4.0 license.
Authors:
(1) Matúš Medo, Department for BioMedical Research, Inselspital, Bern University Hospital, University of Bern, Department of Radiation Oncology, Inselspital, Bern University Hospital, University of Bern and [email protected];
(2) Michaela Medova, Department for BioMedical Research, Inselspital, Bern University Hospital, University of Bern, and Department of Radiation Oncology, Inselspital, Bern University Hospital, University of Bern.
The broad range of tools for fitting mutational signatures makes it difficult to understand which tool to choose for a given project. In this work, we provide a comprehensive assessment of eleven different tools on synthetic mutational catalogs.
Using mutational catalogs where only one signature is active allows us to quantify the differences in the fitting difficulty of individual signatures. We find that flat signatures whose average similarity with other signatures is high are the most difficult to fit.
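The notion of a signature's average similarity with other signatures can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and uses three hypothetical four-context toy profiles (real single base substitution signatures have 96 contexts); neither the function names nor the numbers come from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-negative signature profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_similarity_to_others(signatures):
    """For each signature, average its cosine similarity with all other signatures."""
    result = {}
    for name, profile in signatures.items():
        others = [cosine_similarity(profile, other)
                  for other_name, other in signatures.items()
                  if other_name != name]
        result[name] = sum(others) / len(others)
    return result

# Hypothetical toy profiles: one flat signature and two sharply peaked ones.
toy_signatures = {
    "flat": [0.25, 0.25, 0.25, 0.25],
    "peaked_a": [0.97, 0.01, 0.01, 0.01],
    "peaked_b": [0.01, 0.01, 0.01, 0.97],
}
mean_sim = mean_similarity_to_others(toy_signatures)
```

In this toy setting the flat profile overlaps with both peaked profiles, so its average similarity is the highest of the three, which is the property associated above with signatures that are difficult to fit.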
To assess fitting tools, we use synthetic mutational catalogs whose signature activities are modeled based on eight distinct cancer types. We find that when the number of mutations is small (100 mutations per sample), SigProfilerSingleSample is the best tool to use for all cancer types.
As the number of mutations increases, SigProfilerAssignment becomes the best tool for all cancer types, although mmsig is the best tool for some cancer types at an intermediate number of mutations (2,000 mutations per sample). The results change little when Pearson correlation is used instead of fitting error for the evaluation (Figs. S22 and S23 in SI).
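The two evaluation metrics can be compared on a single sample with a short sketch. The definition of fitting error used here (half the sum of absolute differences between true and estimated relative weights) is an assumption made for illustration, since this section does not restate the definition, and the weight vectors are hypothetical.

```python
import math

def fitting_error(true_weights, estimated_weights):
    """Assumed definition: half the L1 distance between relative weight vectors."""
    return 0.5 * sum(abs(t - e) for t, e in zip(true_weights, estimated_weights))

def pearson_correlation(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical relative weights of four signatures in one sample.
true_w = [0.6, 0.3, 0.1, 0.0]
est_w = [0.5, 0.3, 0.1, 0.1]  # one absent signature receives a false positive weight

error = fitting_error(true_w, est_w)
correlation = pearson_correlation(true_w, est_w)
```

Under the assumed definition, fitting error is 0 for a perfect fit and 1 when the two weight vectors do not overlap at all, whereas Pearson correlation measures only how well the estimated weights track the true ones.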
The risk of overfitting the data by including too many signatures in the reference catalog is often discussed in the literature (Maura et al., 2019, Koh et al., 2021, Degasperi et al., 2022). We show that three methods (SigProfilerSingleSample, SigProfilerExtractor, and mmsig) are robust to such overfitting when the number of mutations per sample is not too small (Figs. S12 and S13 in SI).
Crucially, we find that the common practice of excluding signatures from the reference catalog because they seem to be inactive (or only weakly active) in the analyzed data brings little benefit or is even harmful, unless the number of mutations per sample is small (Fig. S16 in SI).
In most cases, it is preferable to use the complete COSMIC catalog as a reference and let the built-in statistical methods of these tools decide which signatures are active.
While our work gives clear recommendations on which tools to consider and which to avoid, many issues can be addressed in the future to further improve the fitting of mutational signatures. First, a similar assessment of fitting tools can be done for other types of signatures: doublet base substitutions, small insertions and deletions, and copy number variations. Then, some tools (e.g., sigfit and mmsig) can compute confidence intervals for their estimates.
For sigfit, we used these estimates in the way recommended by the authors: When the lower bound of the confidence interval for the relative signature weight is below 0.01, the signature is marked as absent in the sample (its relative weight is set to zero). Our results show that this recommended practice indeed improves the results achieved by sigfit.
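The thresholding rule described above can be sketched in a few lines. The function name, the dictionary layout, and the example numbers are hypothetical; only the 0.01 threshold on the lower bound comes from the text, and the sketch does not call sigfit itself.

```python
def apply_lower_bound_rule(weights, lower_bounds, threshold=0.01):
    """Set a signature's relative weight to zero when the lower bound of its
    confidence interval is below the threshold; keep other weights unchanged."""
    return {
        name: (weight if lower_bounds[name] >= threshold else 0.0)
        for name, weight in weights.items()
    }

# Hypothetical point estimates and lower confidence bounds for one sample.
estimates = {"SBS1": 0.55, "SBS5": 0.40, "SBS40": 0.05}
lower = {"SBS1": 0.48, "SBS5": 0.31, "SBS40": 0.002}

filtered = apply_lower_bound_rule(estimates, lower)
```

The sketch deliberately leaves the remaining weights as they are; whether they should be renormalized afterwards is a separate choice that the text does not specify.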
By bootstrapping (resampling the original mutational catalogs with replacement), confidence intervals could also be computed for the tools that do not provide them. It would be worth investigating whether confidence intervals determined in this way could also improve the performance of other fitting tools.
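A minimal sketch of such a bootstrap is given below. It assumes a percentile bootstrap and, as a stand-in for a real fitting tool, a simple expectation-maximization estimate of mixture weights; the function names, the two four-context signatures, and the catalog are all hypothetical.

```python
import random

def fit_weights(counts, signatures, iterations=100):
    """Stand-in fitter: EM estimate of relative signature weights for one sample."""
    k = len(signatures)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        updated = [0.0] * k
        for context, n in enumerate(counts):
            mix = sum(weights[j] * signatures[j][context] for j in range(k))
            if n == 0 or mix == 0.0:
                continue
            for j in range(k):
                updated[j] += n * weights[j] * signatures[j][context] / mix
        total = sum(updated)
        weights = [u / total for u in updated]
    return weights

def bootstrap_intervals(counts, signatures, n_boot=200, alpha=0.05, seed=0):
    """Percentile confidence intervals from resampling mutations with replacement."""
    rng = random.Random(seed)
    contexts = list(range(len(counts)))
    n_mutations = sum(counts)
    fits = []
    for _ in range(n_boot):
        drawn = rng.choices(contexts, weights=counts, k=n_mutations)
        resampled = [0] * len(counts)
        for context in drawn:
            resampled[context] += 1
        fits.append(fit_weights(resampled, signatures))
    intervals = []
    for j in range(len(signatures)):
        values = sorted(fit[j] for fit in fits)
        low = values[int((alpha / 2) * (n_boot - 1))]
        high = values[int((1 - alpha / 2) * (n_boot - 1))]
        intervals.append((low, high))
    return intervals

# Hypothetical sample: 200 mutations mixed 80:20 from two toy signatures.
toy_signatures = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
toy_counts = [116, 20, 20, 44]

point_estimate = fit_weights(toy_counts, toy_signatures)
intervals = bootstrap_intervals(toy_counts, toy_signatures)
```

Replacing `fit_weights` with a call to an actual fitting tool would give intervals for that tool, at the cost of running it once per bootstrap replicate.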
To fit mutational signatures, it is commonly required that each sample has at least 50 (Riva et al., 2020) or 200 (Blokzijl et al., 2018) single base substitutions. However, signatures differ in how difficult they are to fit (Figure 1), so it is unlikely that such a universal threshold is meaningful.
Based on the simulation framework established here, it would be possible to study the minimum necessary number of mutations for different signatures of interest and different cancer types.
To best match the needs of practitioners, it would be possible to extend the simulation framework so that it finds the best-performing fitting tool for a given cancer type, a list of signatures of interest, and a given distribution of sample mutation counts.
When an active signature is missing from the reference catalog because it is new or is falsely deemed inactive in a given set of samples, most fitting tools cannot cope with this situation and distribute all (or nearly all) mutations among signatures from the reference set (Fig. S15 in SI).
Three tools do not suffer from this drawback—deconstructSigs, signature.tools.lib, and sigfit—but their performance is generally weak.
One could aim to enhance one of the best-performing tools, such as SigProfilerAssignment, with an additional step that would quantify whether the given signatures are indeed likely to produce the given mutational catalogs. Another approach to this problem is offered by Maura et al. (2019), who suggest a multistep process to simultaneously avoid under- and overfitting.
Mutational signatures are first extracted de novo, then they are assigned to existing signatures from a reference catalog, and the thus-determined active signatures are finally used to fit the original input data. While there are well-defined statistical principles for the first and third steps of this process, the second assignment step is largely ad hoc. A more principled approach could improve the outcome.
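To make the ad hoc nature of the assignment step concrete, the sketch below shows one common heuristic: each extracted signature is matched to its most similar reference signature and discarded if the similarity is too low. The use of cosine similarity, the 0.8 cutoff, and all names and profiles are assumptions made for illustration, not the procedure of Maura et al. (2019).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-negative signature profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_to_reference(de_novo, reference, min_similarity=0.8):
    """Map each extracted signature to its closest reference signature,
    or to None when no reference signature is similar enough."""
    assignments = {}
    for name, profile in de_novo.items():
        best_name, best_similarity = max(
            ((ref_name, cosine_similarity(profile, ref_profile))
             for ref_name, ref_profile in reference.items()),
            key=lambda pair: pair[1],
        )
        assignments[name] = best_name if best_similarity >= min_similarity else None
    return assignments

# Hypothetical four-context reference catalog and extracted signatures.
reference = {"ref_A": [0.7, 0.1, 0.1, 0.1], "ref_B": [0.1, 0.1, 0.1, 0.7]}
de_novo = {
    "extracted_1": [0.65, 0.15, 0.10, 0.10],
    "extracted_2": [0.25, 0.25, 0.25, 0.25],
}
assignments = assign_to_reference(de_novo, reference)
```

Both the similarity measure and the cutoff are free choices in this heuristic, and the resulting set of active signatures can change with either of them.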
More generally, COSMIC signatures are the result of analyzing tumor tissues from many different organs, so they can be viewed as an average across them (Degasperi et al., 2022). It might be more appropriate to construct organ-specific reference catalogs that would better reflect mutational processes that occur in these biologically diverse systems. The fast-growing number of sequenced samples would make such an effort possible.