This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Matúš Medo, Department for BioMedical Research, Inselspital, Bern University Hospital, University of Bern, and Department of Radiation Oncology, Inselspital, Bern University Hospital, University of Bern ([email protected]);
(2) Michaela Medova, Department for BioMedical Research, Inselspital, Bern University Hospital, University of Bern, and Department of Radiation Oncology, Inselspital, Bern University Hospital, University of Bern.
Mutational signatures based on single base substitutions are defined using the mutated base and its two neighboring bases. The total number of different “neighborhoods” (nucleotide contexts) to which each individual SBS can be attributed is 96 (6 different possible substitutions times four possible 5′ neighbors times four possible 3′ neighbors).
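The count of 96 contexts can be checked by direct enumeration. The sketch below assumes the common convention of writing each substitution with the pyrimidine (C or T) as the reference base and labeling contexts as `A[C>T]G`; the function name `sbs_contexts` is ours and purely illustrative.

```python
from itertools import product

BASES = "ACGT"
# Six substitution classes, written with the pyrimidine (C or T) as the
# reference base (assumed labeling convention).
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def sbs_contexts():
    """Return all trinucleotide contexts as labels such as 'A[C>T]G'."""
    return [
        f"{five}[{sub}]{three}"
        for sub, five, three in product(SUBSTITUTIONS, BASES, BASES)
    ]


contexts = sbs_contexts()
print(len(contexts))  # 6 substitutions x 4 five-prime x 4 three-prime = 96
```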
This high number of contexts allows for a fine-grained classification of mutations and a detailed differentiation of many different mutational processes. On the other hand, it is a source of considerable sampling noise when the total number of mutations is small.
This is illustrated in Figure 1a,b which shows the fraction of mutations in different contexts for two well-known signatures: Signature SBS1 with four distinct peaks among the C>T mutational contexts and signature SBS5 that lacks such peaks.
While the peaks of SBS1 are clearly distinguished with as few as 100 mutations, the relative variations are much greater for SBS5 where the same number of mutations is effectively distributed among a larger number of contexts. As we will see, these variations then cause problems for signature fitting tools.
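The effect of profile shape on sampling noise can be illustrated with a small simulation. This is a minimal sketch using two hypothetical profiles, not the real COSMIC signatures: a "peaked" profile where four contexts hold 90% of the probability (loosely SBS1-like) and a uniform "flat" profile (an extreme version of SBS5). The noise measure (mean L1 distance between sampled fractions and the true profile) is also our choice for illustration.

```python
import random

N_CONTEXTS = 96


def sampling_noise(profile, n_mutations, n_samples, rng):
    """Mean L1 distance between sampled context fractions and the true profile."""
    total = 0.0
    indices = range(len(profile))
    for _ in range(n_samples):
        counts = [0] * len(profile)
        # Each mutation falls into one context with probability given by the profile.
        for i in rng.choices(indices, weights=profile, k=n_mutations):
            counts[i] += 1
        total += sum(abs(c / n_mutations - p) for c, p in zip(counts, profile))
    return total / n_samples


# Hypothetical stand-ins, NOT the real SBS1 and SBS5 profiles.
peaked = [0.9 / 4 if i < 4 else 0.1 / (N_CONTEXTS - 4) for i in range(N_CONTEXTS)]
flat = [1.0 / N_CONTEXTS] * N_CONTEXTS

rng = random.Random(0)
noise_peaked = sampling_noise(peaked, 100, 200, rng)
noise_flat = sampling_noise(flat, 100, 200, rng)
print(f"peaked: {noise_peaked:.2f}, flat: {noise_flat:.2f}")
```

With 100 mutations per sample, the flat profile deviates from its true shape considerably more than the peaked one, which mirrors the qualitative difference between SBS5 and SBS1 described above.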
We first evaluate the performance of signature fitting methods in a scenario where all samples have 100% contributions of one given signature (single-signature cohorts). This scenario is clearly unrealistic: Depending on the type of cancer, the number of contributing signatures is 1–11 in individual samples and 5–22 in the cohorts available on the COSMIC website https://cancer.sanger.ac.uk/signatures/sbs/.
Nevertheless, single-signature cohorts allow us to quantify substantial differences between the signatures as well as between signature-fitting tools. Figure 1c confirms our previous hypothesis by showing that SBS5 is one of the most difficult signatures to fit for all fitting tools and all numbers of mutations, followed by SBS40.
Beyond substantial differences between the signatures, Figure 1c reveals numerous differences between the fitting tools. For 50 and 100 mutations, for example, the lowest error is not always obtained by the same tool. While sigLASSO, SigProfilerSingleSample, and SigProfilerAssignment are among the best tools for “difficult” signatures SBS3 and SBS5, they fall short for “easy” signatures such as SBS1 and SBS2.
Figure 2: A comparison of signature fitting methods for single-signature cohorts. Mean fitting error (top row) and mean total weight assigned to false positive signatures (bottom row) are shown as a function of the number of mutations for three difficult signatures (SBS 3–5) and averaged over 49 non-artifact signatures from COSMICv3.
Solid lines and shaded areas mark mean values and standard errors of the means, respectively, obtained from synthetic cohorts with 100 samples where one signature contributed all mutations. The catalog of all 67 COSMICv3 signatures was used for fitting by all tools.
To understand which signature properties determine how difficult the signatures are to fit, we compute their exponentiated Shannon index and find that it correlates highly (Spearman’s rho 0.77–0.83, depending on the number of mutations) with the average fitting error achieved by the evaluated fitting tools (Fig. S1 in Supporting Information, SI).
The correlation further improves (Spearman’s rho 0.86–0.90) when the exponentiated Shannon index is multiplied with a measure of signature similarity with other signatures.
We can conclude that while the fitting difficulty of a signature is mainly determined by the flatness of its profile (as measured by the Shannon index), the level of difference from the other signatures contributes as well.
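The exponentiated Shannon index can be read as the effective number of contexts over which a signature spreads its mutations. The sketch below assumes the standard definition, exp(−Σ p ln p), with the natural logarithm; the function name and the two toy profiles are illustrative only.

```python
import math


def exp_shannon(profile):
    """Exponentiated Shannon index, exp(-sum p ln p), of a normalized profile."""
    entropy = -sum(p * math.log(p) for p in profile if p > 0)
    return math.exp(entropy)


# Toy profiles over 96 contexts (not real signatures).
uniform = [1.0 / 96] * 96          # perfectly flat: all 96 contexts are used
single_peak = [1.0] + [0.0] * 95   # all mutations in a single context

print(exp_shannon(uniform), exp_shannon(single_peak))
```

A perfectly flat profile reaches the maximum value of 96 and a single-context profile the minimum of 1, so higher values correspond to flatter signatures.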
Figure 2 provides a more quantitative view of the fitting error (top row) as well as the total weight assigned to the signatures that are absent in the samples (False positive weight in the bottom row).
We show here the results for three difficult signatures that are all important for various reasons—SBS3 is related to DNA damage repair, SBS4 is associated with tobacco smoking, and SBS5 is present in virtually all samples—and the average over all non-artifact signatures (see also Fig. S2 in SI).
The error bars in the last column are much wider than in the first three columns; this shows that the result variation between signatures is large compared to the variation between samples characterized by the same signature.
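The false positive weight described above can be written down in a few lines. This is a sketch of the definition as stated in the text (total estimated weight on signatures absent from the sample); the dictionary-based data layout and the example numbers are hypothetical.

```python
def false_positive_weight(true_weights, estimated_weights):
    """Total estimated weight assigned to signatures absent from the sample."""
    return sum(
        weight
        for name, weight in estimated_weights.items()
        if true_weights.get(name, 0.0) == 0.0
    )


# Hypothetical single-signature sample: only SBS1 is truly active.
true_w = {"SBS1": 1.0}
estimated_w = {"SBS1": 0.80, "SBS5": 0.15, "SBS40": 0.05}
fp_weight = false_positive_weight(true_w, estimated_w)
print(round(fp_weight, 3))
```

A tool that leaves part of the mutations unassigned has estimated weights summing to less than one, which lowers this quantity without necessarily lowering the fitting error.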
Although all methods improve their fitting error with the number of mutations, $m$ (as shown in Fig. S3 in SI, MAE decreases approximately as $m^{-\beta}$ with $\beta$ close to 0.5 for all fitting tools), there are large differences.
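An exponent of this kind is typically estimated as the negative slope of a straight-line fit on a log-log plot. The sketch below shows that procedure on synthetic error values generated to follow an exact inverse square-root law; these are not results from the paper, and the function name is ours.

```python
import math


def fit_power_law_exponent(ms, errors):
    """Return beta from a least-squares fit of log(error) = a - beta * log(m)."""
    xs = [math.log(m) for m in ms]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return -covariance / variance


# Synthetic illustration only: errors that decay exactly as m^(-1/2).
ms = [100, 400, 1600, 6400]
errors = [0.2 * m ** -0.5 for m in ms]
beta = fit_power_law_exponent(ms, errors)
print(round(beta, 3))
```

An exponent near 0.5 is what one would expect if the error were dominated by sampling noise, whose relative size shrinks as the inverse square root of the number of mutations.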
For 400 mutations, for example, three tools (mmsig, SigProfilerSingleSample, and SigProfilerAssignment) outperform the other tools by a factor of two or more, on average. For signature SBS5 in particular, SigProfilerSingleSample and SigProfilerAssignment are the best methods by a sizeable margin for the number of mutations between 100 and 1,000.
Averaged over all signatures, mmsig is the best method for all numbers of mutations except for 100 and 200 where SigProfilerAssignment is best by a small margin (see Fig. S4 in SI for the best method for each signature and the number of mutations).
Interestingly, the ranking of methods by the total weight assigned to false positives is substantially different, with sigLASSO, signature.tools.lib, and sigfit as the best performers when the number of mutations is small. All three methods have one characteristic in common: When the number of mutations is small, they cautiously assign low relative weights to all signatures, which directly leads to a low false positive weight despite a substantial fitting error.
Other methods produce relative signature weights that sum to one, and their false-positive rates are then often higher. Finally, it should be noted that the running time of the evaluated tools spans over nearly four orders of magnitude (Fig. S5 in SI) between SigsPack (the fastest method) and mmsig (the slowest method).
For some tools (SigProfilerSingleSample, deconstructSigs, mmsig), the running time increases with the signature fitting difficulty represented by the fitting error (Fig. S6 in SI) and with the number of signatures in the used catalog (Fig. S7 in SI).
Nevertheless, even the longest running times of several minutes per sample are acceptable for common cohorts comprising tens or hundreds of samples; fitting mutational signatures is not a bottleneck in genomic data analysis (Berger and Yu, 2022).
We now move to synthetic datasets with empirical heterogeneous signature weights. Here, we use absolute signature contributions (i.e., the number of mutations attributed to a signature) in WGS-sequenced tissues from various cancers as provided by the COSMIC website (see Methods and Fig. S8 in SI).
For further evaluation, we choose eight types of cancer with mutually distinct signature profiles (Fig. S9 in SI): Hepatocellular Carcinoma (Liver-HCC), Stomach Adenocarcinoma (Stomach-AdenoCA), Head and Neck Squamous Cell Carcinoma (Head-SCC), Colorectal Carcinoma (ColoRect-AdenoCA), Lung Adenocarcinoma (Lung-AdenoCA), Cutaneous Melanoma (Skin-Melanoma), Non-Hodgkin Lymphoma (Lymph-BNHL), and Glioblastoma (CNS-GBM).
The first four cancer types all have SBS5 as the main contributing signature but substantially differ in the subsequent signatures. The following four cancer types have different strongly contributing signatures: SBS4, SBS7a, SBS40, and SBS1.
The relative signature weights in individual samples were then used to construct synthetic datasets with a given number of mutations, allowing us to assess the performance of the fitting tools in realistic settings.
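One plausible way to build such a synthetic sample is to mix the signature profiles according to the relative weights and then draw the requested number of mutations from the mixture. The sketch below assumes this multinomial sampling scheme (the exact procedure is described in Methods and may differ) and uses two hypothetical four-context signatures, `SigA` and `SigB`, in place of real 96-context profiles.

```python
import random


def synthetic_catalog(weights, signatures, n_mutations, rng):
    """Draw a mutation catalog from the weighted mixture of signature profiles."""
    n_contexts = len(next(iter(signatures.values())))
    # Expected context distribution of the sample: sum_s w_s * profile_s.
    mixture = [
        sum(w * signatures[name][ctx] for name, w in weights.items())
        for ctx in range(n_contexts)
    ]
    counts = [0] * n_contexts
    for ctx in rng.choices(range(n_contexts), weights=mixture, k=n_mutations):
        counts[ctx] += 1
    return counts


# Hypothetical toy signatures over four contexts (real profiles have 96).
signatures = {
    "SigA": [0.7, 0.1, 0.1, 0.1],
    "SigB": [0.1, 0.1, 0.1, 0.7],
}
weights = {"SigA": 0.8, "SigB": 0.2}

counts = synthetic_catalog(weights, signatures, 2000, random.Random(1))
print(counts, sum(counts))
```

Because the weights are relative, the same sample composition can be reused to generate catalogs with any chosen number of mutations.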
Compared with the previous scenario with single-signature samples, there are now two main differences.
First, nearly all samples have more than one active signature (the highest number of active signatures in one sample is eleven).
Second, signature contributions differ from one sample to another; the average cosine distance between signature contributions in different samples ranges from 0.19 for Liver-HCC to 0.50 for ColoRect-AdenoCA. This scenario is thus referred to as heterogeneous cohorts.
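The heterogeneity measure used here can be sketched as the mean cosine distance over all pairs of samples, where each sample is a vector of signature contributions. The pairwise averaging and the toy vectors below are assumptions for illustration.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_cosine_distance(samples):
    """Average cosine distance over all unordered pairs of samples."""
    pairs = list(combinations(samples, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Hypothetical signature contributions of three samples (two signatures).
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
heterogeneity = mean_pairwise_cosine_distance(samples)
print(round(heterogeneity, 3))
```

A value of zero means that all samples share the same relative composition; larger values indicate a more heterogeneous cohort.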
Figure 3 shows that heterogeneous cohorts are more difficult to fit than the previously studied single-signature cohorts. For 2,000 mutations, for example, the lowest fitting error is above 0.05 (achieved by SigProfilerAssignment) whereas, in single-signature cohorts, the fitting error of mmsig for the same number of mutations is below 0.01 for all signatures.
This is a direct consequence of heterogeneous signature weights: Even when the total number of mutations is as high as 10,000, a signature with a relative weight of 2% contributes only 200 mutations and, as we have seen, the fitting errors are high for such a small number of mutations.
We see that for 100 mutations per sample, SigProfilerSingleSample has the lowest fitting error. For 2,000 mutations, SigProfilerAssignment becomes the best method, with mmsig a close second. Finally, for 50,000 mutations, SigProfilerAssignment is the best method by a large margin.
For 100 mutations per sample, false positive weights are substantial for all methods except sigfit and signature.tools.lib (both of which nevertheless have high fitting errors). They remain substantial for the methods based on non-negative least squares (represented by MutationalPatterns in Figure 3) even at 2,000 and, partially, 50,000 mutations. Interestingly, the performance differs greatly by cancer type (Fig. S10 in SI).
For 2,000 mutations, three different methods (SigProfilerAssignment, SigProfilerSingleSample, and mmsig) perform best for specific cancer types. For 50,000 mutations, the mean fitting error achieved by mmsig differs between Skin-Melanoma and CNS-GBM by a factor of 5.
One of the best-performing tools, SigProfilerSingleSample, reports a sample reconstruction similarity score that is often used to remove the samples whose reconstruction score is low.
Our results show that this score is strongly influenced by the number of mutations and that its absolute value is not a good indicator of the fitting accuracy (Fig. S11 in SI). The same holds for SigProfilerAssignment (Fig. S12 in SI).
Figure 3: A comparison of signature fitting methods for heterogeneous cohorts. Mean fitting error (top row) and mean total weight assigned to false positive signatures (bottom row) for different numbers of mutations per sample (columns) for all evaluated fitting tools.
The results are averaged over 50 cohorts from eight cancer types (see Methods); all 67 COSMICv3 signatures were used for fitting. The best-performing tool in each panel is marked with a frame.
Results are not shown for SigsPack, YAPSA, and all three variants of sigminer as they are close (fitting error correlation above 0.999) to the results of MutationalPatterns.
The chosen reference catalog of signatures can significantly impact the performance of fitting algorithms (Koh et al., 2021; Rustad et al., 2021). When the newer COSMICv3.2 catalog of mutational signatures is used as a reference instead of COSMICv3, the fitting error increases for most methods (Fig. S13 in SI) due to an increased number of signatures (from 67 to 78), which makes the methods more prone to overfitting.
The more sophisticated methods (e.g., mmsig and signature.tools.lib) nevertheless manage to maintain their performance when the number of mutations per sample is high.
On the other hand, when the reference signatures are constrained to the signatures that have been previously observed for a given cancer type and to artifact signatures (Koh et al., 2021), the fitting error decreases substantially for most methods (Fig. S14 in SI).
Differences between methods then become smaller (Fig. S15 in SI) because well-performing sophisticated methods profit less from reducing the reference catalog, and this is particularly true when the number of mutations per sample is large.
To better illustrate the effect of reducing the number of reference signatures, we use three selected methods (SigsPack which together with MutationalPatterns and sigminer is the most sensitive to the reference catalog, SigProfilerSingleSample, and sigLASSO) and classify their output to true positive signatures, false positive signatures that are relevant to a given cancer type and false positive signatures that are irrelevant to a given cancer type (Figure 4).
Using the whole COSMICv3 as a reference (top row), a simple method (SigsPack) starts with 17% of relevant false positives and 56% of irrelevant false positives. For comparison, these numbers are only 21% and 27% for SigProfilerSingleSample and 8% and 23% for sigLASSO (which, furthermore, leaves 50% of the mutations unassigned).
When only relevant signatures are used as reference (bottom row), SigsPack improves much more than the two other methods. This further demonstrates how simple methods are particularly sensitive to the reference catalog and overfitting.
Another interesting observation is that while, regardless of the reference catalog, SigsPack and sigLASSO tend to zero false positives as the number of mutations increases, this is not the case for SigProfilerSingleSample (as shown in Fig. S9 in SI, SigProfilerSingleSample performs poorly for Stomach-AdenoCA used in Figure 4).
SigProfilerSingleSample evidently has a powerful algorithm to infer the active signatures from few mutations, but it does not converge to true signature weights when the mutations are many.
Figure 4: The effect of the reference signature catalog on the fitting results. As the number of mutations per sample increases, weights assigned to false positives decrease (relevant false positives: excess weights assigned to signatures that are actually active in the samples; irrelevant false positives: weights assigned to signatures that are not active).
We used simulated input data with 100 samples and Stomach-AdenoCA signature weights. The reference signature catalog is COSMICv3 (top row) and the 18 signatures that are active in the Stomach-AdenoCA samples (bottom row).
Nevertheless, relying on a pre-determined list of relevant signatures is problematic for a number of reasons.
First, the lists of relevant signatures are likely to change over time. Four of the eight considered cancer types have cohorts with fewer than 100 patients, so adding more WGS-sequenced tissues to the catalog is likely to significantly expand the list of signatures that are active in them.
Second, most tools have difficulty recognizing that the provided signature catalog is insufficient even when the number of mutations in a sample is very large (Fig. S16 in SI).
Third, when the list of relevant signatures is obtained from the COSMIC website, we rely on results obtained with one specific tool and this tool can be biased.
We thus employ a different approach, which is based on fitting signatures using the whole COSMICv3 catalog (step 1) and then constraining the reference catalog to the signatures that sufficiently occur in the obtained results (step 2, see Methods for details).
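The selection in step 2 can be sketched as a simple filter over the step 1 results. The criterion below (a signature is kept if its fitted relative weight reaches `min_weight` in at least `min_fraction` of the samples) and both threshold values are hypothetical placeholders; the actual criterion is the one given in Methods.

```python
def select_reference_signatures(fitted, min_weight=0.05, min_fraction=0.1):
    """Keep signatures that occur sufficiently often in the step 1 results.

    fitted: one dict per sample, mapping signature name -> relative weight.
    min_weight, min_fraction: hypothetical thresholds, for illustration only.
    """
    n_samples = len(fitted)
    names = sorted({name for sample in fitted for name in sample})
    kept = []
    for name in names:
        n_active = sum(1 for sample in fitted if sample.get(name, 0.0) >= min_weight)
        if n_active / n_samples >= min_fraction:
            kept.append(name)
    return kept


# Hypothetical step 1 output for four samples fitted with the full catalog.
step1 = [
    {"SBS1": 0.30, "SBS5": 0.70},
    {"SBS1": 0.20, "SBS5": 0.80},
    {"SBS1": 0.40, "SBS5": 0.60},
    {"SBS1": 0.10, "SBS5": 0.85, "SBS22": 0.05},
]
reduced_catalog = select_reference_signatures(step1, min_fraction=0.5)
print(reduced_catalog)
```

The reduced catalog would then be passed to the same fitting tool for the second fit. Stricter thresholds remove more spurious signatures but also risk dropping signatures that are truly active in only a few samples.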
When the two-step fitting process is applied (Figure 5), the performance of some methods changes considerably, but the best methods when averaged over all cancer types are the same as in Figure 3 where all COSMICv3 signatures were used as a reference.
SigProfilerSingleSample and SigProfilerAssignment are the best methods when the number of mutations per sample is small; Fig. S17 in SI shows that the two-step process is beneficial in this case.
For a higher number of mutations per sample, SigProfilerAssignment is the best method, but there are some close competitors. When the number of mutations per sample is high, constraining the set of reference signatures allows simple methods, such as MutationalPatterns, to compete with sophisticated methods.
However, a closer inspection (Fig. S17 in SI) reveals that this is because the sophisticated methods (e.g., SigProfilerAssignment and sigLASSO) are hampered by the two-step selection process. In other words, our ad hoc procedure of choosing which subset of signatures to use as a fitting reference is inferior to the statistical selection mechanisms built into the methods themselves, in particular when the number of mutations per sample is high.
Using a less strict two-step selection process can avoid its adverse effects on performance when the number of mutations is high at the cost of a much smaller improvement when the number of mutations is small (Fig. S18 in SI).
A different multi-step process was proposed by Maura et al. (2019) to deal with under- and over-fitting.
Signatures are first extracted de novo, each extracted signature is then assigned to one or two known reference signatures (see Methods for details), and thus-identified reference signatures are then used for fitting. However, this approach is either less effective than the two-step process introduced before or does not yield any improvement compared to using all COSMICv3 signatures (Fig. S19 in SI).
Figure 5: A comparison of signature fitting methods for heterogeneous cohorts when the set of reference signatures is determined from the data. Except for using the two-step selection process for the reference signatures, all parameters are as in Figure 3.
Figure 6: Performance of four fitting methods in identifying systematic differences in mutational weights between two groups of samples. Test sensitivity is defined as the fraction of 200 synthetic cohorts where a statistically significant difference (Wilcoxon rank-sum test, p-value below 0.05) between the two groups of samples is found. The shaded area shows the 95% confidence interval (Wilson score interval).
To summarize the results presented in Figures 3 and 5, the proposed two-step process for narrowing down the list of reference signatures is only helpful for samples with 100 mutations. SigProfilerSingleSample is then the best method for all eight studied cancer types. For 2,000 and 50,000 mutations, it is best to use the complete COSMIC catalog as a reference.
For 2,000 mutations, three different methods are then best for various cancer types (SigProfilerAssignment for four of them, mmsig for three, and SigProfilerSingleSample for one). For 50,000 mutations, SigProfilerAssignment is best for all eight cancer types.
SigProfilerAssignment starts to outperform SigProfilerSingleSample between 450 and 10,000 mutations per sample, depending on the cancer type (Fig. S20 in SI).
We have focused so far on estimates of signature weights and their errors with respect to a well-defined ground truth. In many cases, however, the estimates are only important as a part of further downstream analyses assessing the correlations of signature weights with various clinicopathological parameters.
We present here a simplified example of such an analysis by creating synthetic cohorts with CNS-GBM signature weights where systematic differences in the weights of signature SBS40 are introduced (see Methods for details). Estimation errors, in interaction with the actual magnitude of the effect and the cohort size, are crucial to the ability to detect a significant difference in the signature weights.
We contrast three fitting tools: the simple MutationalPatterns, SigProfilerSingleSample, which performs particularly well when the number of mutations is small, and the recent signature.tools.lib.
When the number of mutations is large (10,000 in Figure 6), all tools are sufficiently precise to establish a statistically significant difference between the two groups of samples as reliably as when the true signature weights are used.
By contrast, when the number of mutations is smaller, the choice of the fitting tool matters. At 1,000 mutations per sample, SigProfilerSingleSample estimates are still indistinguishable from true signature weights, while the two other tools perform worse, in particular when the cohort size is small.
At 100 mutations per sample, SigProfilerSingleSample still has some statistical power to detect a difference in SBS40 activities between the two groups of samples whereas MutationalPatterns and signature.tools.lib fail regardless of the cohort size.
When the signature of interest is easier to fit than SBS40 used in Figure 6, the differences between fitting tools become smaller (Fig. S21 in SI).
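The confidence band around the test sensitivity in Figure 6 is a Wilson score interval for a binomial proportion. The sketch below implements the standard formula; the count of 150 significant cohorts out of 200 is a made-up example, not a value read from the figure.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half


# Hypothetical example: a significant difference found in 150 of 200 cohorts.
low, high = wilson_interval(150, 200)
print(f"sensitivity 0.75, 95% CI ({low:.3f}, {high:.3f})")
```

Unlike the simple normal approximation, this interval stays within [0, 1] even when the sensitivity is close to zero or one, which matters at 100 mutations per sample where some tools rarely detect the difference.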