I have a dataset of somatic mutations in multiple samples. Some samples have a very low mutation burden (~20 mutations), while others have a much higher one (~2000 mutations). I understand that signature estimation for low-mutation samples will be unreliable, but will including these samples affect the signature decomposition?
In summary, can I keep low-mutation samples in my analysis without much effect on the decomposition, or is it better to remove them before the analysis?
The general answer is most likely "It depends on the data". However, here are some reasons why it is good to remove samples with few mutations from the beginning:
Samples with few mutations are uninformative. Even if we can identify mutational signatures in the data, there will be too little information to determine which signatures are present in such a sample (matrix H). While we do get an estimate back, it will be dominated by noise and will not be reliable.
If we add more of these uninformative samples to our motif matrix M, we transition from a small, dense matrix to a large, sparse (many zeros) one. The decomposition of M, which is the step we need for inferring the signatures, becomes (a) computationally more expensive and, worse, (b) more likely to fail. In this sense, failing means that we can find many possible solutions, while none of them really represents our data well.
Some decomposition methods are more fragile with regard to sparse matrices and may require more tuning.
Since these samples are uninformative, we will drop them later in the analysis anyway.
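To make the sparsity argument concrete, here is a minimal sketch in plain Python (not the package's API). It assumes the 96-motif spectrum mentioned in this thread; the function names are hypothetical and only for illustration. A sample with n mutations can fill at most n of the 96 motif cells, so a lower bound on the fraction of empty cells follows directly.

```python
N_MOTIFS = 96  # size of the mutational spectrum discussed in this thread


def min_zero_fraction(n_mutations, n_motifs=N_MOTIFS):
    """Lower bound on the fraction of empty motif cells for one sample.

    Each mutation falls into exactly one motif, so at most
    n_mutations cells can be non-zero.
    """
    return max(0, n_motifs - n_mutations) / n_motifs


def zero_fraction(counts):
    """Observed fraction of zero cells in one sample's motif counts."""
    return sum(1 for c in counts if c == 0) / len(counts)


# A sample with ~20 mutations leaves at least 76 of 96 cells empty,
# while a sample with ~2000 mutations has no such constraint.
low_bound = min_zero_fraction(20)
high_bound = min_zero_fraction(2000)

# Toy sample (made-up counts): 20 mutations spread over 15 motifs.
toy_low = [2] * 5 + [1] * 10 + [0] * 81
toy_low_zero = zero_fraction(toy_low)
```

The bound is only a best case: real samples with few mutations usually hit some motifs repeatedly, so the observed fraction of zeros tends to be higher.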
An alternative to dropping samples with few mutations is grouping them according to other covariates. See the data in the vignette of the package: there, each sample contains few somatic SNVs. Grouping them by cancer type allows us to get well-estimated signatures. It also allows us to see directly which signature is dominant in which cancer type, which is often more interesting than inference on a single sample.
Thanks for the answer. Do you have a general estimate of the minimum number of mutations needed for a reliable estimation of the signatures? I suppose that this number would depend on the strength of the signature.
Hi,
Do you have an idea of a reasonable threshold for removing samples with few mutations (e.g. samples with <20 mutations)?
Thank you
There will be no general threshold, and a reasonable choice will have to be driven by the data, especially by the strength of the underlying signatures. You can also have a look at Determining number of signatures in SomaticSignatures for a related discussion. The mutational spectrum is composed of 96 motifs for each sample, and one would at least require this matrix to be dense. While I can't give general advice on what is sufficient, I would expect to get meaningful and reliable estimates with a minimum of a few hundred mutations per sample/group.
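As a rough illustration of filtering on mutation burden, the following plain Python sketch drops samples below a cutoff. The cutoff of 200 is only a placeholder for "a few hundred" and is an assumption, not a recommended value; the function name, sample names, and toy counts are likewise hypothetical, and this is not the package's API.

```python
def filter_samples(motif_counts, min_mutations=200):
    """Split samples by total mutation count.

    motif_counts: dict mapping sample name -> list of 96 motif counts.
    min_mutations: placeholder cutoff; choose it from your own data.
    Returns (kept, dropped), where kept is a dict and dropped a sorted list.
    """
    kept = {s: c for s, c in motif_counts.items() if sum(c) >= min_mutations}
    dropped = sorted(s for s in motif_counts if s not in kept)
    return kept, dropped


# Toy data (made up): one sample with 20 mutations, one with 2016.
samples = {
    "low_burden": [1] * 20 + [0] * 76,
    "high_burden": [21] * 96,
}
kept, dropped = filter_samples(samples)
```

Since the right cutoff depends on the strength of the underlying signatures, it can be worth rerunning the decomposition with a few different values and checking how stable the signatures are.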
Instead of trying to identify signatures on a per-sample basis, one could try to define another relevant grouping and aggregate the variants by these groups. The case study covered in the vignette uses this approach, where signatures are estimated at the level of studies.
Thanks Julian. When clustering samples by study, for example, should I sum up the counts for each motif over all samples of a study, to get a new matrix where the number of columns equals the number of studies?
Please have a look at the vignette of the package, which covers exactly this use case and, more importantly, has the functionality implemented to do all of this for you.
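For readers who want to see the idea of aggregating by group spelled out, here is a minimal plain Python sketch. It is not the package's implementation (the vignette covers that); the function names, sample names, study labels, and counts are all hypothetical. It sums the motif counts of all samples belonging to the same study, giving one column of 96 counts per study, and optionally converts counts to relative frequencies.

```python
def aggregate_by_group(motif_counts, group_of):
    """Sum motif counts over all samples that share a group label.

    motif_counts: dict mapping sample name -> list of motif counts.
    group_of: dict mapping sample name -> group label (e.g. study).
    Returns a dict mapping group label -> summed motif counts.
    """
    grouped = {}
    for sample, counts in motif_counts.items():
        group = group_of[sample]
        if group not in grouped:
            grouped[group] = [0] * len(counts)
        grouped[group] = [a + b for a, b in zip(grouped[group], counts)]
    return grouped


def to_frequencies(counts):
    """Convert counts to relative frequencies (unchanged if all zero)."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)


# Toy data (made up): three sparse samples from two studies.
samples = {
    "s1": [1] * 20 + [0] * 76,
    "s2": [0] * 76 + [1] * 20,
    "s3": [2] * 96,
}
study = {"s1": "study_A", "s2": "study_A", "s3": "study_B"}

by_study = aggregate_by_group(samples, study)
freq_A = to_frequencies(by_study["study_A"])
```

The result has one entry per study rather than per sample, so the aggregated matrix is smaller and denser than the per-sample one.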