The general answer is most likely "It depends on the data". However, here are some reasons why it is good to remove samples with few mutations from the start:
- Samples with few mutations are uninformative. Even if we can identify mutational signatures in the data, there will not be enough information to determine which signatures are present in such a sample (matrix H). We still get an estimate back, but it will be dominated by noise and will not be reliable.
- If we add more of these uninformative samples to our motif matrix M, we transition from a small, dense matrix to a large, sparse (many zeros) one. The decomposition of M, which is the step we need for inferring the signatures, becomes (a) computationally more expensive and, worse, (b) more likely to fail. In this sense, failing means that we can find many possible solutions, while none of them really represents our data well.
- Some decomposition methods are more fragile with regard to sparse matrices and may require more tuning.
- Since these samples are uninformative, we will drop them later in the analysis anyway.
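As a minimal sketch of the filtering step, assuming the motif matrix M is held as one list of per-motif counts for each sample: the sample names, the counts, and the cutoff `MIN_MUTATIONS` are all made up for illustration, and a sensible cutoff depends on your data.

```python
# Hypothetical motif counts: one list of per-motif counts for each sample.
motif_counts = {
    "sample_a": [12, 30, 7, 25],
    "sample_b": [0, 1, 0, 0],
    "sample_c": [40, 18, 22, 35],
    "sample_d": [0, 0, 2, 0],
}

MIN_MUTATIONS = 20  # assumed cutoff, not a recommended value


def drop_sparse_samples(counts, min_mutations):
    """Keep only samples whose total mutation count reaches the cutoff."""
    return {
        sample: motifs
        for sample, motifs in counts.items()
        if sum(motifs) >= min_mutations
    }


filtered = drop_sparse_samples(motif_counts, MIN_MUTATIONS)
print(sorted(filtered))
```

The decomposition would then be run on `filtered` only, so the near-empty columns never enter M.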
An alternative to dropping samples with few mutations is to group them by other covariates. See the data in the vignette of the package: there, each sample contains few somatic SNVs. Grouping them by cancer type gives us well-estimated signatures. It also lets us see directly which signature is dominant in which cancer type, which is often more interesting than inference on a single sample.
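The grouping idea can be sketched as follows. This is only an illustration with invented sample names, counts, and cancer-type labels, not the code used in the vignette: the per-sample motif counts are summed within each group, so that each column of M then represents a cancer type rather than a single sample.

```python
# Hypothetical per-sample motif counts and a sample-to-group annotation.
sample_counts = {
    "s1": [1, 0, 2, 0],
    "s2": [0, 1, 1, 0],
    "s3": [0, 0, 0, 3],
    "s4": [1, 0, 0, 2],
}
cancer_type = {"s1": "gbm", "s2": "gbm", "s3": "luad", "s4": "luad"}


def pool_by_group(counts, groups):
    """Sum the motif counts of all samples that share a group label."""
    pooled = {}
    for sample, motifs in counts.items():
        label = groups[sample]
        current = pooled.get(label, [0] * len(motifs))
        pooled[label] = [a + b for a, b in zip(current, motifs)]
    return pooled


pooled = pool_by_group(sample_counts, cancer_type)
for label in sorted(pooled):
    print(label, pooled[label])
```

Each pooled column carries more mutations than any of its member samples, which is what makes the decomposition better behaved.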