The SomaticSignatures package uses RSS and explained variance to assess the best number of signatures. In the Alexandrov et al. paper (Cell Reports 3, 246-259), they used cosine similarity and the Frobenius reconstruction error to determine the number of signatures.
Has anyone compared the two approaches and found a difference in the number of signatures they determine?
RSS and explained variance are two general measures of how far the reconstructed matrix deviates from the observed mutational spectrum. General here also means that they can be applied to any decomposition method and do not make strong assumptions about the data, such as non-negativity, which makes them suitable for a wide range of approaches, for example PCA. The measures proposed by Alexandrov and colleagues are more tailored to NMF and, in the context of assessing the reconstruction error, to randomly seeded NMF decompositions. This is very specific, and hence cannot be used directly while having the flexibility in the matrix decomposition. Comparing the different approaches feels very much like comparing apples with bananas. In my experience, the estimated number of signatures is fairly stable across a wide range of statistics, given good data. However, the number of signatures suggested by any method should not be trusted blindly, but rather be interpreted with biological reasoning.
If you want to compare different statistics directly, the best approach is of course to perform the comparison yourself on your own data. This way, you avoid the problem that individual methods may perform differently depending on the characteristics of the dataset, and you can test the full range of methods that are applicable to your analysis.
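As a rough illustration of what such a comparison could look like, here is a minimal sketch in plain Python (not the SomaticSignatures code itself). It assumes one common definition of explained variance, 1 - RSS / sum(V^2), and uses a made-up toy matrix; the function names and the factors `w` and `h` are hypothetical. The cosine similarity is computed here between an observed and a reconstructed sample spectrum purely as an example of a similarity measure, and is not meant to reproduce the procedure of Alexandrov and colleagues.

```python
import math

def matmul(w, h):
    """Multiply W (motifs x signatures) by H (signatures x samples)."""
    return [[sum(w[i][k] * h[k][j] for k in range(len(h)))
             for j in range(len(h[0]))]
            for i in range(len(w))]

def rss(observed, fitted):
    """Residual sum of squares between observed and reconstructed matrix."""
    return sum((o - f) ** 2
               for row_o, row_f in zip(observed, fitted)
               for o, f in zip(row_o, row_f))

def explained_variance(observed, fitted):
    """Assumed definition: 1 - RSS / sum of squared observed entries."""
    total = sum(o ** 2 for row in observed for o in row)
    return 1.0 - rss(observed, fitted) / total

def frobenius_error(observed, fitted):
    """Frobenius norm of the residual, i.e. the square root of the RSS."""
    return math.sqrt(rss(observed, fitted))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def column(matrix, j):
    """Extract column j (one sample's spectrum) from a matrix."""
    return [row[j] for row in matrix]

# Toy data (made up): 3 motifs x 2 samples, and a rank-2 factorisation.
observed = [[4.0, 1.0], [2.0, 3.0], [0.0, 2.0]]
w = [[2.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # hypothetical signatures
h = [[2.0, 0.5], [0.0, 2.0]]              # hypothetical contributions
fitted = matmul(w, h)

print("RSS:", rss(observed, fitted))
print("Explained variance:", explained_variance(observed, fitted))
print("Frobenius error:", frobenius_error(observed, fitted))
print("Cosine similarity, sample 1:",
      cosine_similarity(column(observed, 0), column(fitted, 0)))
```

None of these functions looks at how `fitted` was obtained, so the same statistics can be computed for an NMF or a PCA reconstruction and plotted against the number of signatures.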
Thanks, Julian, for the explanation. Could you please explain this sentence in more detail:
"This is very specific, and hence cannot be used directly while having the flexibility in the matrix decomposition"
Which flexibility do you mean? SomaticSignatures uses NMF for decomposition, so is there a need for flexible measures?
SomaticSignatures provides a framework to identify mutational signatures with matrix decomposition methods in general, with concrete implementations for NMF and PCA. Measures that depend on specific properties of the decomposition, e.g. non-negativity of the decomposed matrices W and H, are therefore more restricted than measures that can be used in a general context, and hence we prefer the latter.
Thank you for the clarification.