Dear Julian,
I have only a basic understanding of the SomaticSignatures package. I've worked out how to get an output, but need a bit of help with comparing my results.
Our analysis:
Briefly, we have ~80 samples under 4 conditions (groups) and have inferred 7 signatures for these, and we are trying to find out the specific signature per group. Normalisation was done by creating a probability score in each sample using the function,
normalize <- function(x) x / sum(x)
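For illustration, here is a minimal sketch of what this normalisation does when applied per sample (column). The count matrix and its motif/sample names are made up for the example and are not our data.

```r
normalize <- function(x) x / sum(x)

# Hypothetical toy counts: 4 motifs (rows) x 3 samples (columns)
counts <- matrix(c(10, 0, 5, 5,
                    2, 2, 2, 2,
                    0, 1, 0, 3),
                 nrow = 4,
                 dimnames = list(paste0("motif", 1:4), paste0("sample", 1:3)))

# Apply the normalisation to each column, so every sample sums to 1
probs <- apply(counts, 2, normalize)

colSums(probs)
```

After this step each column can be read as a probability distribution over motifs, which puts it on the same scale as the published signature tables.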
Problem:
I'm trying to compare our inferred signatures from SomaticSignatures with its data(signature21) as well as with the latest 30 signatures, data(signature30), from http://cancer.sanger.ac.uk/cancergenome/assets/signatures_probabilities.txt (again, I'm not sure
if these values can be used directly like data(signature21) or need further processing, but they seem to have the same range of values), i.e.,
# data(signature21) : 21 Somatic signatures range : 0.0000 0.4246
# data(signature30) : 30 Somatic signatures range : 0.0000000 0.4199414
Not much correlation between 21 Somatic signatures & 30 Somatic signatures
- It seems that data(signature21) and data(signature30) are quite different from each other, although our results seem to concur more with data(signature21). It seems like S18 from data(signature21) is more correlated with Signature8 or Signature3 from data(signature30). Could the names of the signatures have changed in the latest update?
- I'm not sure what S1A, S1B, SR1, SR2, SR3, SU1, SU2 from data(signature21) mean and what they correspond to in the updated version.
Comparing published signatures
I've tried 2 things here, but have yet to figure out which is the better approach.
a) Using correlation
cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))
b) Using cosine similarity
library(lsa)
cosine(x, y = NULL)
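To make the two approaches concrete, here is a minimal sketch in base R that compares every inferred signature (column) with every published signature (column) and reports the best match. Everything in it is hypothetical: the matrices are random toy data, and column_cosine is a small helper written for this example as a stand-in for lsa::cosine. One thing it is meant to show is that both measures compare rows by position, so the motifs (rows) of the two matrices need to be in the same order before calling cor or a cosine function; whether that plays a role for the data(signature30) results here is only a guess.

```r
# Hypothetical helper: cosine similarity between the columns of a and b
column_cosine <- function(a, b) {
  crossprod(a, b) / outer(sqrt(colSums(a^2)), sqrt(colSums(b^2)))
}

set.seed(1)
motifs <- paste0("motif", 1:96)

# Toy "published" signatures: 96 motifs x 3 signatures, columns sum to 1
published <- matrix(runif(96 * 3), nrow = 96,
                    dimnames = list(motifs, c("P1", "P2", "P3")))
published <- sweep(published, 2, colSums(published), "/")

# Toy "inferred" signatures: copies of P2 and P3, stored in another row order
ours <- published[sample(motifs), c("P2", "P3")]
colnames(ours) <- c("S1", "S2")

# Comparing without aligning rows pairs up the wrong motifs
misaligned <- cor(ours, published)

# Align rows by motif name, then compare
ours_aligned <- ours[rownames(published), ]
pearson <- cor(ours_aligned, published)
cosine_sim <- column_cosine(ours_aligned, published)

# Best-matching published signature for each inferred signature
best_pearson <- colnames(published)[apply(pearson, 1, which.max)]
best_cosine <- colnames(published)[apply(cosine_sim, 1, which.max)]
best_pearson
best_cosine
```

Pearson correlation centres each signature before comparing, while cosine similarity does not, so the two can rank matches differently for real signatures even though they agree on this toy input.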
We normalised the data by creating a probability score in each sample (using the normalize function given above) and assessed the number of signatures to be n=7.
Note: Showing only the correlation output here [cosine similarity is similar for the data(signature21) matrix but not data(signature30)]
Example output of Correlation between 21 Somatic signatures.
Highest correlation with our S1 : Signature- S18,0.761870210035029
Highest correlation with our S2 : Signature- S18,0.895449441547398
Highest correlation with our S3 : Signature- S18,0.8176733925205
Highest correlation with our S4 : Signature- S18,0.66299142221862
Highest correlation with our S5 : Signature- S1B, 0.613441745488655
Highest correlation with our S6 : Signature- S1B, 0.665963561771328
Highest correlation with our S7 : Signature- S5, 0.472316222871848
Example output of Correlation between 30 Somatic signatures.
Highest correlation with our S1 : Signature.30 0.250721518256121
Highest correlation with our S2 : Signature.8 0.230023105084011
Highest correlation with our S3 : Signature.8 0.256823544093413
Highest correlation with our S4 : Signature.25 0.331866927022225
Highest correlation with our S5 : Signature.27 0.353397110264527
Highest correlation with our S6 : Signature.28 0.255187187591234
Highest correlation with our S7 : Signature.25 0.312821309032734
Any idea why the output is different when using data(signature21) and data(signature30)?
- Also, would using somatic spectrum motifMatrix values (sca_mm, from package example) to compare published signatures and spectrum of our individual samples make sense?
- On another note, when using the functions plotObservedSpectrum and plotFittedSpectrum I get the error "n too large, allowed maximum for palette Set3 is 12".
- I think this is because the package is limited to 12 colours, so I don't get an output for the remaining samples.
Do let me know if the above requires clarification.
Looking forward to your comments.
Kind regards,
John
"Briefly, we have ~80 samples under 4 conditions (groups) and have inferred 7 signatures for these for which we are trying to find out the specific signature per group."
I'm not sure if I understand this correctly: If you have grouped your variant calls into 4 groups during the analysis, you can't estimate more than 4 signatures from this. Can you provide some more details on how you analysed the data?
"Seem that the correlation between data(signature21) and data(signature30) seem to be quite different although our results seem to concur with data(signature21) more. It seems like S18 from data(signature21) is more correlated with Signature8 or Signature3 from data(signature30). Could the names of signatures have changed in the latest update?
Not sure what I understand by S1A,S1B,SR1,SR2,SR3,SU1,SU2 from data(signature21) and what they correspond to in the updated version."
The data in signatures21 has been taken from the Alexandrov, 2013 publication on mutational signatures (see the help of signatures21 for the full details). The 30 signatures published by Sanger/COSMIC (which you have imported) may follow a different naming scheme. There is likely no direct 1:1 mapping of signatures between the two data sets. I would suggest contacting the COSMIC helpdesk; they should be able to give you more details on how they created the mutational signature catalog.
The names in signatures21 come directly from the Alexandrov, 2013 publication - please have a look at the paper and especially the supplement for more details about S1A, S1B, SR1, SR2, SR3, SU1, SU2.
If I understood your approach correctly, the different correlations with the signatures from the two published data sets are due to the reasons outlined in my second comment. You may get a better overview by plotting your estimated signatures and the signatures of both data sets - this is often more informative than just looking at the correlation coefficient alone.
"On another note, when using functions plotObservedSpectrum, plotFittedSpectrum I get an error " n too large, allowed maximum for palette Set3 is 12”. I think this is because the package is limited to 12 colours so i don't get an output with the remaining samples."
The default colour palette defines 12 colours (it is fairly hard to find a colour palette with many easily distinguishable colours). You can change the colour palette and choose another one if you have more than 12 items to show - you can have a look at section 4.5.1 "Customization: Changing Plot Properties" in the vignette for an example.
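As a rough sketch of one way to get more than 12 colours, base R's colorRampPalette can interpolate a longer palette from a handful of anchor colours. The anchor colours and the value of n_items below are arbitrary choices for illustration. How the palette is attached to the plot is an assumption on my side: it presumes the plotting functions return ggplot objects, so please check the vignette section mentioned above for the exact scale to override.

```r
# Hypothetical number of items to colour, e.g. one colour per sample
n_items <- 80

# Arbitrary anchor colours; any set of distinguishable colours would do
anchor_colours <- c("#1B9E77", "#D95F02", "#7570B3", "#E7298A",
                    "#66A61E", "#E6AB02", "#A6761D", "#666666")

# Interpolate between the anchors to obtain n_items colours
many_colours <- colorRampPalette(anchor_colours)(n_items)
head(many_colours)

# Assumption (not run here): if the plot is a ggplot object, the palette
# could be applied along the lines of
#   p <- plotObservedSpectrum(...)
#   p + ggplot2::scale_fill_manual(values = many_colours)
```

With this many interpolated colours, neighbouring items will be hard to tell apart, so facetting or plotting subsets of samples may still be the more readable option.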