Question

Query regarding SomaticSignatures Bioconductor package


Guest User ★ 13k

@guest-user-4897


Hi all,

I have been using the SomaticSignatures Bioconductor package to predict signatures, following the examples provided in the vignette. However, I have some questions.

1. I have data from a single study (AML) with mutations obtained from 14 patients. In this case, how do I group the data? If I group the data by "study" as in the vignette, I get an error when running the nmfSignatures function. (I guess this is because the matrix (sca_occurance) has only one column, corresponding to the single study performed.) Can I group by patient (sampleNames) instead?

2. How do I choose the number R (number of signatures to obtain)? I guess it should be less than the number of columns of sca_occurances? A recent publication (Niccolò Bolli et al., 2013, Nat. Commun.) involving a single study (multiple myeloma with 52 patients) mentions that they found two signatures. Does this mean they set the number of signatures (the R argument in nmfSignatures()) to 2?

My apologies if the question is not in the proper format.

Thank you,
-Anand.
-- output of sessionInfo():

R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Singapore.1252  LC_CTYPE=English_Singapore.1252
[3] LC_MONETARY=English_Singapore.1252 LC_NUMERIC=C
[5] LC_TIME=English_Singapore.1252

attached base packages:
[1] parallel  stats  graphics  grDevices  utils  datasets  methods  base

other attached packages:
 [1] fastICA_1.2-0            stringr_0.6.2
 [3] exomeCopy_1.10.0         SomaticCancerAlterations_1.0.0
 [5] SomaticSignatures_1.0.1  Biobase_2.24.0
 [7] ggbio_1.12.4             ggplot2_1.0.0
 [9] reshape2_1.4             VariantAnnotation_1.10.1
[11] Rsamtools_1.16.0         Biostrings_2.32.0
[13] XVector_0.4.0            GenomicRanges_1.16.3
[15] GenomeInfoDb_1.0.2       IRanges_1.22.8
[17] BiocGenerics_0.10.0

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.26.0  BatchJobs_1.2            BBmisc_1.6
 [4] BiocParallel_0.6.1    biomaRt_2.20.0           biovizBase_1.12.1
 [7] bitops_1.0-6          brew_1.0-6               BSgenome_1.32.0
[10] cluster_1.15.2        codetools_0.2-8          colorspace_1.2-4
[13] DBI_0.2-7             dichromat_2.0-0          digest_0.6.4
[16] doParallel_1.0.8      fail_1.2                 foreach_1.4.2
[19] Formula_1.1-1         GenomicAlignments_1.0.1  GenomicFeatures_1.16.2
[22] grid_3.1.0            gridBase_0.4-7           gridExtra_0.9.1
[25] gtable_0.1.2          gtools_3.4.1             Hmisc_3.14-4
[28] iterators_1.0.7       labeling_0.2             lattice_0.20-29
[31] latticeExtra_0.6-26   MASS_7.3-31              munsell_0.4.2
[34] NMF_0.20.5            pcaMethods_1.54.0        pkgmaker_0.22
[37] plyr_1.8.1            proto_0.3-10             RColorBrewer_1.0-5
[40] Rcpp_0.11.2           RCurl_1.95-4.1           registry_0.2
[43] rngtools_1.2.4        RSQLite_0.11.4           rtracklayer_1.24.2
[46] scales_0.2.4          sendmailR_1.1-2          splines_3.1.0
[49] stats4_3.1.0          survival_2.37-7          tools_3.1.0
[52] XML_3.98-1.1          xtable_1.7-3             zlibbioc_1.10.0

-- Sent via the guest posting facility at bioconductor.org.


Hi Anand,

Julian Gehring suggested grouping the patients by sampleNames. Did that work?

In the sample data, the variable "sampleNames" seems to be of type "Rle". When I try to use it as the grouping variable, it doesn't work.

If you have any experience with this, please share it with me. Thank you.

Best,

shengfeng


score 0 · Answer 1 · 2014-06-10

Hi Anand,

> 1. I have data from a single study (AML) with mutations obtained from 14 patients. In this case, how do I group the data? If I group the data by "study" as in the vignette, I get an error when running the nmfSignatures function. (I guess it's because the matrix (sca_occurance) has only one column, corresponding to the single study performed.) Can I group it based on patients (sampleNames) instead?

You can group your variants by any variable that is present in the 'VRanges' object that contains your calls. The object behaves very much like a data frame, so you could add a column with

    x$sample = ... ## your 14 samples ##

and then group by it with

    motifMatrix(x, group = "sample")

If your samples are already stored in the column 'sampleNames', you can also refer to that (see '?mutationContext' for an example).

> 2. How do I choose the number R (number of signatures to obtain)? I guess it should be less than the number of columns of sca_occurances? A recent publication (Niccolò Bolli et al., 2013, Nat. Commun.) involving a single study (multiple myeloma with 52 patients) mentions that they found two signatures. Does this mean they set the number of signatures (the R argument in nmfSignatures()) to 2?

For estimating the number of signatures, there are several approaches. Whether and how well they perform depends largely on the input data; none of them will work reliably in all cases. For this reason, I haven't implemented an estimation for the number of signatures so far: I want to avoid giving a false sense of security/certainty.

On the practical side, most information will be contained in the first few signatures; increasing the number of signatures further will add little information. From a biological point of view, each signature should result from a different mutation-generating process. In your setting, with 14 patients suffering from the same type of cancer, one would expect a low number of such processes.
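The point that the first few signatures carry most of the information can be made visible by refitting with an increasing number of signatures and watching the residual. What follows is only a rough base-R sketch under stated assumptions: it uses a generic NMF with multiplicative updates rather than anything from SomaticSignatures, the helper name `nmf_rss` is made up for this illustration, and the 6 x 14 matrix is simulated from two profiles. In line with the caution above, treat the result as a visual aid, not as an estimator for the number of signatures.

```r
# Generic NMF via multiplicative updates (squared-error objective).
# Hypothetical helper for illustration; NOT the algorithm inside SomaticSignatures.
nmf_rss <- function(V, r, n_iter = 500, seed = 1) {
  set.seed(seed)
  W <- matrix(runif(nrow(V) * r), nrow(V), r)   # motifs x signatures
  H <- matrix(runif(r * ncol(V)), r, ncol(V))   # signatures x samples
  eps <- 1e-9                                   # guards against division by zero
  for (i in seq_len(n_iter)) {
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  sum((V - W %*% H)^2)                          # residual sum of squares
}

# Simulated data: 6 motifs x 14 samples built from 2 underlying profiles plus noise.
set.seed(42)
W_true <- cbind(c(5, 1, 1, 0, 0, 1), c(0, 1, 0, 4, 3, 1))
H_true <- matrix(runif(2 * 14, min = 1, max = 10), 2, 14)
V <- W_true %*% H_true + matrix(runif(6 * 14, min = 0, max = 0.5), 6, 14)

# Residual for 1 to 4 signatures; look for the point where it stops dropping sharply.
rss <- sapply(1:4, function(r) nmf_rss(V, r))
round(rss, 2)
```

With real data the matrix returned by motifMatrix() would take the place of `V`. Because the residual tends to shrink whenever signatures are added, a flattening curve suggests diminishing returns but does not prove that a given number of mutational processes is present.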
I hope this made things a bit clearer.

Best wishes
Julian
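For the grouping question (point 1), a small base-R sketch may help to show what changes when the grouping variable is switched from the study to the sample. It does not call SomaticSignatures: the data frame `calls`, its column names, and the motif labels are invented for illustration, and `table()` merely imitates the shape of a motif-by-group count matrix.

```r
# Toy stand-in for variant calls, one row per mutation.
# Column names, motif labels, and counts are hypothetical.
set.seed(1)
motifs   <- c("CA A.A", "CG A.A", "CT A.A", "TA A.A", "TC A.A", "TG A.A")
patients <- paste0("patient", 1:14)
calls <- data.frame(
  motif  = factor(sample(motifs, 420, replace = TRUE), levels = motifs),
  sample = factor(rep(patients, each = 30), levels = patients),
  study  = factor("AML")
)

# Count motifs per group; unclass() turns the table into a plain integer matrix.
by_study  <- unclass(table(calls$motif, calls$study))
by_sample <- unclass(table(calls$motif, calls$sample))

dim(by_study)   # 6 rows (motifs), 1 column (the single study)
dim(by_sample)  # 6 rows (motifs), 14 columns (one per patient)
```

With one study there is a single column, so there is no variation across groups for a decomposition to work with, which fits the asker's guess about the error. Grouping by sample yields one column per patient.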

score 0 · Answer 2 · 2014-06-11

Hi Anand,

> Thank you for the elaborate reply. This clears up a lot of things. Regarding the number R, you are right. Also, I guess you need more genomes to decipher more signatures.

The number of genomes will give you more power to detect signatures, whereas the number of potentially present signatures (and therefore mutational processes) will depend on the biology of the samples.

> One more question though: in the plot generated by plotSignatures(), is the Y axis 'contribution' the percentage contribution?

The 'contribution' reflects the values of the matrix decomposition. They are proportional to each other, but do not reflect percentages. You can transform them to fractions that sum to one (multiply by 100 for percentages) by dividing the decomposed matrix of interest by its row sums. As an example:

    sigs_nmf$w = sigs_nmf$w / rowSums(sigs_nmf$w)

I may add a convenience function for this soon.

Best wishes
Julian

> Thanks again,
>
> Regards,
> -Anand
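To make the normalization step concrete, here is a minimal base-R sketch on a made-up matrix. The object `w` and its layout (motifs in rows, signatures in columns) are assumptions for illustration and may not match the layout of `sigs_nmf$w` in your version of the package. The one-liner `sigs_nmf$w / rowSums(sigs_nmf$w)` scales each row to sum to one; if instead each column should sum to one, the division has to run along the other margin.

```r
# Toy stand-in for a decomposed matrix: 4 motifs (rows) x 2 signatures (columns).
# Values and names are hypothetical.
w <- matrix(c(2, 1, 1, 4,
              6, 3, 0, 1),
            nrow = 4,
            dimnames = list(paste0("motif", 1:4), c("S1", "S2")))

# Row-wise, as in the one-liner above: each row sums to 1.
w_by_row <- w / rowSums(w)

# Column-wise alternative: each column sums to 1.
w_by_col <- sweep(w, 2, colSums(w), "/")

rowSums(w_by_row)        # all 1
colSums(w_by_col) * 100  # all 100 (percent)
```

Which margin is appropriate depends on what should add up to 100%: the contributions within one row, or the contributions within one column. Checking `dim()` and the dimnames of the matrix before normalizing avoids dividing along the wrong margin.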
