Hello all,
I am trying to use SomaticSignatures to analyze my exome sequencing data. I have a starting table of variants identified (data.frame), rather than metadata and a VRanges object like in the tutorial online (http://www.bioconductor.org/packages/release/bioc/vignettes/SomaticSignatures/inst/doc/SomaticSignatures-vignette.html).
I converted the table using GRanges
file <- read.csv("_all_variants_SomaticSignatures.csv", head=TRUE, stringsAsFactors=TRUE)
file2 <- makeGRangesFromDataFrame(file, keep.extra.columns=TRUE)
And followed the tutorial to make a VRanges object
file3 = unname(subset(file2, Variant_Type %in% "SNV")) file4 = keepSeqlevels(file3, hsAutosomes()) ex_vr = VRanges( seqnames = seqnames(file4), ranges = ranges(file4), ref = file4$Reference_Allele, alt = file4$Tumor_Seq_Allele2, sampleNames = file4$Patient_ID, seqinfo = seqinfo(file4), study = file4$study) ex_vr
VRanges object with 68095 ranges and 1 metadata column:
seqnames ranges strand ref alt
<Rle> <IRanges> <Rle> <character> <characterOrRle>
[1] chr1 [762353, 762353] + G C
[2] chr1 [876516, 876516] + C A
[3] chr1 [878314, 878314] + G C
[4] chr1 [878314, 878314] + G C
[5] chr1 [878314, 878314] + G C
... ... ... ... ... ...
[68091] chr22 [51171667, 51171667] + G A
[68092] chr22 [51171667, 51171667] + G A
[68093] chr22 [51176616, 51176616] + G A
[68094] chr22 [51178065, 51178065] + G A
[68095] chr22 [51219006, 51219006] + G A
totalDepth refDepth altDepth sampleNames
<integerOrRle> <integerOrRle> <integerOrRle> <factorOrRle>
[1] <NA> <NA> <NA> RTN102
[2] <NA> <NA> <NA> RTN126
[3] <NA> <NA> <NA> RTN122
[4] <NA> <NA> <NA> RTN127
[5] <NA> <NA> <NA> RTN045
... ... ... ... ...
[68091] <NA> <NA> <NA> RTN133
[68092] <NA> <NA> <NA> RTN045
[68093] <NA> <NA> <NA> RTN133
[68094] <NA> <NA> <NA> RTN039
[68095] <NA> <NA> <NA> RTN024
softFilterMatrix | study
<matrix> | <factor>
[1] | non-BRCA1-like
[2] | BRCA1-like
[3] | BRCA1-like
[4] | non-BRCA1-like
[5] | BRCA1-mutant
... ... ... ...
[68091] | BRCA1-like
[68092] | BRCA1-mutant
[68093] | BRCA1-like
[68094] | BRCA1-mutant
[68095] | BRCA1-like
-------
seqinfo: 22 sequences from an unspecified genome; no seqlengths
hardFilters: NULL
I think this looks good and the table looks good
sort(table(ex_vr$study), decreasing=TRUE)
BRCA1-like non-BRCA1-like BRCA1-mutant
45065 15235 7795
But when I try to do mutationContext I get an error
ex_motifs=mutationContext(ex_vr, BSgenome.Hsapiens.UCSC.hg19, unify=TRUE) Error in .Call2("new_XStringSet_from_CHARACTER", ans_class, ans_elementType, : key 44 (char ',') not in lookup table
Can anyone help me figure out what I've done wrong?
Thanks very much,
Tesa
> sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.10.2 (Yosemite) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods [9] base other attached packages: [1] ggdendro_0.1-15 ggplot2_1.0.1 [3] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.36.3 [5] rtracklayer_1.28.9 SomaticCancerAlterations_1.4.0 [7] SomaticSignatures_2.4.7 Biobase_2.28.0 [9] VariantAnnotation_1.14.13 Rsamtools_1.20.4 [11] Biostrings_2.36.4 XVector_0.8.0 [13] GenomicRanges_1.20.6 GenomeInfoDb_1.4.2 [15] IRanges_2.2.7 S4Vectors_0.6.4 [17] BiocGenerics_0.14.0 BiocInstaller_1.18.4 loaded via a namespace (and not attached): [1] Rcpp_0.12.0 biovizBase_1.16.0 lattice_0.20-33 [4] digest_0.6.8 foreach_1.4.2 gridBase_0.4-7 [7] plyr_1.8.3 futile.options_1.0.0 acepack_1.3-3.3 [10] RSQLite_1.0.0 zlibbioc_1.14.0 GenomicFeatures_1.20.3 [13] rpart_4.1-10 labeling_0.3 ggbio_1.16.1 [16] proto_0.3-10 splines_3.2.2 BiocParallel_1.2.20 [19] stringr_1.0.0 foreign_0.8-66 RCurl_1.95-4.7 [22] biomaRt_2.24.0 munsell_0.4.2 proxy_0.4-15 [25] pkgmaker_0.22 pcaMethods_1.58.0 nnet_7.3-11 [28] gridExtra_2.0.0 Hmisc_3.16-0 codetools_0.2-14 [31] XML_3.98-1.3 reshape_0.8.5 GenomicAlignments_1.4.1 [34] MASS_7.3-44 bitops_1.0-6 grid_3.2.2 [37] RBGL_1.44.0 exomeCopy_1.14.0 xtable_1.7-4 [40] GGally_0.5.0 gtable_0.1.2 registry_0.3 [43] DBI_0.3.1 magrittr_1.5 scales_0.3.0 [46] graph_1.46.0 stringi_0.5-5 reshape2_1.4.1 [49] doParallel_1.0.8 latticeExtra_0.6-26 futile.logger_1.4.1 [52] Formula_1.2-1 lambda.r_1.1.7 RColorBrewer_1.1-2 [55] NMF_0.20.6 iterators_1.0.7 tools_3.2.2 [58] dichromat_2.0-0 OrganismDbi_1.10.0 rngtools_1.2.4 [61] survival_2.38-3 AnnotationDbi_1.30.1 colorspace_1.2-6 [64] cluster_2.0.3
Hi Tesa, thank you for the detailed report. I have a few questions that can help us to narrow down the issue:
traceback()
directly after the error occurs? This can tell us in which of the underlying functions the issue occurs.unify=FALSE
and see if the issue still occurs?ex_vr[1:1000]
). If you do this with different subset, you can most likely narrow it down to a single problematic entry.In case you can share the data with me, I can also have a look at it.
Regarding (3), running
will step through each variant individually, and show the ones that cause the error.
Hi Julian, thanks for your quick reply. I won't give you traceback details because the problem was that there were still some INDELs in my file. I ran your code for stepping though each variant. The 'error' also occurred also with unify=FALSE. The INDELS were the offending lines. But now I have a new, unrelated problem. I still can't get the mutationContext function to work with my data. I can send you my data and you can have a look to see if you can figure out what might be wrong besides the INDELs.
Thanks again,
Tesa
I'm glad to hear that you could find the issue. Can you give me an example how a "typical offending" InDel looks like in your data? (Either paste it here or send it to me via e-mail). The
mutationContext
function performs checks for catching InDels, but it seems to miss some in your data. I would like to extend the internal checks, such that they also work seamlessly with your data.Hi Julian,
I'm not sure if it is because of the somewhat weird way SnpEff annotates indels. here are a few offenders. I have found an additional one that my code didn't find which was causing the last function problem. I can now use mutationContext when I input the right data ;)
Thanks for your help
Thanks for the data. I'm however not familiar with the output format of SnpEff. For e.g. the first variant, would ACATAGGCCA be the reference and ACA be the alternative allele?
Yes, that's my interpretation of it. So, the alternative allele is missing 'TAGGCCA'. But there doesn't seem to be any rhyme or reason how long the reference section is in the output. We've actually found it so difficult to interpret we re-map indels if needed with another tool.
As I mentioned before, the
mutationContext
function already has some checks that should detect cases like these. I don't fully understand why they did not work here, perhaps due to the representation of these specific indels in the VRanges. Could you please send the a VRanges with the offending entries that you have shown above via e-mail? That should help me diagnosing the issue, since SnpEff is used by quite a number of groups.