SomaticSignatures mutationContext (char ',') not in lookup table
2
2
Entering edit mode
@tesaseverson-8731
Last seen 9.2 years ago
Netherlands

Hello all,

I am trying to use SomaticSignatures to analyze my exome sequencing data. I have a starting table of variants identified (data.frame), rather than metadata and a VRanges object like in the tutorial online (http://www.bioconductor.org/packages/release/bioc/vignettes/SomaticSignatures/inst/doc/SomaticSignatures-vignette.html).

I converted the table using GRanges

file <- read.csv("_all_variants_SomaticSignatures.csv", head=TRUE, stringsAsFactors=TRUE)
file2 <- makeGRangesFromDataFrame(file, keep.extra.columns=TRUE)

And followed the tutorial to make a VRanges object

file3 = unname(subset(file2, Variant_Type %in% "SNV"))
file4 = keepSeqlevels(file3, hsAutosomes())
ex_vr = VRanges(
    seqnames = seqnames(file4),
    ranges = ranges(file4),
    ref = file4$Reference_Allele,
    alt = file4$Tumor_Seq_Allele2,
    sampleNames = file4$Patient_ID,
    seqinfo = seqinfo(file4),
    study = file4$study)


ex_vr

VRanges object with 68095 ranges and 1 metadata column:
          seqnames               ranges strand         ref              alt
             <Rle>            <IRanges>  <Rle> <character> <characterOrRle>
      [1]     chr1     [762353, 762353]      +           G                C
      [2]     chr1     [876516, 876516]      +           C                A
      [3]     chr1     [878314, 878314]      +           G                C
      [4]     chr1     [878314, 878314]      +           G                C
      [5]     chr1     [878314, 878314]      +           G                C
      ...      ...                  ...    ...         ...              ...
  [68091]    chr22 [51171667, 51171667]      +           G                A
  [68092]    chr22 [51171667, 51171667]      +           G                A
  [68093]    chr22 [51176616, 51176616]      +           G                A
  [68094]    chr22 [51178065, 51178065]      +           G                A
  [68095]    chr22 [51219006, 51219006]      +           G                A
              totalDepth       refDepth       altDepth   sampleNames
          <integerOrRle> <integerOrRle> <integerOrRle> <factorOrRle>
      [1]           <NA>           <NA>           <NA>        RTN102
      [2]           <NA>           <NA>           <NA>        RTN126
      [3]           <NA>           <NA>           <NA>        RTN122
      [4]           <NA>           <NA>           <NA>        RTN127
      [5]           <NA>           <NA>           <NA>        RTN045
      ...            ...            ...            ...           ...
  [68091]           <NA>           <NA>           <NA>        RTN133
  [68092]           <NA>           <NA>           <NA>        RTN045
  [68093]           <NA>           <NA>           <NA>        RTN133
  [68094]           <NA>           <NA>           <NA>        RTN039
  [68095]           <NA>           <NA>           <NA>        RTN024
          softFilterMatrix   |          study
                  <matrix>   |       <factor>
      [1]                    | non-BRCA1-like
      [2]                    |     BRCA1-like
      [3]                    |     BRCA1-like
      [4]                    | non-BRCA1-like
      [5]                    |   BRCA1-mutant
      ...              ... ...            ...
  [68091]                    |     BRCA1-like
  [68092]                    |   BRCA1-mutant
  [68093]                    |     BRCA1-like
  [68094]                    |   BRCA1-mutant
  [68095]                    |     BRCA1-like
  -------
  seqinfo: 22 sequences from an unspecified genome; no seqlengths
  hardFilters: NULL

I think this looks good and the table looks good

sort(table(ex_vr$study), decreasing=TRUE)

 BRCA1-like non-BRCA1-like   BRCA1-mutant
         45065          15235           7795

But when I try to do mutationContext I get an error

ex_motifs=mutationContext(ex_vr, BSgenome.Hsapiens.UCSC.hg19, unify=TRUE)
Error in .Call2("new_XStringSet_from_CHARACTER", ans_class, ans_elementType,  :
  key 44 (char ',') not in lookup table

 

Can anyone help me figure out what I've done wrong?

Thanks very much,

Tesa

 

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.2 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] ggdendro_0.1-15                   ggplot2_1.0.1                    
 [3] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.36.3                  
 [5] rtracklayer_1.28.9                SomaticCancerAlterations_1.4.0   
 [7] SomaticSignatures_2.4.7           Biobase_2.28.0                   
 [9] VariantAnnotation_1.14.13         Rsamtools_1.20.4                 
[11] Biostrings_2.36.4                 XVector_0.8.0                    
[13] GenomicRanges_1.20.6              GenomeInfoDb_1.4.2               
[15] IRanges_2.2.7                     S4Vectors_0.6.4                  
[17] BiocGenerics_0.14.0               BiocInstaller_1.18.4             

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0             biovizBase_1.16.0       lattice_0.20-33        
 [4] digest_0.6.8            foreach_1.4.2           gridBase_0.4-7         
 [7] plyr_1.8.3              futile.options_1.0.0    acepack_1.3-3.3        
[10] RSQLite_1.0.0           zlibbioc_1.14.0         GenomicFeatures_1.20.3
[13] rpart_4.1-10            labeling_0.3            ggbio_1.16.1           
[16] proto_0.3-10            splines_3.2.2           BiocParallel_1.2.20    
[19] stringr_1.0.0           foreign_0.8-66          RCurl_1.95-4.7         
[22] biomaRt_2.24.0          munsell_0.4.2           proxy_0.4-15           
[25] pkgmaker_0.22           pcaMethods_1.58.0       nnet_7.3-11            
[28] gridExtra_2.0.0         Hmisc_3.16-0            codetools_0.2-14       
[31] XML_3.98-1.3            reshape_0.8.5           GenomicAlignments_1.4.1
[34] MASS_7.3-44             bitops_1.0-6            grid_3.2.2             
[37] RBGL_1.44.0             exomeCopy_1.14.0        xtable_1.7-4           
[40] GGally_0.5.0            gtable_0.1.2            registry_0.3           
[43] DBI_0.3.1               magrittr_1.5            scales_0.3.0           
[46] graph_1.46.0            stringi_0.5-5           reshape2_1.4.1         
[49] doParallel_1.0.8        latticeExtra_0.6-26     futile.logger_1.4.1    
[52] Formula_1.2-1           lambda.r_1.1.7          RColorBrewer_1.1-2     
[55] NMF_0.20.6              iterators_1.0.7         tools_3.2.2            
[58] dichromat_2.0-0         OrganismDbi_1.10.0      rngtools_1.2.4         
[61] survival_2.38-3         AnnotationDbi_1.30.1    colorspace_1.2-6       
[64] cluster_2.0.3

 

SomaticSignatures mutationContext VRanges • 2.3k views
ADD COMMENT
1
Entering edit mode

Hi Tesa, thank you for the detailed report. I have a few questions that can help us to narrow down the issue:

  1. Can you please rerun your example and run traceback() directly after the error occurs? This can tell us in which of the underlying functions the issue occurs.
  2. Can you rerun the command with unify=FALSE and see if the issue still occurs?
  3. You can try to rerun the command with a subset of the VRanges (e.g. ex_vr[1:1000]). If you do this with different subset, you can most likely narrow it down to a single problematic entry.

In case you can share the data with me, I can also have a look at it.

Regarding (3), running

for(i in seq_along(ex_vr)) {
  v = ex_vr[i]
  tryCatch(mutationContext(v, BSgenome.Hsapiens.UCSC.hg19, unify=TRUE), error = function(e) print(v))
}

will step through each variant individually, and show the ones that cause the error.

ADD REPLY
0
Entering edit mode

Hi Julian, thanks for your quick reply. I won't give you traceback details because the problem was that there were still some INDELs in my file. I ran your code for stepping though each variant. The 'error' also occurred also with unify=FALSE. The INDELS were the offending lines. But now I have a new, unrelated problem. I still can't get the mutationContext function to work with my data. I can send you my data and you can have a look to see if you can figure out what might be wrong besides the INDELs.

Thanks again,

Tesa

 

ADD REPLY
0
Entering edit mode

I'm glad to hear that you could find the issue. Can you give me an example how a "typical offending" InDel looks like in your data? (Either paste it here or send it to me via e-mail). The mutationContext function performs checks for catching InDels, but it seems to miss some in your data. I would like to extend the internal checks, such that they also work seamlessly with your data.

ADD REPLY
0
Entering edit mode

Hi Julian,

I'm not sure if it is because of the somewhat weird way SnpEff annotates indels. here are a few offenders. I have found an additional one that my code didn't find which was causing the last function problem. I can now use mutationContext when I input the right data ;)

1 211465056 211465056 1 * RCOR3 26155 sequence_feature INDEL ACATAGGCCA ACA ACA NA Untested Somatic RTN038 RTN038 304 BRCA1-like
3 157820753 157820753 1 * SHOX2 26155 5_prime_UTR_variant INDEL ACAGCA ACA ACA NA Untested Somatic RTN039 RTN039 139 BRCA1-mutant
4 185580420 185580420 1 * PRIMPOL 26155 intron_variant INDEL ACATTCA ACA ACA NA Untested Somatic RTN097 RTN097 305 BRCA1-like
4 185580420 185580420 1 * PRIMPOL 26155 intron_variant INDEL ACATTCA ACA ACA NA Untested Somatic RTN102 RTN102 305 non-BRCA1-like
4 185580420 185580420 1 * PRIMPOL 26155 intron_variant INDEL ACATTCA ACA ACA NA Untested Somatic RTN126 RTN126 305 BRCA1-like
7 12428493 12428493 1 * VWDE 26155 3_prime_UTR_variant INDEL ACACTCA ACA ACA NA Untested Somatic RTN133 RTN133 1092 BRCA1-like
14 37717240 37717240 1 * MIPOL1 26155 intron_variant INDEL ACATGCACACAGAGACCCA ACA ACA NA Untested Somatic RTN039 RTN039 860 BRCA1-mutant
17 7809187 7809187 1 * CHD3 26155 frameshift_variant INDEL ACACCCGTCA ACA ACA NA Untested Somatic RTN051 RTN051 624 BRCA1-like
14 71202835 71202835 1 * MAP3K9 26155 intron_variant INDEL ACACCAC ACAC ACAC NA Untested Somatic RTN005 RTN005 170 BRCA1-like
14 71202835 71202835 1 * MAP3K9 26155 intron_variant INDEL ACACCAC ACAC ACAC NA Untested Somatic RTN028 RTN028 170 non-BRCA1-like
14 71202835 71202835 1 * MAP3K9 26155 intron_variant INDEL ACACCAC ACAC ACAC NA Untested Somatic RTN039 RTN039 170 BRCA1-mutant

Thanks for your help

ADD REPLY
0
Entering edit mode

Thanks for the data. I'm however not familiar with the output format of SnpEff. For e.g. the first variant, would ACATAGGCCA be the reference and ACA be the alternative allele?

ADD REPLY
0
Entering edit mode

Yes, that's my interpretation of it. So, the alternative allele is missing 'TAGGCCA'. But there doesn't seem to be any rhyme or reason how long the reference section is in the output. We've actually found it so difficult to interpret we re-map indels if needed with another tool.

ADD REPLY
0
Entering edit mode

As I mentioned before, the mutationContext function already has some checks that should detect cases like these. I don't fully understand why they did not work here, perhaps due to the representation of these specific indels in the VRanges. Could you please send the a VRanges with the offending entries that you have shown above via e-mail? That should help me diagnosing the issue, since SnpEff is used by quite a number of groups.

ADD REPLY
0
Entering edit mode
@tesaseverson-8731
Last seen 9.2 years ago
Netherlands

Hi Julian, no problem. I don't have access to the computer I did the work on until later this afternoon, but I'll email it to you then.

Cheers,

Tesa

ADD COMMENT
0
Entering edit mode
Julian Gehring ★ 1.3k
@julian-gehring-5818
Last seen 5.5 years ago

From the data that you have sent me, the issue appears to be caused by alternative alleles that contain "," characters: 

head(alt(vr)[grep(",", alt(vr))])
[1] "TGGGGGGGGG,TGGGGGGGGGGGGGG"                    
[2] "TTCCTCCTCCTCCTCCTCCTCCTC,TTCCTCCTCCTCCTCCTCCTC"
[3] "TTCCTCCTCCTCCTCCTCCTCCTC,TTCCTCCTCCTCCTCCTCCTC"
[4] "TTCCTCCTCCTCCTCCTCCTCCTC,TTCCTCCTCCTCCTCCTCCTC"
[5] "GAA,GA"                                        
[6] "GAA,GA"

You can remove those simply with

vrx = vr[!grepl(",", alt(vr))]

and the analysis works fine. I suspect that different alternative alleles are encoded this way in the "csv" file that you are starting with; however, it does not work with the concept of VRanges. As I mentioned before, I will change the checks of the respective functions, such that you will get a more informative error message in the future.

Update: SomaticSignatures 2.4.8 implements more extensives checks and would provide a more informative error message here. It should become available once the build is completed, which will happen over the weekend.

ADD COMMENT

Login before adding your answer.

Traffic: 586 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6