motifBreakR error with custom BSgenome
Entering edit mode
Chris Seidel ▴ 80
Last seen 4 days ago
United States

I'm wondering if it's no longer possible to use motifBreakR with custom genomes.

I'm unable to read in SNPs from a VCF or BED file for analysis with motifBreakR only if I'm using a custom forged BSgenome object. In 2018 all my code was successful, but now I'm getting errors, and I can't figure out what changed. Using the snps.from.file() function, I get a unique error for each file type. Reading in a VCF files gives an error about GRanges. Reading in a BED file gives an error with seqlevelsStyle(), and prompts me to check genomeStyles(). Of course, my species (A mexicanus) isn't listed in the genomeStyles() output, but I'm using a custom BSgenome object, and I don't know how to modify or add my species to the content of genomeStyles(). (and this wasn't necessary previously).

To troubleshoot, I created a simplified subset of only a few SNPs from a single chromosome, and placed them into a VCF file, or a BED file. I also forged a new BSgenome object with just this single chromosome (from Cavefish). And I get the errors demonstrated below. I also tried essentially the same thing with yeast (Scerevisiae), which is listed within genomeStyles(), and I get success with either VCF or BED files, even if I use a custom forged yeast genome.

Not quite sure how to proceed. Success with a custom built yeast genome but not cavefish genome, and genomeStyles() contains yeast but not others has me confused. I don't know what to try next. How does one use motifBreakR with a custom genome? (I knew the answer in 2018, but not 2021).


# VCF file with 23 SNPs from a single chromosome
vp <- "nnra_snp_NC_035897_20_50k.vcf.gz"
snps.vcf <- snps.from.file(file=vp, search.genome=Amexicanus, format='vcf')

This fails with the following error:

Error in getListElement(x, i, ...) : 
  GRanges objects don't support [[, as.list(), lapply(), or unlist() at the moment
In addition: Warning message:
In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames

So I try loading a simple BED file with a few SNPs

# BED file with 10 SNPs from single chromosome
vp <- "head_NC_035897.bed"
snps.bed <- snps.from.file(file=vp, search.genome=Amexicanus, format='bed')

This fails with:

Error in seqlevelsStyle(seqlevels(x)) : 
  The style does not have a compatible entry for the species supported by Seqname.
  Please see genomeStyles() for supported species/style

This error seems to imply that motifBreakR can not be used with custom genomes, but that was not the case in the past. I have no idea how to add my chromosome styles to the listing of genomeStyles().

Using a canonical genome (yeast, which is included in the genomeStyles() list, I have success with VCF or BED files.

# a few yeast SNPs
vp <- "snps_final.ann.vcf.gz"
snps.vcf <- snps.from.file(file=vp, search.genome=Scerevisiae, format='vcf')

Success with a warning:

Warning message:
In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames

Try BED file:

# yeast SNPs in BED format
vp <- "snps_final.ann.motifbreakR2.bed"
snps.bed <- snps.from.file(file=vp, search.genome=Scerevisiae, format='bed')

Success. Both yeast examples also work if I use a custom forged yeast genome

> sessionInfo( )
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /n/apps/CentOS7/install/r-4.0.0/lib64/R/lib/
LAPACK: /n/apps/CentOS7/install/r-4.0.0/lib64/R/lib/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[10] base     

other attached packages:
 [1] BSgenome.Scerevisiae.UCSC.sacCer3_1.4.0  BSgenome.Amexicanus.UCSC.astMexNC_1.4.2
 [3] motifbreakR_2.2.0                        MotifDb_1.30.0                          
 [5] BSgenome_1.56.0                          Biostrings_2.56.0                       
 [7] XVector_0.28.0                           rtracklayer_1.48.0                      
 [9] GenomicRanges_1.40.0                     GenomeInfoDb_1.24.2                     
[11] IRanges_2.22.2                           S4Vectors_0.26.1                        
[13] BiocGenerics_0.34.0                      pheatmap_1.0.12                         
[15] knitr_1.31                               RColorBrewer_1.1-2                      

loaded via a namespace (and not attached):
  [1] colorspace_2.0-0            grImport2_0.2-0             ellipsis_0.3.1             
  [4] biovizBase_1.36.0           htmlTable_2.1.0             base64enc_0.1-3            
  [7] dichromat_2.0-0             rstudioapi_0.13             rGADEM_2.36.0              
 [10] bit64_4.0.5                 AnnotationDbi_1.50.3        fansi_0.4.2                
 [13] xml2_1.3.2                  splines_4.0.0               motifStack_1.32.1          
 [16] cachem_1.0.4                ade4_1.7-16                 Formula_1.2-4              
 [19] splitstackshape_1.4.8       Rsamtools_2.4.0             seqLogo_1.54.3             
 [22] cluster_2.1.1               dbplyr_2.1.0                png_0.1-7                  
 [25] compiler_4.0.0              httr_1.4.2                  backports_1.2.1            
 [28] assertthat_0.2.1            Matrix_1.3-2                fastmap_1.1.0              
 [31] lazyeval_0.2.2              htmltools_0.5.1.1           prettyunits_1.1.1          
 [34] tools_4.0.0                 TFMPvalue_0.0.8             gtable_0.3.0               
 [37] glue_1.4.2                  GenomeInfoDbData_1.2.3      dplyr_1.0.5                
 [40] rappdirs_0.3.3              Rcpp_1.0.6                  Biobase_2.48.0             
 [43] vctrs_0.3.6                 debugme_1.1.0               xfun_0.22                  
 [46] stringr_1.4.0               lifecycle_1.0.0             ensembldb_2.12.1           
 [49] XML_3.99-0.6                zlibbioc_1.34.0             MASS_7.3-53.1              
 [52] scales_1.1.1                VariantAnnotation_1.34.0    ProtGenerics_1.20.0        
 [55] hms_1.0.0                   SummarizedExperiment_1.18.2 AnnotationFilter_1.12.0    
 [58] yaml_2.2.1                  curl_4.3                    memoise_2.0.0              
 [61] gridExtra_2.3               ggplot2_3.3.3               MotIV_1.43.0               
 [64] biomaRt_2.44.4              rpart_4.1-15                latticeExtra_0.6-29        
 [67] stringi_1.5.3               RSQLite_2.2.4               checkmate_2.0.0            
 [70] GenomicFeatures_1.40.1      BiocParallel_1.22.0         rlang_0.4.10               
 [73] pkgconfig_2.0.3             matrixStats_0.58.0          bitops_1.0-6               
 [76] evaluate_0.14               lattice_0.20-41             purrr_0.3.4                
 [79] GenomicAlignments_1.24.0    htmlwidgets_1.5.3           bit_4.0.4                  
 [82] tidyselect_1.1.0            magrittr_2.0.1              R6_2.5.0                   
 [85] generics_0.1.0              Hmisc_4.5-0                 DelayedArray_0.14.1        
 [88] DBI_1.1.1                   pillar_1.5.1                foreign_0.8-81             
 [91] survival_3.2-10             RCurl_1.98-1.3              nnet_7.3-15                
 [94] tibble_3.1.0                crayon_1.4.1                utf8_1.2.1                 
 [97] BiocFileCache_1.12.1        rmarkdown_2.7               jpeg_0.1-8.1               
[100] progress_1.2.2              data.table_1.14.0           blob_1.2.1                 
[103] digest_0.6.27               openssl_1.4.3               munsell_0.5.0              
[106] Gviz_1.32.0                 askpass_1.1

Update April 20: Using R 4.0.3 and BSgenome 1.58.0, if I rebuild the astMex custom BSgenome and then attempt to load a BED file of SNPs as above, there's a new error:

Error in mapSeqlevels(sequence, seqlevelsStyle(search.genome)) : 
  supplied seqname style "RefSeq" is not supported

I think this is from GenomeInfoDb because it is detecting RefSeq style chr names, but I have no idea why this matters.

Update April 27: A custom genome object in which all chromosome names begin with "chr" followed by an integer functions ok. So for now I can simply translate all my chromosome names to a new space for this part of the analysis, and translate them back afterwards. Suboptimal strategy, but works in a pinch.

GenomeInfoDb motifbreakR • 74 views

Login before adding your answer.

Traffic: 335 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6