motifBreakR error with custom BSgenome
0
0
Entering edit mode
Chris Seidel ▴ 80
@chris-seidel-5840
Last seen 4 days ago
United States

I'm wondering if it's no longer possible to use motifBreakR with custom genomes.

I'm unable to read in SNPs from a VCF or BED file for analysis with motifBreakR only if I'm using a custom forged BSgenome object. In 2018 all my code was successful, but now I'm getting errors, and I can't figure out what changed. Using the snps.from.file() function, I get a unique error for each file type. Reading in a VCF files gives an error about GRanges. Reading in a BED file gives an error with seqlevelsStyle(), and prompts me to check genomeStyles(). Of course, my species (A mexicanus) isn't listed in the genomeStyles() output, but I'm using a custom BSgenome object, and I don't know how to modify or add my species to the content of genomeStyles(). (and this wasn't necessary previously).

To troubleshoot, I created a simplified subset of only a few SNPs from a single chromosome, and placed them into a VCF file, or a BED file. I also forged a new BSgenome object with just this single chromosome (from Cavefish). And I get the errors demonstrated below. I also tried essentially the same thing with yeast (Scerevisiae), which is listed within genomeStyles(), and I get success with either VCF or BED files, even if I use a custom forged yeast genome.

Not quite sure how to proceed. Success with a custom built yeast genome but not cavefish genome, and genomeStyles() contains yeast but not others has me confused. I don't know what to try next. How does one use motifBreakR with a custom genome? (I knew the answer in 2018, but not 2021).

library(motifbreakR)
library(BSgenome.Amexicanus.UCSC.astMexNC)

# VCF file with 23 SNPs from a single chromosome
vp <- "nnra_snp_NC_035897_20_50k.vcf.gz"
snps.vcf <- snps.from.file(file=vp, search.genome=Amexicanus, format='vcf')

This fails with the following error:

Error in getListElement(x, i, ...) : 
  GRanges objects don't support [[, as.list(), lapply(), or unlist() at the moment
In addition: Warning message:
In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames

So I try loading a simple BED file with a few SNPs

# BED file with 10 SNPs from single chromosome
vp <- "head_NC_035897.bed"
snps.bed <- snps.from.file(file=vp, search.genome=Amexicanus, format='bed')

This fails with:

Error in seqlevelsStyle(seqlevels(x)) : 
  The style does not have a compatible entry for the species supported by Seqname.
  Please see genomeStyles() for supported species/style

This error seems to imply that motifBreakR can not be used with custom genomes, but that was not the case in the past. I have no idea how to add my chromosome styles to the listing of genomeStyles().

Using a canonical genome (yeast, which is included in the genomeStyles() list, I have success with VCF or BED files.

library(BSgenome.Scerevisiae.UCSC.sacCer3)
# a few yeast SNPs
vp <- "snps_final.ann.vcf.gz"
snps.vcf <- snps.from.file(file=vp, search.genome=Scerevisiae, format='vcf')

Success with a warning:

Warning message:
In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames

Try BED file:

# yeast SNPs in BED format
vp <- "snps_final.ann.motifbreakR2.bed"
snps.bed <- snps.from.file(file=vp, search.genome=Scerevisiae, format='bed')

Success. Both yeast examples also work if I use a custom forged yeast genome

> sessionInfo( )
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /n/apps/CentOS7/install/r-4.0.0/lib64/R/lib/libRblas.so
LAPACK: /n/apps/CentOS7/install/r-4.0.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[10] base     

other attached packages:
 [1] BSgenome.Scerevisiae.UCSC.sacCer3_1.4.0  BSgenome.Amexicanus.UCSC.astMexNC_1.4.2
 [3] motifbreakR_2.2.0                        MotifDb_1.30.0                          
 [5] BSgenome_1.56.0                          Biostrings_2.56.0                       
 [7] XVector_0.28.0                           rtracklayer_1.48.0                      
 [9] GenomicRanges_1.40.0                     GenomeInfoDb_1.24.2                     
[11] IRanges_2.22.2                           S4Vectors_0.26.1                        
[13] BiocGenerics_0.34.0                      pheatmap_1.0.12                         
[15] knitr_1.31                               RColorBrewer_1.1-2                      

loaded via a namespace (and not attached):
  [1] colorspace_2.0-0            grImport2_0.2-0             ellipsis_0.3.1             
  [4] biovizBase_1.36.0           htmlTable_2.1.0             base64enc_0.1-3            
  [7] dichromat_2.0-0             rstudioapi_0.13             rGADEM_2.36.0              
 [10] bit64_4.0.5                 AnnotationDbi_1.50.3        fansi_0.4.2                
 [13] xml2_1.3.2                  splines_4.0.0               motifStack_1.32.1          
 [16] cachem_1.0.4                ade4_1.7-16                 Formula_1.2-4              
 [19] splitstackshape_1.4.8       Rsamtools_2.4.0             seqLogo_1.54.3             
 [22] cluster_2.1.1               dbplyr_2.1.0                png_0.1-7                  
 [25] compiler_4.0.0              httr_1.4.2                  backports_1.2.1            
 [28] assertthat_0.2.1            Matrix_1.3-2                fastmap_1.1.0              
 [31] lazyeval_0.2.2              htmltools_0.5.1.1           prettyunits_1.1.1          
 [34] tools_4.0.0                 TFMPvalue_0.0.8             gtable_0.3.0               
 [37] glue_1.4.2                  GenomeInfoDbData_1.2.3      dplyr_1.0.5                
 [40] rappdirs_0.3.3              Rcpp_1.0.6                  Biobase_2.48.0             
 [43] vctrs_0.3.6                 debugme_1.1.0               xfun_0.22                  
 [46] stringr_1.4.0               lifecycle_1.0.0             ensembldb_2.12.1           
 [49] XML_3.99-0.6                zlibbioc_1.34.0             MASS_7.3-53.1              
 [52] scales_1.1.1                VariantAnnotation_1.34.0    ProtGenerics_1.20.0        
 [55] hms_1.0.0                   SummarizedExperiment_1.18.2 AnnotationFilter_1.12.0    
 [58] yaml_2.2.1                  curl_4.3                    memoise_2.0.0              
 [61] gridExtra_2.3               ggplot2_3.3.3               MotIV_1.43.0               
 [64] biomaRt_2.44.4              rpart_4.1-15                latticeExtra_0.6-29        
 [67] stringi_1.5.3               RSQLite_2.2.4               checkmate_2.0.0            
 [70] GenomicFeatures_1.40.1      BiocParallel_1.22.0         rlang_0.4.10               
 [73] pkgconfig_2.0.3             matrixStats_0.58.0          bitops_1.0-6               
 [76] evaluate_0.14               lattice_0.20-41             purrr_0.3.4                
 [79] GenomicAlignments_1.24.0    htmlwidgets_1.5.3           bit_4.0.4                  
 [82] tidyselect_1.1.0            magrittr_2.0.1              R6_2.5.0                   
 [85] generics_0.1.0              Hmisc_4.5-0                 DelayedArray_0.14.1        
 [88] DBI_1.1.1                   pillar_1.5.1                foreign_0.8-81             
 [91] survival_3.2-10             RCurl_1.98-1.3              nnet_7.3-15                
 [94] tibble_3.1.0                crayon_1.4.1                utf8_1.2.1                 
 [97] BiocFileCache_1.12.1        rmarkdown_2.7               jpeg_0.1-8.1               
[100] progress_1.2.2              data.table_1.14.0           blob_1.2.1                 
[103] digest_0.6.27               openssl_1.4.3               munsell_0.5.0              
[106] Gviz_1.32.0                 askpass_1.1

Update April 20: Using R 4.0.3 and BSgenome 1.58.0, if I rebuild the astMex custom BSgenome and then attempt to load a BED file of SNPs as above, there's a new error:

Error in mapSeqlevels(sequence, seqlevelsStyle(search.genome)) : 
  supplied seqname style "RefSeq" is not supported

I think this is from GenomeInfoDb because it is detecting RefSeq style chr names, but I have no idea why this matters.

Update April 27: A custom genome object in which all chromosome names begin with "chr" followed by an integer functions ok. So for now I can simply translate all my chromosome names to a new space for this part of the analysis, and translate them back afterwards. Suboptimal strategy, but works in a pinch.

GenomeInfoDb motifbreakR • 74 views
ADD COMMENT

Login before adding your answer.

Traffic: 335 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6