I'm wondering if it's no longer possible to use motifBreakR with custom genomes.
I'm unable to read in SNPs from a VCF or BED file for analysis with motifBreakR only if I'm using a custom forged BSgenome object. In 2018 all my code was successful, but now I'm getting errors, and I can't figure out what changed. Using the snps.from.file() function, I get a unique error for each file type. Reading in a VCF files gives an error about GRanges. Reading in a BED file gives an error with seqlevelsStyle(), and prompts me to check genomeStyles(). Of course, my species (A mexicanus) isn't listed in the genomeStyles() output, but I'm using a custom BSgenome object, and I don't know how to modify or add my species to the content of genomeStyles(). (and this wasn't necessary previously).
To troubleshoot, I created a simplified subset of only a few SNPs from a single chromosome, and placed them into a VCF file, or a BED file. I also forged a new BSgenome object with just this single chromosome (from Cavefish). And I get the errors demonstrated below. I also tried essentially the same thing with yeast (Scerevisiae), which is listed within genomeStyles(), and I get success with either VCF or BED files, even if I use a custom forged yeast genome.
Not quite sure how to proceed. Success with a custom built yeast genome but not cavefish genome, and genomeStyles() contains yeast but not others has me confused. I don't know what to try next. How does one use motifBreakR with a custom genome? (I knew the answer in 2018, but not 2021).
library(motifbreakR)
library(BSgenome.Amexicanus.UCSC.astMexNC)
# VCF file with 23 SNPs from a single chromosome
vp <- "nnra_snp_NC_035897_20_50k.vcf.gz"
snps.vcf <- snps.from.file(file=vp, search.genome=Amexicanus, format='vcf')
This fails with the following error:
Error in getListElement(x, i, ...) :
GRanges objects don't support [[, as.list(), lapply(), or unlist() at the moment
In addition: Warning message:
In .bcfHeaderAsSimpleList(header) :
duplicate keys in header will be forced to unique rownames
So I try loading a simple BED file with a few SNPs
# BED file with 10 SNPs from single chromosome
vp <- "head_NC_035897.bed"
snps.bed <- snps.from.file(file=vp, search.genome=Amexicanus, format='bed')
This fails with:
Error in seqlevelsStyle(seqlevels(x)) :
The style does not have a compatible entry for the species supported by Seqname.
Please see genomeStyles() for supported species/style
This error seems to imply that motifBreakR can not be used with custom genomes, but that was not the case in the past. I have no idea how to add my chromosome styles to the listing of genomeStyles().
Using a canonical genome (yeast, which is included in the genomeStyles() list, I have success with VCF or BED files.
library(BSgenome.Scerevisiae.UCSC.sacCer3)
# a few yeast SNPs
vp <- "snps_final.ann.vcf.gz"
snps.vcf <- snps.from.file(file=vp, search.genome=Scerevisiae, format='vcf')
Success with a warning:
Warning message:
In .bcfHeaderAsSimpleList(header) :
duplicate keys in header will be forced to unique rownames
Try BED file:
# yeast SNPs in BED format
vp <- "snps_final.ann.motifbreakR2.bed"
snps.bed <- snps.from.file(file=vp, search.genome=Scerevisiae, format='bed')
Success. Both yeast examples also work if I use a custom forged yeast genome
> sessionInfo( )
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /n/apps/CentOS7/install/r-4.0.0/lib64/R/lib/libRblas.so
LAPACK: /n/apps/CentOS7/install/r-4.0.0/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils datasets methods
[10] base
other attached packages:
[1] BSgenome.Scerevisiae.UCSC.sacCer3_1.4.0 BSgenome.Amexicanus.UCSC.astMexNC_1.4.2
[3] motifbreakR_2.2.0 MotifDb_1.30.0
[5] BSgenome_1.56.0 Biostrings_2.56.0
[7] XVector_0.28.0 rtracklayer_1.48.0
[9] GenomicRanges_1.40.0 GenomeInfoDb_1.24.2
[11] IRanges_2.22.2 S4Vectors_0.26.1
[13] BiocGenerics_0.34.0 pheatmap_1.0.12
[15] knitr_1.31 RColorBrewer_1.1-2
loaded via a namespace (and not attached):
[1] colorspace_2.0-0 grImport2_0.2-0 ellipsis_0.3.1
[4] biovizBase_1.36.0 htmlTable_2.1.0 base64enc_0.1-3
[7] dichromat_2.0-0 rstudioapi_0.13 rGADEM_2.36.0
[10] bit64_4.0.5 AnnotationDbi_1.50.3 fansi_0.4.2
[13] xml2_1.3.2 splines_4.0.0 motifStack_1.32.1
[16] cachem_1.0.4 ade4_1.7-16 Formula_1.2-4
[19] splitstackshape_1.4.8 Rsamtools_2.4.0 seqLogo_1.54.3
[22] cluster_2.1.1 dbplyr_2.1.0 png_0.1-7
[25] compiler_4.0.0 httr_1.4.2 backports_1.2.1
[28] assertthat_0.2.1 Matrix_1.3-2 fastmap_1.1.0
[31] lazyeval_0.2.2 htmltools_0.5.1.1 prettyunits_1.1.1
[34] tools_4.0.0 TFMPvalue_0.0.8 gtable_0.3.0
[37] glue_1.4.2 GenomeInfoDbData_1.2.3 dplyr_1.0.5
[40] rappdirs_0.3.3 Rcpp_1.0.6 Biobase_2.48.0
[43] vctrs_0.3.6 debugme_1.1.0 xfun_0.22
[46] stringr_1.4.0 lifecycle_1.0.0 ensembldb_2.12.1
[49] XML_3.99-0.6 zlibbioc_1.34.0 MASS_7.3-53.1
[52] scales_1.1.1 VariantAnnotation_1.34.0 ProtGenerics_1.20.0
[55] hms_1.0.0 SummarizedExperiment_1.18.2 AnnotationFilter_1.12.0
[58] yaml_2.2.1 curl_4.3 memoise_2.0.0
[61] gridExtra_2.3 ggplot2_3.3.3 MotIV_1.43.0
[64] biomaRt_2.44.4 rpart_4.1-15 latticeExtra_0.6-29
[67] stringi_1.5.3 RSQLite_2.2.4 checkmate_2.0.0
[70] GenomicFeatures_1.40.1 BiocParallel_1.22.0 rlang_0.4.10
[73] pkgconfig_2.0.3 matrixStats_0.58.0 bitops_1.0-6
[76] evaluate_0.14 lattice_0.20-41 purrr_0.3.4
[79] GenomicAlignments_1.24.0 htmlwidgets_1.5.3 bit_4.0.4
[82] tidyselect_1.1.0 magrittr_2.0.1 R6_2.5.0
[85] generics_0.1.0 Hmisc_4.5-0 DelayedArray_0.14.1
[88] DBI_1.1.1 pillar_1.5.1 foreign_0.8-81
[91] survival_3.2-10 RCurl_1.98-1.3 nnet_7.3-15
[94] tibble_3.1.0 crayon_1.4.1 utf8_1.2.1
[97] BiocFileCache_1.12.1 rmarkdown_2.7 jpeg_0.1-8.1
[100] progress_1.2.2 data.table_1.14.0 blob_1.2.1
[103] digest_0.6.27 openssl_1.4.3 munsell_0.5.0
[106] Gviz_1.32.0 askpass_1.1
Update April 20: Using R 4.0.3 and BSgenome 1.58.0, if I rebuild the astMex custom BSgenome and then attempt to load a BED file of SNPs as above, there's a new error:
Error in mapSeqlevels(sequence, seqlevelsStyle(search.genome)) :
supplied seqname style "RefSeq" is not supported
I think this is from GenomeInfoDb because it is detecting RefSeq style chr names, but I have no idea why this matters.
Update April 27: A custom genome object in which all chromosome names begin with "chr" followed by an integer functions ok. So for now I can simply translate all my chromosome names to a new space for this part of the analysis, and translate them back afterwards. Suboptimal strategy, but works in a pinch.