I'm wondering if it's no longer possible to use motifBreakR with custom genomes.
I'm unable to read in SNPs from a VCF or BED file for analysis with motifBreakR only if I'm using a custom forged BSgenome object. In 2018 all my code was successful, but now I'm getting errors, and I can't figure out what changed. Using the snps.from.file() function, I get a unique error for each file type. Reading in a VCF files gives an error about GRanges. Reading in a BED file gives an error with seqlevelsStyle(), and prompts me to check genomeStyles(). Of course, my species (A mexicanus) isn't listed in the genomeStyles() output, but I'm using a custom BSgenome object, and I don't know how to modify or add my species to the content of genomeStyles(). (and this wasn't necessary previously).
To troubleshoot, I created a simplified subset of only a few SNPs from a single chromosome, and placed them into a VCF file, or a BED file. I also forged a new BSgenome object with just this single chromosome (from Cavefish). And I get the errors demonstrated below. I also tried essentially the same thing with yeast (Scerevisiae), which is listed within genomeStyles(), and I get success with either VCF or BED files, even if I use a custom forged yeast genome.
Not quite sure how to proceed. Success with a custom built yeast genome but not cavefish genome, and genomeStyles() contains yeast but not others has me confused. I don't know what to try next. How does one use motifBreakR with a custom genome? (I knew the answer in 2018, but not 2021).
library(motifbreakR) library(BSgenome.Amexicanus.UCSC.astMexNC) # VCF file with 23 SNPs from a single chromosome vp <- "nnra_snp_NC_035897_20_50k.vcf.gz" snps.vcf <- snps.from.file(file=vp, search.genome=Amexicanus, format='vcf')
This fails with the following error:
Error in getListElement(x, i, ...) : GRanges objects don't support [[, as.list(), lapply(), or unlist() at the moment In addition: Warning message: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
So I try loading a simple BED file with a few SNPs
# BED file with 10 SNPs from single chromosome vp <- "head_NC_035897.bed" snps.bed <- snps.from.file(file=vp, search.genome=Amexicanus, format='bed')
This fails with:
Error in seqlevelsStyle(seqlevels(x)) : The style does not have a compatible entry for the species supported by Seqname. Please see genomeStyles() for supported species/style
This error seems to imply that motifBreakR can not be used with custom genomes, but that was not the case in the past. I have no idea how to add my chromosome styles to the listing of genomeStyles().
Using a canonical genome (yeast, which is included in the genomeStyles() list, I have success with VCF or BED files.
library(BSgenome.Scerevisiae.UCSC.sacCer3) # a few yeast SNPs vp <- "snps_final.ann.vcf.gz" snps.vcf <- snps.from.file(file=vp, search.genome=Scerevisiae, format='vcf')
Success with a warning:
Warning message: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
Try BED file:
# yeast SNPs in BED format vp <- "snps_final.ann.motifbreakR2.bed" snps.bed <- snps.from.file(file=vp, search.genome=Scerevisiae, format='bed')
Success. Both yeast examples also work if I use a custom forged yeast genome
> sessionInfo( ) R version 4.0.0 (2020-04-24) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS: /n/apps/CentOS7/install/r-4.0.0/lib64/R/lib/libRblas.so LAPACK: /n/apps/CentOS7/install/r-4.0.0/lib64/R/lib/libRlapack.so locale:  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8  LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8  LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C  LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:  grid parallel stats4 stats graphics grDevices utils datasets methods  base other attached packages:  BSgenome.Scerevisiae.UCSC.sacCer3_1.4.0 BSgenome.Amexicanus.UCSC.astMexNC_1.4.2  motifbreakR_2.2.0 MotifDb_1.30.0  BSgenome_1.56.0 Biostrings_2.56.0  XVector_0.28.0 rtracklayer_1.48.0  GenomicRanges_1.40.0 GenomeInfoDb_1.24.2  IRanges_2.22.2 S4Vectors_0.26.1  BiocGenerics_0.34.0 pheatmap_1.0.12  knitr_1.31 RColorBrewer_1.1-2 loaded via a namespace (and not attached):  colorspace_2.0-0 grImport2_0.2-0 ellipsis_0.3.1  biovizBase_1.36.0 htmlTable_2.1.0 base64enc_0.1-3  dichromat_2.0-0 rstudioapi_0.13 rGADEM_2.36.0  bit64_4.0.5 AnnotationDbi_1.50.3 fansi_0.4.2  xml2_1.3.2 splines_4.0.0 motifStack_1.32.1  cachem_1.0.4 ade4_1.7-16 Formula_1.2-4  splitstackshape_1.4.8 Rsamtools_2.4.0 seqLogo_1.54.3  cluster_2.1.1 dbplyr_2.1.0 png_0.1-7  compiler_4.0.0 httr_1.4.2 backports_1.2.1  assertthat_0.2.1 Matrix_1.3-2 fastmap_1.1.0  lazyeval_0.2.2 htmltools_0.5.1.1 prettyunits_1.1.1  tools_4.0.0 TFMPvalue_0.0.8 gtable_0.3.0  glue_1.4.2 GenomeInfoDbData_1.2.3 dplyr_1.0.5  rappdirs_0.3.3 Rcpp_1.0.6 Biobase_2.48.0  vctrs_0.3.6 debugme_1.1.0 xfun_0.22  stringr_1.4.0 lifecycle_1.0.0 ensembldb_2.12.1  XML_3.99-0.6 zlibbioc_1.34.0 MASS_7.3-53.1  scales_1.1.1 VariantAnnotation_1.34.0 ProtGenerics_1.20.0  hms_1.0.0 SummarizedExperiment_1.18.2 AnnotationFilter_1.12.0  yaml_2.2.1 curl_4.3 memoise_2.0.0  gridExtra_2.3 ggplot2_3.3.3 MotIV_1.43.0  biomaRt_2.44.4 rpart_4.1-15 latticeExtra_0.6-29  stringi_1.5.3 RSQLite_2.2.4 checkmate_2.0.0  GenomicFeatures_1.40.1 BiocParallel_1.22.0 rlang_0.4.10  pkgconfig_2.0.3 matrixStats_0.58.0 bitops_1.0-6  evaluate_0.14 lattice_0.20-41 purrr_0.3.4  GenomicAlignments_1.24.0 htmlwidgets_1.5.3 bit_4.0.4  tidyselect_1.1.0 magrittr_2.0.1 R6_2.5.0  generics_0.1.0 Hmisc_4.5-0 DelayedArray_0.14.1  DBI_1.1.1 pillar_1.5.1 foreign_0.8-81  survival_3.2-10 RCurl_1.98-1.3 nnet_7.3-15  tibble_3.1.0 crayon_1.4.1 utf8_1.2.1  BiocFileCache_1.12.1 rmarkdown_2.7 jpeg_0.1-8.1  progress_1.2.2 data.table_1.14.0 blob_1.2.1  digest_0.6.27 openssl_1.4.3 munsell_0.5.0  Gviz_1.32.0 askpass_1.1
Update April 20: Using R 4.0.3 and BSgenome 1.58.0, if I rebuild the astMex custom BSgenome and then attempt to load a BED file of SNPs as above, there's a new error:
Error in mapSeqlevels(sequence, seqlevelsStyle(search.genome)) : supplied seqname style "RefSeq" is not supported
I think this is from GenomeInfoDb because it is detecting RefSeq style chr names, but I have no idea why this matters.
Update April 27: A custom genome object in which all chromosome names begin with "chr" followed by an integer functions ok. So for now I can simply translate all my chromosome names to a new space for this part of the analysis, and translate them back afterwards. Suboptimal strategy, but works in a pinch.