I am using the cleanUpdTSeq package to analyze a set of potential 3' UTRs in the parasite Giardia lamblia. I am working from a dataframe with the coordinates of putative 3' UTRs, as per the package requirements. The issue arises with the function "buildFeatureVector" which adds the upstream and downstream sequences around the coordinates in the data frame. Here is the code I am using and the error message:
> peaks_grange <- BED2GRangesSeq(peaks, header = TRUE, withSeq = FALSE) > peaks_feature <- buildFeatureVector(peaks_grange, BSgenomeName = Glamblia, sampleType = 'unknkown', method = 'NaiveBayes', fetchSeq = TRUE) Error in gsub("^chr", chr, ignore.case = TRUE) : argument "x" is missing, with no default
I believe the issue comes from a compatibility issue with the Glamblia genome. I prepared the package for use with BSgenome by following the instructions here: https://bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf
The genome is forged from a single fasta file which I named
gl34.fa for the species name and genome version. The instructions speak of assembling a genome from multiple files; one for each chromosome, and I believe the issue comes from my single file containing all the data. Instead of having a file name for each chromosome, as follows (from the genome used in package examples) :
> seqnames(Drerio)  "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chr23" "chr24" "chr25"  "chrM"
I get the following:
> seqnames(Glamblia)  "gl34"
The genome is only available as a single fasta file (no twoBit format) which contains all 5 chromosomes (GLCHR01 through GLCHR05) and a few extra fragments (ex: AACB02000013_frag).
Below is the seed file used to forge/build the genome:
Package: BSgenome.Glamblia.GiardiaDB.35 Title: Genome sequence for Giardia lamblia (GiardiaDB version 35) Version: 0.1.0 License: GPL Description: BSGenome compatible Giardia genome for use in CleanupDTSeq package Author: D.Y.Bilodeau Maintainer: D.Y.Bilodeau <email@example.com> organism: Giardia lamblia common_name: Giardia provider: GiardiaDB provider_version: 35 release_date: 2017-08-15 release_name: release-35 source_url: http://giardiadb.org/common/downloads/release-35/GintestinalisAssemblageAWB/fasta/data/ organism_biocview: Giardia_lamblia BSgenomeObjname: Glamblia seqs_srcdir: /Users/DanB/R/BSGenome seqnames: paste("gl", "34", sep = "")
Is there a way to forge the genome that would make it compatible with this package? There also seems to be an issue with the use of gsub above where there is no x argument (I believe this line comes from the "getUpstreamSequence" function.
Below is my session info.
> sessionInfo() R version 3.5.2 (2018-12-20) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Mojave 10.14.3 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib locale:  en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8 attached base packages:  stats4 parallel stats graphics grDevices utils datasets methods base other attached packages:  BSgenome.Glamblia.GiardiaDB.35_0.1.0 cleanUpdTSeq_1.20.0 e1071_1.7-0.1 seqinr_3.4-5 BSgenome.Drerio.UCSC.danRer7_1.4.0  BSgenome_1.50.0 rtracklayer_1.42.1 Biostrings_2.50.2 XVector_0.22.0 GenomicRanges_1.34.0  GenomeInfoDb_1.18.1 IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0 loaded via a namespace (and not attached):  knitr_1.21 MASS_7.3-51.1 zlibbioc_1.28.0 GenomicAlignments_1.18.1 BiocParallel_1.16.5 lattice_0.20-38 tools_3.5.2  SummarizedExperiment_1.12.0 grid_3.5.2 Biobase_2.42.0 xfun_0.4 class_7.3-15 matrixStats_0.54.0 ade4_1.7-13  yaml_2.2.0 Matrix_1.2-15 GenomeInfoDbData_1.2.0 bitops_1.0-6 RCurl_1.95-4.11 DelayedArray_0.8.0 compiler_3.5.2