I am working with a non-model organism for which there are 2 versions of the genome. I would like to use the newer version as it has a better resolution and completeness. There is no Ensembl gene model for this version of the genome. I have instead used Gnomon gene model included with the genome version deposited at NCBI. I was able to perform DESeq analysis. My questions is with regards to extracting fasta sequences from the original genome based on the results.
The results obtained from DESeq use the following gene ID format:
gene-NC_027757.2:10030825..10032141
gene-NC_027757.2:10036650..10043774
gene-NC_027757.2:10047118..10049010
gene-NC_027757.2:10076606..10077880
gene-NC_027757.2:10099173..10101111
The GFF file associated with the Gnomon gene model has the following entry format (per line):
NC_027757.2 Gnomon gene 45762 45986 . - . ID=geneNC_027757.2:45762..45986;description=gene.37187;gbkey=Gene;gene_biotype=protein_coding
The problem lies with the fact that the results format for each gene corresponds to the 9th column of the GFF file and is embedded with other information within this column. I am wondering if there is package such as Biostrings that allows me to retrieve the DNA sequences from the genome based on the Gnomon gene ID format. I have attempted to do this in Biostrings however, from what I can understand unless the "gene-NC_027757.2:10030825..10032141" is used as the gene name rather than being in the 9th column of the GFF file the Biostrings can't retrieve the corresponding sequences.
Are there packages or code chunk that you might be able to suggest which would allow me to use the results from DESeq2 to retrieve the corresponding genomic sequences within Bioconductor?
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] stringr_1.4.0 genefilter_1.72.0
[3] ggplot2_3.3.2 PoiClaClu_1.0.2.1
[5] RColorBrewer_1.1-2 pheatmap_1.0.12
[7] DESeq2_1.30.0 BiocParallel_1.24.1
[9] GenomicAlignments_1.26.0 SummarizedExperiment_1.20.0
[11] MatrixGenerics_1.2.0 matrixStats_0.57.0
[13] GenomicFeatures_1.42.1 AnnotationDbi_1.52.0
[15] Biobase_2.50.0 Rsamtools_2.6.0
[17] Biostrings_2.58.0 XVector_0.30.0
[19] GenomicRanges_1.42.0 GenomeInfoDb_1.26.1
[21] IRanges_2.24.0 S4Vectors_0.28.0
[23] BiocGenerics_0.36.0
loaded via a namespace (and not attached):
[1] httr_1.4.2 bit64_4.0.5 splines_4.0.4
[4] assertthat_0.2.1 askpass_1.1 BiocFileCache_1.14.0
[7] blob_1.2.1 GenomeInfoDbData_1.2.4 yaml_2.2.1
[10] progress_1.2.2 pillar_1.4.7 RSQLite_2.2.1
[13] lattice_0.20-41 glue_1.4.2 digest_0.6.27
[16] colorspace_2.0-0 Matrix_1.3-2 XML_3.99-0.5
[19] pkgconfig_2.0.3 biomaRt_2.46.0 zlibbioc_1.36.0
[22] purrr_0.3.4 xtable_1.8-4 scales_1.1.1
[25] tibble_3.0.4 openssl_1.4.3 annotate_1.68.0
[28] generics_0.1.0 ellipsis_0.3.1 withr_2.3.0
[31] survival_3.2-7 magrittr_2.0.1 crayon_1.3.4
[34] memoise_1.1.0 xml2_1.3.2 tools_4.0.4
[37] prettyunits_1.1.1 hms_0.5.3 lifecycle_0.2.0
[40] locfit_1.5-9.4 munsell_0.5.0 DelayedArray_0.16.0
[43] compiler_4.0.4 tinytex_0.27 rlang_0.4.8
[46] grid_4.0.4 RCurl_1.98-1.2 rstudioapi_0.13
[49] rappdirs_0.3.1 bitops_1.0-6 gtable_0.3.0
[52] DBI_1.1.0 curl_4.3 R6_2.5.0
[55] dplyr_1.0.2 rtracklayer_1.50.0 bit_4.0.4
[58] stringi_1.5.3 Rcpp_1.0.5 vctrs_0.3.5
[61] geneplotter_1.68.0 dbplyr_2.0.0 tidyselect_1.1.0
Oh, and I should note that the 'GRanges` object can be subsetted, as it does include things like the whole extent of each chromosome.
Which is boring, so you could do
And like that.
Hello,
I have a follow-up question. I have followed your instructions. I am attempting to forge the genome for Bnapus using the BSGenomeForge function.
My seed file (in dcf format) is as follows:
Package: BSgenome.Bnapus.NCBI.Bra_napus_v2.0
Title: Full genome sequences for Brassica napus (Bra_napus_v2.0)
Description: Full genome sequences for Brassica napus as provided by NCBI (v2.0, RefSeq assembly accession: GCF_000686985.2)
Version: 1.0.0
organism: Brassica napus
common_name: Rape
genome: Bra_napus_v2.0
provider: NCBI
release_date: 2017/09/22
source_url: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Brassica_napus/latest_assembly_versions/GCF_000686985.2_Bra_napus_v2.0/
organism_biocview: Brassica_napus
seqnames: c("GCF_000686985.2_Bra_napus_v2.0_genomic.fa")
BSgenomeObjname: Bnapus
SrcDataFiles: GCF_000686985.2_Bra_napus_v2.0_genomic.fna.gz from https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Brassica_napus/latest_assembly_versions/GCF_000686985.2_Bra_napus_v2.0/GCF_000686985.2_Bra_napus_v2.0_genomic.fna.gz
PkgExamples: genome[["1"]]
seqs_srcdir: /home/edytas/scratch/bnapus_genome/BSgenomeForge/seqs_srcdir
ondisk_seq_format: fa
When running the following command: forgeBSgenomeDataPkg("/home/edytas/scratch/bnapus_genome/BSgenomeForge/seqs_srcdir/bnapus_seed.dcf")
I receive the following error: Error in .make_Seqinfo_from_genome(genome) : "Bra_napus_v2.0" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)
I have checked and this genome is not currently registered with GenomeInfoDb however I feel that this error has something to do with my seed file. Would you be able to take a look and let me know you see any glaring errors?