Question

DESeq2 - extracting results - Gnomon gene model

0

Entering edit mode

sieminsk • 0

@3fc834aa

Last seen 2.6 years ago

Canada

I am working with a non-model organism for which there are 2 versions of the genome. I would like to use the newer version as it has a better resolution and completeness. There is no Ensembl gene model for this version of the genome. I have instead used Gnomon gene model included with the genome version deposited at NCBI. I was able to perform DESeq analysis. My questions is with regards to extracting fasta sequences from the original genome based on the results.

The results obtained from DESeq use the following gene ID format:

gene-NC_027757.2:10030825..10032141
gene-NC_027757.2:10036650..10043774
gene-NC_027757.2:10047118..10049010
gene-NC_027757.2:10076606..10077880
gene-NC_027757.2:10099173..10101111

The GFF file associated with the Gnomon gene model has the following entry format (per line):

NC_027757.2     Gnomon  gene    45762   45986   .       -       .       ID=geneNC_027757.2:45762..45986;description=gene.37187;gbkey=Gene;gene_biotype=protein_coding

The problem lies with the fact that the results format for each gene corresponds to the 9th column of the GFF file and is embedded with other information within this column. I am wondering if there is package such as Biostrings that allows me to retrieve the DNA sequences from the genome based on the Gnomon gene ID format. I have attempted to do this in Biostrings however, from what I can understand unless the "gene-NC_027757.2:10030825..10032141" is used as the gene name rather than being in the 9th column of the GFF file the Biostrings can't retrieve the corresponding sequences.

Are there packages or code chunk that you might be able to suggest which would allow me to use the results from DESeq2 to retrieve the corresponding genomic sequences within Bioconductor?

R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] stringr_1.4.0               genefilter_1.72.0          
 [3] ggplot2_3.3.2               PoiClaClu_1.0.2.1          
 [5] RColorBrewer_1.1-2          pheatmap_1.0.12            
 [7] DESeq2_1.30.0               BiocParallel_1.24.1        
 [9] GenomicAlignments_1.26.0    SummarizedExperiment_1.20.0
[11] MatrixGenerics_1.2.0        matrixStats_0.57.0         
[13] GenomicFeatures_1.42.1      AnnotationDbi_1.52.0       
[15] Biobase_2.50.0              Rsamtools_2.6.0            
[17] Biostrings_2.58.0           XVector_0.30.0             
[19] GenomicRanges_1.42.0        GenomeInfoDb_1.26.1        
[21] IRanges_2.24.0              S4Vectors_0.28.0           
[23] BiocGenerics_0.36.0        

loaded via a namespace (and not attached):
 [1] httr_1.4.2             bit64_4.0.5            splines_4.0.4         
 [4] assertthat_0.2.1       askpass_1.1            BiocFileCache_1.14.0  
 [7] blob_1.2.1             GenomeInfoDbData_1.2.4 yaml_2.2.1            
[10] progress_1.2.2         pillar_1.4.7           RSQLite_2.2.1         
[13] lattice_0.20-41        glue_1.4.2             digest_0.6.27         
[16] colorspace_2.0-0       Matrix_1.3-2           XML_3.99-0.5          
[19] pkgconfig_2.0.3        biomaRt_2.46.0         zlibbioc_1.36.0       
[22] purrr_0.3.4            xtable_1.8-4           scales_1.1.1          
[25] tibble_3.0.4           openssl_1.4.3          annotate_1.68.0       
[28] generics_0.1.0         ellipsis_0.3.1         withr_2.3.0           
[31] survival_3.2-7         magrittr_2.0.1         crayon_1.3.4          
[34] memoise_1.1.0          xml2_1.3.2             tools_4.0.4           
[37] prettyunits_1.1.1      hms_0.5.3              lifecycle_0.2.0       
[40] locfit_1.5-9.4         munsell_0.5.0          DelayedArray_0.16.0   
[43] compiler_4.0.4         tinytex_0.27           rlang_0.4.8           
[46] grid_4.0.4             RCurl_1.98-1.2         rstudioapi_0.13       
[49] rappdirs_0.3.1         bitops_1.0-6           gtable_0.3.0          
[52] DBI_1.1.0              curl_4.3               R6_2.5.0              
[55] dplyr_1.0.2            rtracklayer_1.50.0     bit_4.0.4             
[58] stringi_1.5.3          Rcpp_1.0.5             vctrs_0.3.5           
[61] geneplotter_1.68.0     dbplyr_2.0.0           tidyselect_1.1.0

DESeq2 Gnomon Biostrings • 958 views

ADD COMMENT • link 3.1 years ago sieminsk • 0

score 0 · Answer 1 · 2021-03-23

I am not sure I completely understand your question. There doesn't seem to be anything particularly useful in the 9th (actually 8th by my count) column.

> library(rtracklayer)
> download.file("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/686/985/GCF_000686985.2_Bra_napus_v2.0/GCF_000686985.2_Bra_napus_v2.0_genomic.gff.gz", "GCF_000686985.2_Bra_napus_v2.0_genomic.gff.gz")
> gr <- import("GCF_000686985.2_Bra_napus_v2.0_genomic.gff.gz")
> gr
GRanges object with 1794744 ranges and 60 metadata columns:
               seqnames      ranges strand |   source            type     score
                  <Rle>   <IRanges>  <Rle> | <factor>        <factor> <numeric>
        [1] NC_027757.2  1-35822559      + |   RefSeq          region        NA
        [2] NC_027757.2   6637-9236      - |   Gnomon          gene          NA
        [3] NC_027757.2   6637-9236      - |   Gnomon          mRNA          NA
        [4] NC_027757.2   9092-9236      - |   Gnomon          exon          NA
        [5] NC_027757.2   8813-8994      - |   Gnomon          exon          NA
        ...         ...         ...    ... .      ...             ...       ...
  [1794740] NC_004946.1   6765-7175      + |   RefSeq gene                   NA
  [1794741] NC_004946.1   6765-7175      + |   RefSeq CDS                    NA
  [1794742] NC_004946.1  8175-11411      - |   RefSeq gene                   NA
  [1794743] NC_004946.1  8175-11411      - |   RefSeq CDS                    NA
  [1794744] NC_004946.1 11314-11640      + |   RefSeq inverted_repeat        NA
                phase                     ID
            <integer>            <character>
        [1]      <NA> NC_027757.2:1..35822..
        [2]      <NA>      gene-LOC106345161
        [3]      <NA>     rna-XM_013814943.2
        [4]      <NA>  exon-XM_013814943.2-1
        [5]      <NA>  exon-XM_013814943.2-2
        ...       ...                    ...
  [1794740]      <NA>        gene-BrnaMpl_p5
  [1794741]         0        cds-NP_862327.1
  [1794742]      <NA>        gene-BrnaMpl_p6
  [1794743]         0        cds-NP_862328.1
  [1794744]      <NA> id-NC_004946.1:11314..

<snip>

This GRanges object is what you would need to get the sequences (note that the RefSeq IDs in the seqnames column correspond to the chromosomes). You would have to do some mapping from the row.names of your DESeq2, results to extract the chromosome and positions for those genes, but that's just a simple text parsing exercise.

Now you need a BSgenome object, which you can make using the BSgenome package. I'm not going to go into that, except to note that there is a vignette, and that there are some example BSgenomePkgSeeds in that package:

> dir(system.file("extdata/GentlemanLab", package="BSgenome"))
  [1] "BSgenome.Amellifera.BeeBase.assembly4-seed"    
  [2] "BSgenome.Amellifera.BeeBase.assembly4-tools"   
  [3] "BSgenome.Amellifera.UCSC.apiMel2-seed"         
  [4] "BSgenome.Amellifera.UCSC.apiMel2.masked-seed"  
  [5] "BSgenome.Athaliana.TAIR.04232008-seed"         
  [6] "BSgenome.Athaliana.TAIR.TAIR10.1-seed"         
  [7] "BSgenome.Athaliana.TAIR.TAIR10.1-tools"        
  [8] "BSgenome.Athaliana.TAIR.TAIR9-seed"            
  [9] "BSgenome.Athaliana.TAIR.TAIR9-tools"    

<snip>

And one of the A thaliana examples might be useful, being plants and all.

> readLines(dir(system.file("extdata/GentlemanLab", package="BSgenome"), full.names = TRUE)[6])
 [1] "Package: BSgenome.Athaliana.TAIR.TAIR10.1"                                                                                                                               
 [2] "Title: Full genome sequences for Arabidopsis thaliana (TAIR10.1)"                                                                                                        
 [3] "Description: Full genome sequences for Arabidopsis thaliana as provided by TAIR (TAIR10.1, RefSeq assembly accession: GCF_000001735.4) and stored in Biostrings objects."
 [4] "Version: 1.5.0"                                                                                                                                                          
 [5] "organism: Arabidopsis thaliana"                                                                                                                                          
 [6] "common_name: Thale cress"                                                                                                                                                
 [7] "genome: TAIR10.1"                                                                                                                                                        
 [8] "provider: TAIR"                                                                                                                                                          
 [9] "release_date: 2018/03/15"                                                                                                                                                
[10] "source_url: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001735.4"                                                                                                       
[11] "organism_biocview: Arabidopsis_thaliana"                                                                                                                                 
[12] "BSgenomeObjname: Athaliana"                                                                                                                                              
[13] "SrcDataFiles: GCF_000001735.4_TAIR10.1_genomic.fna.gz from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/"                           
[14] "PkgExamples: genome[[\"1\"]]"                                                                                                                                            
[15] "seqs_srcdir: /home/hpages/BSgenomeForge/srcdata/BSgenome.Athaliana.TAIR.TAIR10.1/seqs"                                                                                   
[16] "ondisk_seq_format: rds"

Which you could copy and adjust to suit. And then to get the sequences you would just use subsets of your GRanges object of interest to get the sequences, using getSeq in BSgenome.

Anyway, that's a lot, but you do work with a non-model organism so you are +/- on your own. So give it a shot and let us know if you get jammed up somewhere and have a specific question.