Search
Question: Extract genomic features from a TxDb-like object
0
gravatar for r.t.greenblatt
3 months ago by
r.t.greenblatt0 wrote:

Hi, 

I am trying to extract promoter sequences from a TxDB like object. When I try the following lines of code, the GRange object that is formed does not have IDs because there is no use.name = T argument, which means when I try to extract the sequences there is no match. I know this because if I limit my promoter annotations to just the first output, promoters_txdb[1] and then extract the sequences I receive a DNA string set list. 

 

library(GenomicFeatures)
library(GenomicRanges)
library(BSgenome.Athaliana.TAIR.TAIR9)
library(TxDb.Athaliana.BioMart.plantsmart28)
library(Biostrings)
# The entire athaliana genome
bs_athaliana <- BSgenome.Athaliana.TAIR.TAIR9 

# transcript DB information
txdb_athaliana <- TxDb.Athaliana.BioMart.plantsmart28

head(seqlevels(txdb_athaliana))

#all genes 
promoters_txdb <- transcriptsBy(txdb_athaliana, by = "gene")

#extract sequences 
promoter_seqs <- getPromoterSeq(promoters_txdb, bs_athaliana, upstream = 1000, downstream = 100)


Error: 
Error in extractList(x, at) : some ranges are out of bounds

 

ADD COMMENTlink modified 3 months ago by Hervé Pagès ♦♦ 13k • written 3 months ago by r.t.greenblatt0
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_IL.UTF-8       LC_NUMERIC=C               LC_TIME=en_IL.UTF-8        LC_COLLATE=en_IL.UTF-8    
 [5] LC_MONETARY=en_IL.UTF-8    LC_MESSAGES=en_IL.UTF-8    LC_PAPER=en_IL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_IL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Athaliana.BioMart.plantsmart28_3.2.2 BSgenome.Athaliana.TAIR.TAIR9_1.3.1000   
 [3] BSgenome_1.46.0                           rtracklayer_1.38.3                       
 [5] Biostrings_2.46.0                         XVector_0.18.0                           
 [7] GenomicFeatures_1.30.3                    AnnotationDbi_1.40.0                     
 [9] Biobase_2.38.0                            GenomicRanges_1.30.3                     
[11] GenomeInfoDb_1.14.0                       IRanges_2.12.0                           
[13] S4Vectors_0.16.0                          BiocGenerics_0.24.0                      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15               pillar_1.2.1               compiler_3.4.3             prettyunits_1.0.2         
 [5] bitops_1.0-6               tools_3.4.3                zlibbioc_1.24.0            progress_1.1.2            
 [9] biomaRt_2.35.11            digest_0.6.15              bit_1.1-12                 lattice_0.20-35           
[13] RSQLite_2.0                memoise_1.1.0              tibble_1.4.2               pkgconfig_2.0.1           
[17] rlang_0.2.0                Matrix_1.2-12              DelayedArray_0.4.1         DBI_0.8                   
[21] yaml_2.1.17                GenomeInfoDbData_1.0.0     stringr_1.3.0              httr_1.3.1                
[25] grid_3.4.3                 bit64_0.9-7                R6_2.2.2                   BiocParallel_1.12.0       
[29] XML_3.98-1.10              RMySQL_0.10.14             blob_1.1.0                 magrittr_1.5              
[33] matrixStats_0.53.1         GenomicAlignments_1.14.1   Rsamtools_1.30.0           SummarizedExperiment_1.8.1
[37] assertthat_0.2.0           stringi_1.1.6              RCurl_1.95-4.10           
ADD REPLYlink written 3 months ago by r.t.greenblatt0
1
gravatar for Hervé Pagès
3 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

Let's separate the 2 issues you seem to be having: (1) the use of different chromosome naming conventions for Athaliana between plantsmart28 and UCSC, and (2) getPromoterSeq() failing to extract out-of-bounds ranges.

Issue (1) can be addressed by setting the same sequence style on the 2 objects:

library(TxDb.Athaliana.BioMart.plantsmart28)
txdb <- TxDb.Athaliana.BioMart.plantsmart28
library(BSgenome.Athaliana.TAIR.TAIR9)
genome <- BSgenome.Athaliana.TAIR.TAIR9
seqlevels(txdb)
# [1] "1"  "2"  "3"  "4"  "5"  "Mt" "Pt"
seqlevels(genome)
# [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"
seqlevelsStyle(txdb)
# [1] "Ensembl" "MSU6"
seqlevelsStyle(genome)
# [1] "TAIR9"
seqlevelsStyle(txdb) <- seqlevelsStyle(genome)
seqlevels(txdb)
# [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"

Issue (2) is caused by getPromoterSeq() trying to extract sequences from beyond the bounds of the chromosomes, and failing to do so. getPromoterSeq() could certainly be improved to gracefully handle this situation. In the mean time, you can use the following workaround:

tx_by_gene <- transcriptsBy(txdb, by="gene")
prom_ranges <- promoters(tx_by_gene, upstream=1000, downstream=100)
# Warning messages about out-of-bound ranges...

## Let's take a look at some of those out-of-bound ranges:
prom_ranges[any(start(prom_ranges) < 1)]
# GRangesList object of length 1:
# $ATCG00010 
# GRanges object with 1 range and 2 metadata columns:
#       seqnames      ranges strand |     tx_id     tx_name
#          <Rle>   <IRanges>  <Rle> | <integer> <character>
#   [1]     ChrC [-23, 1076]      - |     41594 ATCG00010.1
# 
# -------
# seqinfo: 7 sequences (1 circular) from an unspecified genome

## getSeq() will fail to extract the corresponding sequences (which is
## actually what getPromoterSeq() tried to do internally):
getSeq(genome, prom_ranges)
# Error in extractList(x, at) : some ranges are out of bounds

## Let's trim the ranges in prom_ranges. Unfortunately there is no "trim"
## method for GRangesList objects, only for GRanges objects, so we use the
## unlist/relist trick to trim GRangesList object prom_ranges:
prom_ranges2 <- relist(trim(unlist(prom_ranges, use.names=FALSE)), prom_ranges)

## Then:
getSeq(genome, prom_ranges2)
# DNAStringSetList of length 33602
# [["AT1G01010"]] ATATTGCTATTTCTGCCAATATTAAAACTTCACTTAGGAAGACTTGAACCTACCACA...
# [["AT1G01020"]] AATATAAAATGCATTTTAATATTCAAAAGAATGCATGCATAGACTGATAGAAAAGAA...
# [["AT1G01030"]] TTGAACAAACAGACACGTATTGATTATAGTATTTCTGCTATTGATAAGTTTTTAAAC...
# [["AT1G01040"]] ATGCGTGTTGGATGCTTACAAAGAAGAATTTGTTTTCGGCGTTAGCCAAAAGTCGAT...
# [["AT1G01046"]] CGCGATTCTTTTTGGCAATGAGCTGGATGCAGAGGTTGTTGTTTCTCTCTCCATGGA...
# [["AT1G01050"]] GGTGTTGGAACATGCAAGAGTCTTAAAACGAGACAGACAGGATTTAAGCCATACAAG...
# [["AT1G01060"]] TGGTTATTTTGGAATAATTTCGGTTATTTCAATTAGATTCGGGTAGTTCAGTTCTTC...
# [["AT1G01070"]] GACATAATCTGCAAAGAATACGCAATATAATATTAGTAAATGAGTGGCAATTTCTCA...
# [["AT1G01073"]] AGTTCTAATTTTGCTTTCTGTTTTTAATTTAAATTTTGGATCAATAAGAAGAGACAT...
# [["AT1G01080"]] AAACAACCTTCTCTTTAATTTGAATTTTTTTTTGCAGCTGAGAAAGCCAAATACGCG...
# ...
# <33592 more elements>

Note that you end up with a DNAStringSetList object "parallel" to prom_ranges2, prom_ranges, and tx_by_gene i.e. with one list element per gene. Each list element contains one sequence per transcript in the gene.

I'm not sure I understood your comment about the use.names argument. This doesn't seem to have anything to do with issues (1) and (2).

Hope this helps,

H.

ADD COMMENTlink modified 3 months ago • written 3 months ago by Hervé Pagès ♦♦ 13k

Hi Herve, 

Thanks so much, I was successfully able to get the DNAStringSetList; however now I can't write it to a FASTA file using

promoter_seqs <- getSeq(genome, prom_ranges2) #following trim get sequences 

writeXStringSet(promoter_seqs, promoter_seq.file) # save sequences as fasta file

 

Error: in writeXStringSet(promoter_seqs, promoter_seq.file) : 'x' must be an XStringSet object

 

Any idea why this is? The object is definitely a DNAStringSet object.  

 

ADD REPLYlink written 3 months ago by r.t.greenblatt0

It's a DNAStringSetList object, not a DNAStringSet object. A DNAStringSetList object is a list-like object where each list element is a DNAStringSet object. Like with an ordinary list, you can extract individual list elements of a DNAStringSetList object with [[ or unlist the object with unlist(). Both, [[ and unlist() return a DNAStringSet object. You should be able to pass the result of unlist(promoter_seqs) to writeXStringSet().

H.

ADD REPLYlink modified 3 months ago • written 3 months ago by Hervé Pagès ♦♦ 13k

Thanks Herve!

ADD REPLYlink written 3 months ago by r.t.greenblatt0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 151 users visited in the last hour