Extract genomic features from a TxDb-like object
1
0
Entering edit mode
rbenel ▴ 40
@rbenel-13642
Last seen 21 months ago
Israel

Hi, 

I am trying to extract promoter sequences from a TxDB like object. When I try the following lines of code, the GRange object that is formed does not have IDs because there is no use.name = T argument, which means when I try to extract the sequences there is no match. I know this because if I limit my promoter annotations to just the first output, promoters_txdb[1] and then extract the sequences I receive a DNA string set list. 

 

library(GenomicFeatures)
library(GenomicRanges)
library(BSgenome.Athaliana.TAIR.TAIR9)
library(TxDb.Athaliana.BioMart.plantsmart28)
library(Biostrings)
# The entire athaliana genome
bs_athaliana <- BSgenome.Athaliana.TAIR.TAIR9 

# transcript DB information
txdb_athaliana <- TxDb.Athaliana.BioMart.plantsmart28

head(seqlevels(txdb_athaliana))

#all genes 
promoters_txdb <- transcriptsBy(txdb_athaliana, by = "gene")

#extract sequences 
promoter_seqs <- getPromoterSeq(promoters_txdb, bs_athaliana, upstream = 1000, downstream = 100)


Error: 
Error in extractList(x, at) : some ranges are out of bounds

 

TxDB genomicfeatures promoter getpromoterseq • 1.9k views
ADD COMMENT
0
Entering edit mode
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_IL.UTF-8       LC_NUMERIC=C               LC_TIME=en_IL.UTF-8        LC_COLLATE=en_IL.UTF-8    
 [5] LC_MONETARY=en_IL.UTF-8    LC_MESSAGES=en_IL.UTF-8    LC_PAPER=en_IL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_IL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Athaliana.BioMart.plantsmart28_3.2.2 BSgenome.Athaliana.TAIR.TAIR9_1.3.1000   
 [3] BSgenome_1.46.0                           rtracklayer_1.38.3                       
 [5] Biostrings_2.46.0                         XVector_0.18.0                           
 [7] GenomicFeatures_1.30.3                    AnnotationDbi_1.40.0                     
 [9] Biobase_2.38.0                            GenomicRanges_1.30.3                     
[11] GenomeInfoDb_1.14.0                       IRanges_2.12.0                           
[13] S4Vectors_0.16.0                          BiocGenerics_0.24.0                      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15               pillar_1.2.1               compiler_3.4.3             prettyunits_1.0.2         
 [5] bitops_1.0-6               tools_3.4.3                zlibbioc_1.24.0            progress_1.1.2            
 [9] biomaRt_2.35.11            digest_0.6.15              bit_1.1-12                 lattice_0.20-35           
[13] RSQLite_2.0                memoise_1.1.0              tibble_1.4.2               pkgconfig_2.0.1           
[17] rlang_0.2.0                Matrix_1.2-12              DelayedArray_0.4.1         DBI_0.8                   
[21] yaml_2.1.17                GenomeInfoDbData_1.0.0     stringr_1.3.0              httr_1.3.1                
[25] grid_3.4.3                 bit64_0.9-7                R6_2.2.2                   BiocParallel_1.12.0       
[29] XML_3.98-1.10              RMySQL_0.10.14             blob_1.1.0                 magrittr_1.5              
[33] matrixStats_0.53.1         GenomicAlignments_1.14.1   Rsamtools_1.30.0           SummarizedExperiment_1.8.1
[37] assertthat_0.2.0           stringi_1.1.6              RCurl_1.95-4.10           
ADD REPLY
1
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Hi,

Let's separate the 2 issues you seem to be having: (1) the use of different chromosome naming conventions for Athaliana between plantsmart28 and UCSC, and (2) getPromoterSeq() failing to extract out-of-bounds ranges.

Issue (1) can be addressed by setting the same sequence style on the 2 objects:

library(TxDb.Athaliana.BioMart.plantsmart28)
txdb <- TxDb.Athaliana.BioMart.plantsmart28
library(BSgenome.Athaliana.TAIR.TAIR9)
genome <- BSgenome.Athaliana.TAIR.TAIR9
seqlevels(txdb)
# [1] "1"  "2"  "3"  "4"  "5"  "Mt" "Pt"
seqlevels(genome)
# [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"
seqlevelsStyle(txdb)
# [1] "Ensembl" "MSU6"
seqlevelsStyle(genome)
# [1] "TAIR9"
seqlevelsStyle(txdb) <- seqlevelsStyle(genome)
seqlevels(txdb)
# [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"

Issue (2) is caused by getPromoterSeq() trying to extract sequences from beyond the bounds of the chromosomes, and failing to do so. getPromoterSeq() could certainly be improved to gracefully handle this situation. In the mean time, you can use the following workaround:

tx_by_gene <- transcriptsBy(txdb, by="gene")
prom_ranges <- promoters(tx_by_gene, upstream=1000, downstream=100)
# Warning messages about out-of-bound ranges...

## Let's take a look at some of those out-of-bound ranges:
prom_ranges[any(start(prom_ranges) < 1)]
# GRangesList object of length 1:
# $ATCG00010 
# GRanges object with 1 range and 2 metadata columns:
#       seqnames      ranges strand |     tx_id     tx_name
#          <Rle>   <IRanges>  <Rle> | <integer> <character>
#   [1]     ChrC [-23, 1076]      - |     41594 ATCG00010.1
# 
# -------
# seqinfo: 7 sequences (1 circular) from an unspecified genome

## getSeq() will fail to extract the corresponding sequences (which is
## actually what getPromoterSeq() tried to do internally):
getSeq(genome, prom_ranges)
# Error in extractList(x, at) : some ranges are out of bounds

## Let's trim the ranges in prom_ranges. Unfortunately there is no "trim"
## method for GRangesList objects, only for GRanges objects, so we use the
## unlist/relist trick to trim GRangesList object prom_ranges:
prom_ranges2 <- relist(trim(unlist(prom_ranges, use.names=FALSE)), prom_ranges)

## Then:
getSeq(genome, prom_ranges2)
# DNAStringSetList of length 33602
# [["AT1G01010"]] ATATTGCTATTTCTGCCAATATTAAAACTTCACTTAGGAAGACTTGAACCTACCACA...
# [["AT1G01020"]] AATATAAAATGCATTTTAATATTCAAAAGAATGCATGCATAGACTGATAGAAAAGAA...
# [["AT1G01030"]] TTGAACAAACAGACACGTATTGATTATAGTATTTCTGCTATTGATAAGTTTTTAAAC...
# [["AT1G01040"]] ATGCGTGTTGGATGCTTACAAAGAAGAATTTGTTTTCGGCGTTAGCCAAAAGTCGAT...
# [["AT1G01046"]] CGCGATTCTTTTTGGCAATGAGCTGGATGCAGAGGTTGTTGTTTCTCTCTCCATGGA...
# [["AT1G01050"]] GGTGTTGGAACATGCAAGAGTCTTAAAACGAGACAGACAGGATTTAAGCCATACAAG...
# [["AT1G01060"]] TGGTTATTTTGGAATAATTTCGGTTATTTCAATTAGATTCGGGTAGTTCAGTTCTTC...
# [["AT1G01070"]] GACATAATCTGCAAAGAATACGCAATATAATATTAGTAAATGAGTGGCAATTTCTCA...
# [["AT1G01073"]] AGTTCTAATTTTGCTTTCTGTTTTTAATTTAAATTTTGGATCAATAAGAAGAGACAT...
# [["AT1G01080"]] AAACAACCTTCTCTTTAATTTGAATTTTTTTTTGCAGCTGAGAAAGCCAAATACGCG...
# ...
# <33592 more elements>

Note that you end up with a DNAStringSetList object "parallel" to prom_ranges2, prom_ranges, and tx_by_gene i.e. with one list element per gene. Each list element contains one sequence per transcript in the gene.

I'm not sure I understood your comment about the use.names argument. This doesn't seem to have anything to do with issues (1) and (2).

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Hi Herve, 

Thanks so much, I was successfully able to get the DNAStringSetList; however now I can't write it to a FASTA file using

promoter_seqs <- getSeq(genome, prom_ranges2) #following trim get sequences 

writeXStringSet(promoter_seqs, promoter_seq.file) # save sequences as fasta file

 

Error: in writeXStringSet(promoter_seqs, promoter_seq.file) : 'x' must be an XStringSet object

 

Any idea why this is? The object is definitely a DNAStringSet object.  

 

ADD REPLY
0
Entering edit mode

It's a DNAStringSetList object, not a DNAStringSet object. A DNAStringSetList object is a list-like object where each list element is a DNAStringSet object. Like with an ordinary list, you can extract individual list elements of a DNAStringSetList object with [[ or unlist the object with unlist(). Both, [[ and unlist() return a DNAStringSet object. You should be able to pass the result of unlist(promoter_seqs) to writeXStringSet().

H.

ADD REPLY
0
Entering edit mode

Thanks Herve!

ADD REPLY

Login before adding your answer.

Traffic: 441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6