Hi,
Let's separate the 2 issues you seem to be having: (1) the use of different chromosome naming conventions for Athaliana between plantsmart28 and UCSC, and (2) getPromoterSeq()
failing to extract out-of-bounds ranges.
Issue (1) can be addressed by setting the same sequence style on the 2 objects:
library(TxDb.Athaliana.BioMart.plantsmart28)
txdb <- TxDb.Athaliana.BioMart.plantsmart28
library(BSgenome.Athaliana.TAIR.TAIR9)
genome <- BSgenome.Athaliana.TAIR.TAIR9
seqlevels(txdb)
# [1] "1" "2" "3" "4" "5" "Mt" "Pt"
seqlevels(genome)
# [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"
seqlevelsStyle(txdb)
# [1] "Ensembl" "MSU6"
seqlevelsStyle(genome)
# [1] "TAIR9"
seqlevelsStyle(txdb) <- seqlevelsStyle(genome)
seqlevels(txdb)
# [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"
Issue (2) is caused by getPromoterSeq()
trying to extract sequences from beyond the bounds of the chromosomes, and failing to do so. getPromoterSeq()
could certainly be improved to gracefully handle this situation. In the mean time, you can use the following workaround:
tx_by_gene <- transcriptsBy(txdb, by="gene")
prom_ranges <- promoters(tx_by_gene, upstream=1000, downstream=100)
# Warning messages about out-of-bound ranges...
## Let's take a look at some of those out-of-bound ranges:
prom_ranges[any(start(prom_ranges) < 1)]
# GRangesList object of length 1:
# $ATCG00010
# GRanges object with 1 range and 2 metadata columns:
# seqnames ranges strand | tx_id tx_name
# <Rle> <IRanges> <Rle> | <integer> <character>
# [1] ChrC [-23, 1076] - | 41594 ATCG00010.1
#
# -------
# seqinfo: 7 sequences (1 circular) from an unspecified genome
## getSeq() will fail to extract the corresponding sequences (which is
## actually what getPromoterSeq() tried to do internally):
getSeq(genome, prom_ranges)
# Error in extractList(x, at) : some ranges are out of bounds
## Let's trim the ranges in prom_ranges. Unfortunately there is no "trim"
## method for GRangesList objects, only for GRanges objects, so we use the
## unlist/relist trick to trim GRangesList object prom_ranges:
prom_ranges2 <- relist(trim(unlist(prom_ranges, use.names=FALSE)), prom_ranges)
## Then:
getSeq(genome, prom_ranges2)
# DNAStringSetList of length 33602
# [["AT1G01010"]] ATATTGCTATTTCTGCCAATATTAAAACTTCACTTAGGAAGACTTGAACCTACCACA...
# [["AT1G01020"]] AATATAAAATGCATTTTAATATTCAAAAGAATGCATGCATAGACTGATAGAAAAGAA...
# [["AT1G01030"]] TTGAACAAACAGACACGTATTGATTATAGTATTTCTGCTATTGATAAGTTTTTAAAC...
# [["AT1G01040"]] ATGCGTGTTGGATGCTTACAAAGAAGAATTTGTTTTCGGCGTTAGCCAAAAGTCGAT...
# [["AT1G01046"]] CGCGATTCTTTTTGGCAATGAGCTGGATGCAGAGGTTGTTGTTTCTCTCTCCATGGA...
# [["AT1G01050"]] GGTGTTGGAACATGCAAGAGTCTTAAAACGAGACAGACAGGATTTAAGCCATACAAG...
# [["AT1G01060"]] TGGTTATTTTGGAATAATTTCGGTTATTTCAATTAGATTCGGGTAGTTCAGTTCTTC...
# [["AT1G01070"]] GACATAATCTGCAAAGAATACGCAATATAATATTAGTAAATGAGTGGCAATTTCTCA...
# [["AT1G01073"]] AGTTCTAATTTTGCTTTCTGTTTTTAATTTAAATTTTGGATCAATAAGAAGAGACAT...
# [["AT1G01080"]] AAACAACCTTCTCTTTAATTTGAATTTTTTTTTGCAGCTGAGAAAGCCAAATACGCG...
# ...
# <33592 more elements>
Note that you end up with a DNAStringSetList object "parallel" to prom_ranges2
, prom_ranges
, and tx_by_gene
i.e. with one list element per gene. Each list element contains one sequence per transcript in the gene.
I'm not sure I understood your comment about the use.names
argument. This doesn't seem to have anything to do with issues (1) and (2).
Hope this helps,
H.