I've noticed that some of the newer genome assemblies (most notably mm10) are missing from geneLenDataBase. Would it be possible to add these new assemblies to geneLenDataBase?
For example, using Bioconductor release 3.0, I find:
I think the functionality is in goseq::getlength(), where the additional step is to map transcripts to genes and take the median transcript length. For mm10 / knownGene it would make sense to use the existing TxDb directly. And since I guess a typical scenario is "my lab is working on organism X with annotation Y and will have many questions about these annotations", maybe it's better to save the TxDb as a package for sharing with colleagues. So
(I see that Herve has added GenomicFeatures::transcriptLengths() to devel...)
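genelengths <- function(txdb, map=character()){
    ## txlengths() is not defined in this snippet; it is assumed to return
    ## a vector of transcript lengths named by transcript name
    len <- txlengths(txdb)
    if (missing(map))
        ## default: map transcript names to the gene ids stored in the TxDb
        map <- mapIds(txdb, names(len), "GENEID", "TXNAME")
    ## group the lengths by gene, then take the median within each gene
    median(splitAsList(len, map[names(len)]))
}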
(map is a named character vector meant to allow some flexibility in how genes are named; the TxDb uses Entrez IDs, but are these universally useful? The names of map are the transcript IDs, and the values are the corresponding genes.)
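To make the role of map concrete, here is a minimal base R sketch of the same split-and-median step. The transcript and gene names and the lengths are invented toy values, not real annotation, and plain split() stands in for splitAsList() so that no Bioconductor packages are needed.

```r
# Toy data, invented for illustration only (not real annotation).
len <- c(tx1 = 1200L, tx2 = 1800L, tx3 = 950L, tx4 = 400L)

# map: names are transcript IDs, values are the genes they belong to.
map <- c(tx1 = "geneA", tx2 = "geneA", tx3 = "geneB", tx4 = "geneA")

# Group the lengths by gene, then take the median within each group.
gene_len <- vapply(split(unname(len), map[names(len)]), median, numeric(1))
gene_len
# geneA: median of 400, 1200, 1800 is 1200; geneB: 950
```

Supplying a different map (for example, transcript IDs to gene symbols) changes only the grouping labels; the length calculation stays the same.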
It's indeed better to keep creation and querying of the TxDb separate, so I added transcriptLengths() for extracting the transcript lengths (and other metrics) from a TxDb object:
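A rough sketch of how the output of transcriptLengths() could be reduced to one median length per gene. This is an illustration, not the code from the original post: median_by_gene() is a hypothetical helper name, and the example assumes the GenomicFeatures and TxDb.Mmusculus.UCSC.mm10.knownGene packages are installed (the TxDb part is skipped otherwise).

```r
# Hypothetical helper: tl is a data.frame with columns 'tx_len' and
# 'gene_id', which is the layout transcriptLengths() returns.
median_by_gene <- function(tl) {
    # Transcripts without an assigned gene have gene_id NA; drop them.
    tl <- tl[!is.na(tl$gene_id), ]
    vapply(split(tl$tx_len, tl$gene_id), median, numeric(1))
}

# Assumption: both packages are installed; otherwise this part is skipped.
if (requireNamespace("GenomicFeatures", quietly = TRUE) &&
    requireNamespace("TxDb.Mmusculus.UCSC.mm10.knownGene", quietly = TRUE)) {
    txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene::TxDb.Mmusculus.UCSC.mm10.knownGene
    tl <- GenomicFeatures::transcriptLengths(txdb)
    print(head(median_by_gene(tl)))
}
```

The result is a numeric vector named by gene ID, which for this TxDb means Entrez IDs.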
geneLenDataBase is being deprecated. As mentioned above, goseq will fetch gene length information for mm10 and other new assemblies on the fly. Using a TxDb is a more elegant solution, so I'll add this to goseq in a future release.
I did send them a note alerting them to this post....