Search
Question: mm10 ENSEMBL not available for GOSEQ
2
3.5 years ago by
Switzerland
Merienne Nicolas120 wrote:

Hello all,

I am encountering problems with the goseq package. I did some analysis few months ago without any problems and I now need to reuse the script for another part of the project. So, I used exactly the same command lines with the new data but I get an error message when launching the nullp function. So I tried to re-do my old analysis and I have now the same error.

Here is my commands and the error:

pwf.D1 = nullp(D1.vector, 'mm10', 'ensGene')

Can't find mm10/ensGene length data in genLenDataBase... Trying to download from UCSC. This might take a couple of minutes.
Erreur dans value[[3L]](cond) :
Length information for genome mm10 and gene ID ensGene is not available. You will have to specify bias.data manually.
De plus : Messages d'avis :
1: package ‘rtracklayer’ was built under R version 3.1.3
2: package ‘GenomicRanges’ was built under R version 3.1.2

Here is the command I used few months ago with the comments during the analysis was running:

pwf.D1 = nullp(D1.vector, 'mm10', 'ensGene')

#Can't find mm10/ensGene length data in genLenDataBase... Trying to download from UCSC. #This might take a couple of minutes.
#Message d'avis :
#In pcls(G) : initial point very close to some inequality constraints

Because I updated my Bioconductor packages recently (I think I used goseq_1.16.2 before) and I am using an old R version (still the one for Snow Leopard), I am thinking that it could be a problem of version compatibility... Here is my sessionInfo():

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] limma_3.22.7          rtracklayer_1.26.3    GenomicRanges_1.18.4  goseq_1.18.0          AnnotationDbi_1.28.2
[6] GenomeInfoDb_1.2.5    IRanges_2.0.1         S4Vectors_0.4.0       Biobase_2.26.0        BiocGenerics_0.12.1
[11] RSQLite_1.0.0         DBI_0.3.1             geneLenDataBase_1.1.1 BiasedUrn_1.06.1

loaded via a namespace (and not attached):
[1] base64enc_0.1-2         BatchJobs_1.6           BBmisc_1.9              BiocParallel_1.0.3
[5] biomaRt_2.22.0          Biostrings_2.34.1       bitops_1.0-6            brew_1.0-6
[9] checkmate_1.5.3         codetools_0.2-11        colorspace_1.2-6        digest_0.6.8
[13] fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.2 GenomicFeatures_1.18.7
[17] ggplot2_1.0.1           GO.db_3.0.0             grid_3.1.1              gtable_0.1.2
[21] iterators_1.0.7         lattice_0.20-31         magrittr_1.5            MASS_7.3-40
[25] Matrix_1.2-0            mgcv_1.8-6              munsell_0.4.2           nlme_3.1-120
[29] plyr_1.8.2              proto_0.3-10            Rcpp_0.11.6             RCurl_1.95-4.6
[33] reshape2_1.4.1          Rsamtools_1.18.3        scales_0.2.4            sendmailR_1.2-1
[37] stringi_0.4-1           stringr_1.0.0           tools_3.1.1             XML_3.98-1.1
[41] XVector_0.6.0           zlibbioc_1.12.0       

So, does anyone has an idea of the origin of the problem? Should I have to reinstall a newer R version or just manually install goseq_1.16.2?

Best,

Nicolas

modified 3.5 years ago • written 3.5 years ago by Merienne Nicolas120

Hi Nicolas,

The UCSC download uses rtracklayer, so it may be that your issue is arising from it being build under the wrong version of R.

In the newest version of bioconductor (3.1) and goseq (1.20), we've movied towards fetching the gene lengths from TxDb rather than geneLenDataBase. So if you upgrade to R 3.2 and reinstall all the packages plus TxDb.Mmusculus.UCSC.mm10.ensGene, goseq shouldn't need to fetch the gene length from UCSC at all.

Cheers,

I updated my R and goseq versions and it is working now without trying to retrieve informations from UCSC (and transcript lengths are the same as before). Thank you for your help!

Best,

Nicolas

0
3.5 years ago by
idomtamir10
Austria
idomtamir10 wrote:

These are just warnings. The problem is that goseq has to be always updated with new builds and the maintainers don't seem to do this.

Seems like you have to do this yourself.

How to get length information for mm10 genome and geneID gene Symbol?

Maybe community could take iover to add gene length info to the database. I also intend to use goseq routinely.

> subset(supportedGenomes(), species == "Mouse")
db species      date                               name
48 mm10   Mouse Dec. 2011 Genome Reference Consortium GRCm38
49  mm9   Mouse Jul. 2007                      NCBI Build 37
50  mm8   Mouse Feb. 2006                      NCBI Build 36
51  mm7   Mouse Aug. 2005                      NCBI Build 35
AvailableGeneIDs
48
49 acembly,ccdsGene,ensGene,exoniphy,geneSymbol,geneid,genscan,knownGene,nscanGene,refGene,sgpGene,xenoRefGene
50          ccdsGene,ensGene,geneSymbol,geneid,genscan,knownGene,nscanGene,refGene,sgpGene,sibGene,xenoRefGene
51                                     ensGene,geneSymbol,geneid,genscan,knownGene,refGene,sgpGene,xenoRefGene

Thank you for your answer. However, the problems with mm10 genome is not just a warning but an error message (my R version is in French, so it is indicated by "Erreur dans value[[3L]](cond) : Length information for genome mm10 and gene ID ensGene is not available. You will have to specify bias.data manually.").

I knew that goseq has no informations for gene length with mm10 version, but this is the fact that now it is no more able to download it from UCSC website which is really surprising. I don't know if this is a problem from UCSC or if I just did something wrong during an update (more probable than a problem with UCSC...). But I am trying to manually construct the gene length vector...

Nico

I missed that it did work with mm10 once which is very strange. I just installed it fresh and it did not work with mm10 and ensGene. I don't think reinstalling will help you.

Hi,

Just wanted to mention that transcriptlengths() was added to GenomicFeatures in BioC 3.1 (see ?transcriptLengths):

library(TxDb.Mmusculus.UCSC.mm10.ensGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene
tx_lens <- transcriptLengths(txdb)
#   tx_id            tx_name            gene_id nexon tx_len
# 1     1 ENSMUST00000160944 ENSMUSG00000090025     1    501
# 2     2 ENSMUST00000082908 ENSMUSG00000064842     1    110
# 3     3 ENSMUST00000161581 ENSMUSG00000089699     2    250
# 4     4 ENSMUST00000180019 ENSMUSG00000096126     1    107
# 5     5 ENSMUST00000134384 ENSMUSG00000025903    10   1136
# 6     6 ENSMUST00000027036 ENSMUSG00000025903     9   2507


You still need to choose how you want to go from transcript lengths to gene lengths though (and I'm not sure there is a clear consensus on how this should be done):

gene2tx_lens <- splitAsList(tx_lens$tx_len, tx_lens$gene_id)
gene2tx_lens
# IntegerList of length 39017
# [["ENSMUSG00000000001"]] 3262
# [["ENSMUSG00000000003"]] 902 697
# [["ENSMUSG00000000028"]] 2143 1747 832
# [["ENSMUSG00000000031"]] 2286 1853 817 452 935
# [["ENSMUSG00000000037"]] 2931 2953 3440 4842 4847 3550 517
# [["ENSMUSG00000000049"]] 367 374 731 1190
# [["ENSMUSG00000000056"]] 4395 435 525
# [["ENSMUSG00000000058"]] 2733 974 694
# [["ENSMUSG00000000078"]] 4217
# [["ENSMUSG00000000085"]] 1132 3353 352 1776 623 2584 ... 530 1536 595 557 443
# ...
# <39007 more elements>

## If the length of a gene is the length of its longer transcript, then:
gene_lens <- max(gene2tx_lens)
# ENSMUSG00000000001 ENSMUSG00000000003 ENSMUSG00000000028 ENSMUSG00000000031
#               3262                902               2143               2286
# ENSMUSG00000000037 ENSMUSG00000000049
#               4847               1190

## If it's the mean of the lengths of its transcripts, then:
gene_lens <- mean(gene2tx_lens)

Cheers,

H.

0
3.5 years ago by
Switzerland
Merienne Nicolas120 wrote:

Idomtamir, thank you for your comments. I also used another computer with a more recent R version with the same result...

Anyway, I tried to manually construct my gene length vector but I am not sure to use the good tools. If I understand well, I have to correct for the length of transcripts (or the length of the full gene, contaning the introns?). I used GenomicFeatures and tips from the goseq vignette. Here is my code:

library(GenomicFeatures)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
txdb = makeTranscriptDbFromBiomart(biomart = 'ensembl', dataset = 'mmusculus_gene_ensembl') # I can not use makeTranscriptFromUCSC because it can not retrieve informations from UCSC website (also reported in another topic of the support site)

#### Alternatively, I also tried to use the TxDb.Mmusculus.UCSC.mm10.ensGene package with same results:

library(TxDb.Mmusculus.UCSC.mm10.ensGene)

txdb = library(TxDb.Mmusculus.UCSC.mm10.ensGene)

#### Then, I grouped transcripts (so different isoforms) by genes and determine the median of their length:

txdbByGene = transcriptsBy(txdb, 'gene')
lengthData = median(width(txdbByGene))

I was finally able to run the nullp function but when I compared with the analysis I did few months ago (for which nullp function was able to retrieve informations from UCSC automatically), I found that the gene length bias has not the same values now and thus, the pwf is different (not so much but it changes the final results). Here is an example of what I obtained now:

head(pwf.D1.2)
DEgenes bias.data        pwf
ENSMUSG00000000001       0   38867.0 0.01716926
ENSMUSG00000000028       0   31174.0 0.01614970
ENSMUSG00000000037       0   95425.0 0.01962995
ENSMUSG00000000049       0   22637.5 0.01409042
ENSMUSG00000000056       0     911.0 0.00828189
ENSMUSG00000000058       0    6080.0 0.01040964

And this is what I has before:

head(pwf.D1)
DEgenes bias.data         pwf
ENSMUSG00000000001       0    3262.0 0.017426846
ENSMUSG00000000028       0    1747.0 0.014699994
ENSMUSG00000000037       0    3440.0 0.017572570
ENSMUSG00000000049       0     552.5 0.006838229
ENSMUSG00000000056       0     525.0 0.006539448
ENSMUSG00000000058       0     974.0 0.010853943

Indeed, the length corresponds to the median of the genomic length (including introns). Could anyone help me to find what I did wrong? Do I have to substract from this the sum of the length of all the introns for each transcript? Or is there an easier tip to retrieve transcript length?