H'lo Ensembl Helpdesk,
I find that the R application that queries BioMart to build a local
TranscriptDb from dmelanogaster_gene_ensembl has, as of Ensembl 65,
begun to produce an error:
BioMart data anomaly: for some transcripts, the cds
cumulative length inferred from the exon and UTR info doesn't match
the "cds_length" attribute from BioMart
This is reproducible in R with the code:
# first install the required packages
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicFeatures")
# use the package
library(GenomicFeatures)
# and try to build the TranscriptDb (expect error here)
txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
    dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
I assure you that the above worked in Ensembl 63. I did not test it
in Ensembl 64. In Ensembl 65 it generates the error.
Is this sufficient for you to research and perhaps fix? Or perhaps
you already are aware of the issue? I assume the problem is as
reported by the error message and that there are indeed such data
anomalies.
I am afraid that the R package does not produce a log of its
processing that details the presumably errant records. Perhaps Marc
Carlson, cc:ed as one of the GenomicFeatures developers, can produce
such a log more easily than you or I. I will be happy to try if need
be.
Thanks!
~Malcolm Cook - Computational Biology - Stowers Institute for Medical
Research
From: Rhoda Kinsella [mailto:rhoda@ebi.ac.uk]
Sent: Monday, February 06, 2012 2:53 AM
To: Cook, Malcolm
Cc: Marc Carlson; bioconductor@r-project.org
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromBiomart -
BioMart data anomaly: for some transcripts, the cds cumulative length
inferred from the exon and UTR info doesn't match the "cds_length"
attribute from BioMart
Hi Malcolm and Marc,
Please submit an Ensembl helpdesk ticket about this issue along with a
detailed example to
(helpdesk@ensembl.org) and we will look
into it.
Kind regards
Rhoda
On 3 Feb 2012, at 20:32, Cook, Malcolm wrote:
Hi Marc, and other `library(GenomicFeatures)` users working in fly,
I just changed Subject to keep alive one of the issues I still have,
namely:
I get the following error:
library(GenomicFeatures)
txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
    dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... Error in
.extractCdsRangesFromBiomartTable(bm_table) :
BioMart data anomaly: for some transcripts, the cds cumulative length
inferred from the exon and UTR info doesn't match the "cds_length"
attribute from BioMart
Marc, you already observed that:
the data for cds ranges and total cds length (both from biomaRt) no
longer agree with each other. In other words, the data from the
current drosophila ranges in biomaRt seems to disagree with itself,
and so the code is refusing to make a package out of this data as a
result.
To get the 2nd issue fixed probably involves talking to ensembl about
their CDS data for fly to see if we can resolve the discrepancy.
I would be happy to take this to them.
I still wonder:
Can you recommend the best way to get a more diagnostic trace from the
attempt at txdb creation, so we can correctly report the errant
transcript(s) to the Ensembl team?
I would be happy to take this up with the Ensembl team, but I need
details that I don't know how to produce.
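One way to find the errant transcripts is to redo, by hand, the comparison that the error message describes: sum the exon lengths per transcript, subtract the UTR portions, and compare the result with the reported "cds_length". The following is only a minimal sketch of that idea on a toy table. It is not the internal check used by GenomicFeatures, the function names (width_or_zero, find_cds_mismatches) are hypothetical, and every column name except cds_length is illustrative rather than a confirmed BioMart attribute name. It also assumes each UTR range lies entirely within the exon on the same row.

```r
# Toy stand-in for a per-exon table downloaded from BioMart.
# Column names are assumptions for illustration only.
bm_table <- data.frame(
  ensembl_transcript_id = c("T1", "T1", "T2", "T2"),
  exon_chrom_start = c(100, 300, 1000, 1200),
  exon_chrom_end   = c(199, 399, 1099, 1299),
  utr5_start = c(100, NA, 1000, NA),
  utr5_end   = c(149, NA, 1019, NA),
  utr3_start = c(NA, 380, NA, 1280),
  utr3_end   = c(NA, 399, NA, 1299),
  cds_length = c(130, 130, 150, 150),
  stringsAsFactors = FALSE
)

# Width of a closed interval, treating a missing range as width 0.
width_or_zero <- function(start, end) {
  ifelse(is.na(start) | is.na(end), 0, end - start + 1)
}

# Return the transcripts whose inferred CDS length (exons minus UTRs)
# differs from the reported cds_length.
find_cds_mismatches <- function(tbl) {
  exon_len <- tbl$exon_chrom_end - tbl$exon_chrom_start + 1
  utr_len  <- width_or_zero(tbl$utr5_start, tbl$utr5_end) +
              width_or_zero(tbl$utr3_start, tbl$utr3_end)
  tx <- tbl$ensembl_transcript_id
  inferred <- tapply(exon_len - utr_len, tx, sum)
  reported <- tapply(tbl$cds_length, tx, function(x) x[1])
  # which() drops NA comparisons (e.g. transcripts with no cds_length)
  bad <- names(inferred)[which(inferred != reported[names(inferred)])]
  data.frame(transcript = bad,
             inferred   = as.vector(inferred[bad]),
             reported   = as.vector(reported[bad]),
             stringsAsFactors = FALSE)
}

mismatches <- find_cds_mismatches(bm_table)
print(mismatches)
```

In the toy data only T2 is inconsistent (inferred 160 versus reported 150). Against real data, a table of this kind listing transcript IDs would be the sort of detailed example a helpdesk ticket needs.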
Finally, on the side, here is a tiny suggestion:
* change the default for circ_seqs in
makeTranscriptDbFromBiomart to NULL, instead of an organism-specific
(human) value.
Regards,
--Malcolm
R version 2.14.0 (2011-10-31)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0
[4] GenomicRanges_1.6.6 IRanges_1.12.5
loaded via a namespace (and not attached):
[1] BSgenome_1.22.0    Biostrings_2.22.0  DBI_0.2-5          RCurl_1.9-5
[5] RSQLite_0.11.1     XML_3.9-4          biomaRt_2.10.0     rtracklayer_1.14.4
[9] tools_2.14.0       zlibbioc_1.0.0
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
Rhoda Kinsella Ph.D.
Ensembl Production Project Leader,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.