BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart
Malcolm Cook ★ 1.6k
@malcolm-cook-6293
United States
H'lo Ensembl Helpdesk,

I find that the R application that queries BioMart to build a local TranscriptDb from dmelanogaster_gene_ensembl has, beginning with Ensembl 65, started to produce this error:

    BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart

This is reproducible in R with the code:

    # first install the required packages
    source("http://bioconductor.org/biocLite.R")
    biocLite("GenomicFeatures")
    # use the package
    library(GenomicFeatures)
    # and try to build the TranscriptDb (expect error here)
    txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)

I assure you that the above worked in Ensembl 63. I did not test it in Ensembl 64. In Ensembl 65 it generates the error.

Is this sufficient for you to research and perhaps fix? Or perhaps you are already aware of the issue?

I assume the problem is as reported by the error message and that there are indeed such data anomalies.

I am afraid that the R package does not produce a log of its processing that details the presumably errant records. If not, perhaps Marc Carlson, cc:ed as one of the GenomicFeatures developers, can produce such a log more easily than you or I. I will be happy to try if need be.

Thanks!

~Malcolm Cook - Computational Biology - Stowers Institute for Medical Research

From: Rhoda Kinsella [mailto:rhoda@ebi.ac.uk]
Sent: Monday, February 06, 2012 2:53 AM
To: Cook, Malcolm
Cc: Marc Carlson; bioconductor@r-project.org
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart

Hi Malcolm and Marc,

Please submit an Ensembl helpdesk ticket about this issue, along with a detailed example, to helpdesk@ensembl.org and we will look into it.
Kind regards
Rhoda

On 3 Feb 2012, at 20:32, Cook, Malcolm wrote:

Hi Marc, and other `library(GenomicFeatures)` users working in fly,

I just changed the Subject to keep alive one of the issues I still have, namely that I get the following error:

    library(GenomicFeatures)
    txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
    Download and preprocess the 'transcripts' data frame ... OK
    Download and preprocess the 'chrominfo' data frame ... OK
    Download and preprocess the 'splicings' data frame ...
    Error in .extractCdsRangesFromBiomartTable(bm_table) :
      BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart

Marc, you already observed that:

> the data for cds ranges and total cds length (both from biomaRt) no longer agree with each other. In other words, the data from the current drosophila ranges in biomaRt seems to disagree with itself, and so the code is refusing to make a package out of this data as a result. To get the 2nd issue fixed probably involves talking to ensembl about their CDS data for fly to see if we can resolve the discrepancy.

I would be happy to take this to them.

I still wonder: can you recommend a best way to get a more diagnostic trace from the attempt at txdb creation, so that we can correctly report the errant transcript(s) to the Ensembl team? I would be happy to take this up with the Ensembl team, but I need details that I don't know how to produce.

Finally, on the side, here is a tiny suggestion:

* change the default for circ_seqs in makeTranscriptDbFromBiomart to be NULL, instead of an organism-specific (human) value.
Regards,
--Malcolm

R version 2.14.0 (2011-10-31)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0
[4] GenomicRanges_1.6.6 IRanges_1.12.5

loaded via a namespace (and not attached):
[1] BSgenome_1.22.0 Biostrings_2.22.0 DBI_0.2-5 RCurl_1.9-5
[5] RSQLite_0.11.1 XML_3.9-4 biomaRt_2.10.0 rtracklayer_1.14.4
[9] tools_2.14.0 zlibbioc_1.0.0

_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Rhoda Kinsella Ph.D.
Ensembl Production Project Leader,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SD, UK.
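The thread asks how the errant transcripts could be identified for a helpdesk report. What follows is only a minimal sketch of one way to do that, not the check GenomicFeatures performs internally. It assumes an exon-level table (one row per exon) has already been downloaded from BioMart, and that any UTR range reported on a row lies inside that row's exon. The column names (transcript_id, exon_start, exon_end, utr5_start, utr5_end, utr3_start, utr3_end, cds_length) and the helper functions are hypothetical placeholders, and the data frame is made-up toy data, not real Ensembl records.

```r
# Toy stand-in for an exon-level BioMart table (one row per exon).
# All column names and values are hypothetical, for illustration only.
exons <- data.frame(
  transcript_id = c("tx1", "tx1", "tx2", "tx2"),
  exon_start    = c(100, 300, 1000, 1200),
  exon_end      = c(199, 399, 1099, 1299),
  utr5_start    = c(100, NA, 1000, NA),
  utr5_end      = c(129, NA, 1009, NA),
  utr3_start    = c(NA, 380, NA, 1290),
  utr3_end      = c(NA, 399, NA, 1299),
  cds_length    = c(150, 150, 200, 200),
  stringsAsFactors = FALSE
)

# Width of a closed interval; 0 when the range is absent (NA).
range_width <- function(start, end) {
  ifelse(is.na(start) | is.na(end), 0, end - start + 1)
}

# Return the transcripts whose CDS length inferred from exon and UTR
# ranges differs from the reported cds_length.
# Assumes each UTR range lies within the exon on the same row.
find_cds_mismatches <- function(tbl) {
  cds_width <- (tbl$exon_end - tbl$exon_start + 1) -
    range_width(tbl$utr5_start, tbl$utr5_end) -
    range_width(tbl$utr3_start, tbl$utr3_end)
  inferred <- tapply(cds_width, tbl$transcript_id, sum)
  reported <- tapply(tbl$cds_length, tbl$transcript_id, function(x) x[1])
  out <- data.frame(
    transcript_id       = names(inferred),
    inferred_cds_length = as.vector(inferred),
    reported_cds_length = as.vector(reported[names(inferred)]),
    stringsAsFactors    = FALSE
  )
  out[out$inferred_cds_length != out$reported_cds_length, , drop = FALSE]
}

mismatches <- find_cds_mismatches(exons)
print(mismatches)
```

On the toy data only tx2 is flagged (inferred 180 versus reported 200). A table like this, built from the real BioMart download, would give the list of transcript IDs to attach to a helpdesk ticket.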
TranscriptDb biomaRt GenomicFeatures