GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_leng
Malcolm Cook ★ 1.6k
@malcolm-cook-6293
Last seen 14 hours ago
United States
Hi Marc, and other `library(GenomicFeatures)` users working in fly,

I just changed the Subject to keep alive one of the issues I still have, namely that I get the following error:

    > library(GenomicFeatures)
    > txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
    +           dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
    Download and preprocess the 'transcripts' data frame ... OK
    Download and preprocess the 'chrominfo' data frame ... OK
    Download and preprocess the 'splicings' data frame ... Error in .extractCdsRangesFromBiomartTable(bm_table) :
      BioMart data anomaly: for some transcripts, the cds cumulative
      length inferred from the exon and UTR info doesn't match the
      "cds_length" attribute from BioMart

Marc, you already observed that:

>>> the data for cds ranges and total cds length (both from biomaRt) no
>>> longer agree with each other. In other words, the data from the current
>>> drosophila ranges in biomaRt seems to disagree with itself, and so the
>>> code is refusing to make a package out of this data as a result.
>>> To get the 2nd issue fixed probably involves talking to ensembl about
>>> their CDS data for fly to see if we can resolve the discrepancy.
>>
>> I would be happy to take this to them.

I still wonder:

> Can you recommend a best way to get a more diagnostic trace from the
> attempt at txdb creation so we can correctly report to ensembl team the
> errant transcript(s)?

I would be happy to take this up with the Ensembl team, but I need details which I don't know how to produce.

Finally, on the side, here is a tiny suggestion:

* change the default for circ_seqs in makeTranscriptDbFromBiomart to be NULL, instead of anything organism (human) specific.

Regards,

--Malcolm

    R version 2.14.0 (2011-10-31)
    Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

    locale:
    [1] C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0
    [4] GenomicRanges_1.6.6   IRanges_1.12.5

    loaded via a namespace (and not attached):
     [1] BSgenome_1.22.0    Biostrings_2.22.0  DBI_0.2-5          RCurl_1.9-5
     [5] RSQLite_0.11.1     XML_3.9-4          biomaRt_2.10.0     rtracklayer_1.14.4
     [9] tools_2.14.0       zlibbioc_1.0.0
biomaRt biomaRt • 1.7k views
@rhoda-kinsella-3200
Last seen 9.6 years ago
Hi Malcolm and Marc,

Please submit an Ensembl helpdesk ticket about this issue, along with a detailed example, to helpdesk@ensembl.org and we will look into it.

Kind regards
Rhoda

Rhoda Kinsella Ph.D.
Ensembl Production Project Leader,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.
Hi Rhoda and others,

I still need to check that this error, issued by the internal helper .extractCdsRangesFromBiomartTable() about "the cds cumulative length inferred from the exon and UTR info" not matching "the cds_length attribute from BioMart", is not a false positive.

I'm planning to patch the code in charge of this sanity check so that it issues a warning instead of an error and displays something more useful than just "for some transcripts etc...". It would be nice to know at least which transcripts are affected.

I'll keep you informed, thanks!
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
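The planned change from an error to a warning that names the affected transcripts could look roughly like the sketch below. This is only an illustration, not the actual GenomicFeatures code: the helper name `warn_cds_mismatch`, its arguments, and the transcript ids `tx1` to `tx3` are all hypothetical, and it assumes the inferred and reported lengths are already available as vectors with no missing values.

```r
# Hypothetical helper (not part of GenomicFeatures): warn, rather than stop,
# when inferred and reported CDS lengths disagree, and name the first few
# affected transcripts in the message.
warn_cds_mismatch <- function(tx_id, inferred, reported, max_shown = 6L) {
  bad <- tx_id[inferred != reported]
  if (length(bad) > 0L) {
    shown <- head(bad, max_shown)
    warning(
      "BioMart data anomaly: for ", length(bad), " transcript(s), the cds ",
      "cumulative length inferred from the exon and UTR info doesn't match ",
      "the \"cds_length\" attribute from BioMart: ",
      paste(shown, collapse = ", "),
      if (length(bad) > length(shown)) ", ..." else "",
      call. = FALSE
    )
  }
  invisible(bad)  # return the affected ids so the caller can inspect them
}

# Made-up ids and lengths, for illustration only.
tx  <- c("tx1", "tx2", "tx3")
bad <- withCallingHandlers(
  warn_cds_mismatch(tx, inferred = c(203, 855, 120), reported = c(204, 855, 120)),
  warning = function(w) {
    message("warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

Returning the affected ids invisibly means the processing can continue while still giving the caller something concrete to report upstream.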
Hi Rhoda, Malcolm, and others,

After taking a closer look at this, I can confirm that the reported "cds_length" looks wrong for some fly transcripts. Take for example the FBtr0079414 transcript (minus strand):

    > library(biomaRt)
    > mart1 <- useMart(biomart="ensembl", dataset="dmelanogaster_gene_ensembl")
    > attributes <- c("ensembl_transcript_id", "strand",
    +                 "rank", "exon_chrom_start", "exon_chrom_end",
    +                 "5_utr_start", "5_utr_end", "3_utr_start", "3_utr_end",
    +                 "cds_length")
    > filters <- "ensembl_transcript_id"
    > values <- "FBtr0079414"
    > getBM(attributes=attributes, filters=filters, values=values, mart=mart1)
      ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
    1           FBtr0079414     -1    1          7218909        7220029     7219112
    2           FBtr0079414     -1    2          7218643        7218853          NA
      5_utr_end 3_utr_start 3_utr_end cds_length
    1   7220029          NA        NA        204
    2        NA     7218643   7218853        204

2 exons: The 3' UTR (located on exon 2) spans the entire exon, so there is no CDS on this exon. The start of the 5' UTR (located on exon 1) is 203 bases upstream of the exon start, so the CDS length inferred from the exon and UTR info is 203. But the reported cds_length is 204. Something looks wrong.

For other transcripts, e.g. FBtr0300689 (plus strand), things look OK:

    > values <- "FBtr0300689"
    > getBM(attributes=attributes, filters=filters, values=values, mart=mart1)
      ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
    1           FBtr0300689      1    1             7529           8116        7529
    2           FBtr0300689      1    2             8193           9484          NA
      5_utr_end 3_utr_start 3_utr_end cds_length
    1      7679          NA        NA        855
    2        NA        8611      9484        855

2 exons: The end of the 5' UTR (located on exon 1) is 437 bases upstream of the exon end. The start of the 3' UTR (located on exon 2) is 418 bases downstream of the exon start. So the CDS total length is 437 + 418 = 855, as reported.

@Malcolm and other makeTranscriptDbFromBiomart() users: I'm about to commit a patch to this function so that this anomaly in the Ensembl data causes a warning instead of an error. The warning will also display the first 6 affected transcripts.

The patch will make it into GenomicFeatures 1.6.8 (release) and 1.7.14 (devel), which will become available via biocLite() in the next 24-36 hours.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
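For anyone who wants to repeat the arithmetic above without querying BioMart, here is a minimal sketch in base R. The rows are copied by hand from the two getBM() results in this post; the helper `utr_width` and the `check` data frame are names made up for this illustration, and the sketch assumes each exon carries at most one 5' UTR range and one 3' UTR range. It is not the code used inside GenomicFeatures.

```r
# Rows copied from the two getBM() outputs shown above.
bm <- data.frame(
  ensembl_transcript_id = rep(c("FBtr0079414", "FBtr0300689"), each = 2),
  strand           = c(-1L, -1L, 1L, 1L),
  rank             = c(1L, 2L, 1L, 2L),
  exon_chrom_start = c(7218909L, 7218643L, 7529L, 8193L),
  exon_chrom_end   = c(7220029L, 7218853L, 8116L, 9484L),
  `5_utr_start`    = c(7219112L, NA, 7529L, NA),
  `5_utr_end`      = c(7220029L, NA, 7679L, NA),
  `3_utr_start`    = c(NA, 7218643L, NA, 8611L),
  `3_utr_end`      = c(NA, 7218853L, NA, 9484L),
  cds_length       = c(204L, 204L, 855L, 855L),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: width of a UTR range, counting a missing UTR as 0.
utr_width <- function(start, end) ifelse(is.na(start), 0L, end - start + 1L)

# CDS bases on each exon = exon width minus whatever the UTRs cover.
exon_width   <- bm$exon_chrom_end - bm$exon_chrom_start + 1L
cds_per_exon <- exon_width -
  utr_width(bm[["5_utr_start"]], bm[["5_utr_end"]]) -
  utr_width(bm[["3_utr_start"]], bm[["3_utr_end"]])

# Sum per transcript and compare with the reported cds_length.
inferred <- tapply(cds_per_exon, bm$ensembl_transcript_id, sum)
reported <- tapply(bm$cds_length, bm$ensembl_transcript_id, function(x) x[1])
check <- data.frame(transcript = names(inferred),
                    inferred   = as.vector(inferred),
                    reported   = as.vector(reported),
                    stringsAsFactors = FALSE)
check$mismatch <- check$inferred != check$reported
print(check)
```

With these rows, FBtr0079414 comes out at 203 inferred bases against a reported 204, while FBtr0300689 agrees at 855. Working with widths rather than strand-aware offsets keeps the sketch independent of the strand column.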