[Hinxton #251937] RE: GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn
@herve-pages-1542
Last seen 15 hours ago
Seattle, WA, United States
Hi Malcolm, Rhoda,

Did you hear back from the Ensembl helpdesk about this issue?

AFAICT the issue is still in Ensembl release 66 (released 10 days ago). For example, when querying directly the Ensembl Mart, I get the following for transcript FBtr0079414 (dmelanogaster):

  Exon Rank in Transcript | Chromosome Name | Strand
  1                       | 2L              | -1
  2                       | 2L              | -1

  Exon Chr Start (bp) | Exon Chr End (bp)
  7218909             | 7220029
  7218643             | 7218853

  5' UTR Start | 5' UTR End | 3' UTR Start | 3' UTR End
  7219112      | 7220029    |              |
               |            | 7218643      | 7218853

  CDS Start | CDS End | CDS Length
  1         | 203     | 204
  204       | 204     | 204

Note that querying directly the Ensembl Mart thru the web interface allows me to choose database Ensembl Genes 66 but querying with the Bioconductor biomaRt package is still accessing Ensembl Genes 65, I wonder why, but this is a different story...

So the "CDS Length" column (which, IIUC, is actually supposed to report the "Total CDS Length") is still incompatible with the exon/UTR starts and ends. If the exon/UTR starts and ends are correct then the total CDS length should be 203, not 204.

But also, it could be that the exon/UTR starts and ends are incorrect.

Finally note that there is no CDS region on exon 2 (the 3' UTR entirely spans exon 2) but the Ensembl Mart reports a CDS region of length 1 on this exon (CDS Start = CDS End = 204). This is probably why the reported CDS Length is 204 (at least it's consistent with the highest "CDS End" value).

Would be nice to see this dataset fixed.

Thanks,
H.


On 02/15/2012 06:33 AM, Cook, Malcolm wrote:
> Dear helpdesk at ensemblgenomes.org,
>
> I am following up on this issue which I understand Rhoda Kinsella at EBI to have forwarded to you.
>
> I originally identified and reported the issue, first to the bioconductor email list where Rhoda picked up on it and replied as below.
>
> I am trying to ensure that there is a tracked issue with ensemblgenomes.org with my name on it - not that it has to be resolved with a fix, just I'd like to be assured I know as you resolve it.
>
> If there is anything further I can provide pertaining to describing or resolving the issue, please advise.
>
> Of course the issue may be in fact even further upstream - in flybase. I've not tried to find the root cause myself.
>
> Thanks,
>
> Malcolm Cook
>
>
> From: Rhoda Kinsella <rhoda at ebi.ac.uk>
> Date: Wed, 8 Feb 2012 10:27:02 -0600
> To: Malcolm Cook <mec at stowers.org>
> Cc: Hervé Pagès <hpages at fhcrc.org>, "bioconductor at r-project.org" <bioconductor at r-project.org>
> Subject: Re: [Hinxton #251937] RE: [BioC] GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart
>
> Hi Malcolm and Hervé
> This appears to be a data issue with the Drosophila core database which was then propagated into BioMart. I have forwarded the issue to the Ensembl Genomes project as they maintain this database and they will respond as soon as possible.
> Regards
> Rhoda
>
>
> On 7 Feb 2012, at 21:35, Cook, Malcolm wrote:
>
> Herve, Thanks so much for digging into this.
>
> Rhoda, I had submitted a ticket as suggested to Ensembl helpdesk, and have included them as recipients to this message (after changing the subject to include the issue tracker number).
>
> Ensembl helpdesk, I expect that Herve's detailed report, below, provides an example of the reported data anomaly that will help resolve the underlying issue.
>
> Cheers,
>
> ~Malcolm
>
>
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Tuesday, February 07, 2012 2:37 PM
> To: Rhoda Kinsella; bioconductor at r-project.org
> Cc: Cook, Malcolm
> Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromBiomart -
> BioMart data anomaly: for some transcripts, the cds cumulative length
> inferred from the exon and UTR info doesn't match the "cds_length"
> attribute from BioMart
>
> Hi Rhoda, Malcolm, and others,
>
> So after taking a closer look at this, I can confirm that the reported
> "cds_length" looks wrong for some Fly transcripts. Take for example
> the FBtr0079414 transcript (minus strand):
>
> library(biomaRt)
> mart1 <- useMart(biomart="ensembl",
>                  dataset="dmelanogaster_gene_ensembl")
> attributes <- c("ensembl_transcript_id", "strand",
>                 "rank", "exon_chrom_start", "exon_chrom_end",
>                 "5_utr_start", "5_utr_end", "3_utr_start", "3_utr_end",
>                 "cds_length")
> filters <- "ensembl_transcript_id"
> values <- "FBtr0079414"
> getBM(attributes=attributes, filters=filters, values=values, mart=mart1)
>   ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
> 1           FBtr0079414     -1    1          7218909        7220029     7219112
> 2           FBtr0079414     -1    2          7218643        7218853          NA
>   5_utr_end 3_utr_start 3_utr_end cds_length
> 1   7220029          NA        NA        204
> 2        NA     7218643   7218853        204
>
> 2 exons: The 3' UTR (located on exon 2) spans the entire exon so no
> CDS on this exon. The start of the 5' UTR (located on exon 1) is 203
> bases upstream of the exon start. But the reported cds_length is 204.
> Something looks wrong.
>
> For other transcripts, e.g. FBtr0300689 (plus strand), things look OK:
>
> getBM(attributes=attributes, filters=filters, values=values, mart=mart1)
>   ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
> 1           FBtr0300689      1    1             7529           8116        7529
> 2           FBtr0300689      1    2             8193           9484          NA
>   5_utr_end 3_utr_start 3_utr_end cds_length
> 1      7679          NA        NA        855
> 2        NA        8611      9484        855
>
> 2 exons: The end of the 5' UTR (located on exon 1) is 437 bases
> upstream of the exon end. The start of the 3' UTR (located on exon 2)
> is 418 bases downstream of the exon start. So the CDS total length is
> 437 + 418 = 855, as reported.
>
> @Malcolm and other makeTranscriptDbFromBiomart() users: I'm about to
> commit a patch to this function so that this anomaly in the Ensembl
> data causes a warning instead of an error. Also the warning will
> display the first 6 affected transcripts. The patch will make it into
> GenomicFeatures 1.6.8 (release) and 1.7.14 (devel), which will become
> available via biocLite() in the next 24-36 hours.
>
> Cheers,
> H.
>
>
> On 02/06/2012 02:18 PM, Hervé Pagès wrote:
> Hi Rhoda and others,
>
> I still need to check that this error issued by internal helper
> .extractCdsRangesFromBiomartTable() about "the cds cumulative
> length inferred from the exon and UTR not matching the cds_length
> attribute from BioMart" is not a FALSE positive.
>
> I'm planning to patch the code in charge of this sanity check
> so it issues a warning instead of an error and it displays
> something more useful than just "for some transcripts etc...".
> Would be nice to know at least for which transcript.
>
> I'll keep you informed, thanks!
> H.
>
>
> On 02/06/2012 12:53 AM, Rhoda Kinsella wrote:
> Hi Malcolm and Marc,
> Please submit an Ensembl helpdesk ticket about this issue along with a
> detailed example to (helpdesk at ensembl.org) and we will look into it.
> Kind regards
> Rhoda
>
>
> On 3 Feb 2012, at 20:32, Cook, Malcolm wrote:
>
> Hi Marc, and other `library(GenomicFeatures)` users working in fly,
>
> I just changed Subject to keep alive one of the issues I still have,
> namely:
>
> I get the following error:
>
> library(GenomicFeatures)
> txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
>                                     dataset="dmelanogaster_gene_ensembl",
>                                     circ_seqs=NULL)
> Download and preprocess the 'transcripts' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Download and preprocess the 'splicings' data frame ... Error
> in .extractCdsRangesFromBiomartTable(bm_table) :
>   BioMart data anomaly: for some transcripts, the cds cumulative
>   length inferred from the exon and UTR info doesn't match the
>   "cds_length" attribute from BioMart
>
>
> Marc, you already observed that:
>
>   the data for cds ranges and total cds length (both from biomaRt) no
>   longer agree with each other. In other words, the data from the
>   current drosophila ranges in biomaRt seems to disagree with itself,
>   and so the code is refusing to make a package out of this data as a
>   result. To get the 2nd issue fixed probably involves talking to
>   ensembl about their CDS data for fly to see if we can resolve the
>   discrepancy. I would be happy to take this to them.
>
> I still wonder:
>
> Can you recommend a best way to get a more diagnostic trace from the
> attempt at txdb creation so we can correctly report to ensembl team
> the errant transcript(s)?
>
> I would be happy to take this up with Ensembl team, but need
> details which I don't know how to produce.
>
>
> Finally, on the side, here is a tiny suggestion:
>
> * change the default for circ_seqs in makeTranscriptDbFromBiomart
>   to be NULL, instead of any organism (human) specific.
>
> Regards,
>
> --Malcolm
>
>
> R version 2.14.0 (2011-10-31)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0
> [4] GenomicRanges_1.6.6   IRanges_1.12.5
>
> loaded via a namespace (and not attached):
> [1] BSgenome_1.22.0    Biostrings_2.22.0  DBI_0.2-5          RCurl_1.9-5
> [5] RSQLite_0.11.1     XML_3.9-4          biomaRt_2.10.0     rtracklayer_1.14.4
> [9] tools_2.14.0       zlibbioc_1.0.0
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> Rhoda Kinsella Ph.D.
> Ensembl Production Project Leader,
> European Bioinformatics Institute (EMBL-EBI),
> Wellcome Trust Genome Campus,
> Hinxton
> Cambridge CB10 1SD,
> UK.
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
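
For readers who want to reproduce the arithmetic discussed in this post without querying BioMart, here is a minimal base-R sketch. It is not the GenomicFeatures implementation: `infer_cds_length()` and `width_or_zero()` are hypothetical helpers, the UTR columns are renamed (`utr5_start` etc.) because the BioMart attribute names are not syntactic R names, and the rows are copied by hand from the getBM() output quoted above. It assumes each exon's CDS is simply the exon width minus any UTR width on that exon, and it reports mismatches as a warning listing up to 6 transcripts, in the spirit of the patch described above.

```r
# Rows copied from the getBM() output quoted in this thread (no network access needed).
bm <- data.frame(
  ensembl_transcript_id = c("FBtr0079414", "FBtr0079414", "FBtr0300689", "FBtr0300689"),
  rank             = c(1L, 2L, 1L, 2L),
  exon_chrom_start = c(7218909L, 7218643L, 7529L, 8193L),
  exon_chrom_end   = c(7220029L, 7218853L, 8116L, 9484L),
  utr5_start       = c(7219112L, NA, 7529L, NA),
  utr5_end         = c(7220029L, NA, 7679L, NA),
  utr3_start       = c(NA, 7218643L, NA, 8611L),
  utr3_end         = c(NA, 7218853L, NA, 9484L),
  cds_length       = c(204L, 204L, 855L, 855L),
  stringsAsFactors = FALSE
)

# Width of a closed interval [start, end]; NA (no UTR on this exon) counts as 0.
width_or_zero <- function(start, end) {
  ifelse(is.na(start), 0L, end - start + 1L)
}

# Hypothetical helper: per-transcript CDS length inferred from exon and UTR ranges.
# Assumption: CDS bases on an exon = exon width - 5' UTR width - 3' UTR width.
infer_cds_length <- function(bm) {
  exon_width  <- bm$exon_chrom_end - bm$exon_chrom_start + 1L
  cds_on_exon <- exon_width -
    width_or_zero(bm$utr5_start, bm$utr5_end) -
    width_or_zero(bm$utr3_start, bm$utr3_end)
  vapply(split(cds_on_exon, bm$ensembl_transcript_id), sum, integer(1))
}

inferred <- infer_cds_length(bm)

# cds_length is repeated on every exon row, so take the first value per transcript.
reported <- vapply(split(bm$cds_length, bm$ensembl_transcript_id),
                   function(x) x[1L], integer(1))

# Transcripts whose inferred and reported CDS lengths disagree.
bad <- names(inferred)[inferred != reported]
if (length(bad) > 0L) {
  warning("inferred CDS length differs from cds_length for: ",
          paste(head(bad, 6L), collapse = ", "))
}
```

With the rows above, `inferred` holds 203 for FBtr0079414 and 855 for FBtr0300689, so only FBtr0079414 ends up in `bad`.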
Cancer Organism biomaRt GenomicFeatures genomes • 1.7k views
@steffen-durinck-4465
Last seen 10.2 years ago
Hi Herve,

To answer your question:

"Bioconductor biomaRt package is still accessing Ensembl Genes 65, I wonder why, but this is a different story..."

By default biomaRt queries http://www.biomart.org, which hosts a copy of Ensembl. There is a time lag between an Ensembl update and an update of Ensembl on biomart.org.

An alternative is to query Ensembl directly by specifying the host:

> library(biomaRt)
> listMarts(host="uswest.ensembl.org")
               biomart              version
1 ENSEMBL_MART_ENSEMBL     Ensembl Genes 66
2     ENSEMBL_MART_SNP Ensembl Variation 66
> mart = useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="uswest.ensembl.org")

Note that the normal Ensembl host is www.ensembl.org, but for some reason when I use it from the US west coast, I end up on a redirect page to uswest.ensembl.org. This redirecting is something new, and biomaRt currently won't work if you use www.ensembl.org as host when you're based in the US, so use uswest.ensembl.org.

Cheers,
Steffen
Hi Steffen,

On 03/13/2012 02:37 PM, Steffen Durinck wrote:
> Hi Herve,
>
> To answer your question:
>
> "Bioconductor biomaRt package is still accessing Ensembl Genes 65,
> I wonder why, but this is a different story..."
>
> By default biomaRt queries http://www.biomart.org , which hosts a copy
> of Ensembl. There is a time lag between an Ensembl update and an update
> of Ensembl on biomart.org

Thanks Steffen for the details. Yes I knew about this lag, we see it at each new Ensembl release. I guess the grumbling was more like "why on earth does it take 2 weeks every time for the new Ensembl release to propagate to http://biomart.org?". Or, "why on earth do we have to wait 2 weeks after each new Ensembl release to see our unit tests break in the GenomicFeatures package?" ;-)

>
> An alternative is to query ensembl directly by specifying the host:
>
> > library(biomaRt)
> > listMarts(host="uswest.ensembl.org")
>                biomart              version
> 1 ENSEMBL_MART_ENSEMBL     Ensembl Genes 66
> 2     ENSEMBL_MART_SNP Ensembl Variation 66
> > mart = useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="uswest.ensembl.org")

Thanks for the reminder. I wish they could use the same biomart name: why "ensembl" on http://biomart.org and "ENSEMBL_MART_ENSEMBL" on http://uswest.ensembl.org? Now I'll stop grumbling...

>
> Note that the normal ensembl host is www.ensembl.org, but for some
> reason if you use this on the US west coast, I end up in a redirect
> page to uswest.ensembl.org . This redirecting is something new and
> biomaRt won't work currently if you use www.ensembl.org as host when
> you're based in the US, so use uswest.ensembl.org

Thanks for the extra details.

Cheers,
H.
Error > in .__extractCdsRangesFromBiomartTab__le(bm_table) : > BioMart data anomaly: for some transcripts, the cds cumulative > length inferred from the exon and UTR info doesn't match the > "cds_length" attribute from BioMart > > > Marc, you already observed that: > > the data for cds ranges and total cds length (both from biomaRt) no > longer agree with each other. In other words, the data from the > current > drosophila ranges in biomaRt seems to disagree with itself, and > so the > code is refusing to make a package out of this data as a result. > To get the 2nd issue fixed probably involves talking to ensembl > about > their CDS data for fly to see if we can resolve the discrepancy. > I would be happy to take this to them. > > I still wonder: > > Can you recommend a best way to get a more diagnostic trace from the > attempt at txdb creation so we can correctly report to ensembl team > the > errant transcript(s) ? > > I would be happy to take this up with Ensembl team, but, need > details which I don't know how to produce. > > > Finally, one the side, here is a tiny suggestion: > > * change the default for circ_seqs in makeTranscriptDbFromBiomart > to be NULL, instead of any organism (human) specific. 
> > Regards, > > --Malcolm > > > R version 2.14.0 (2011-10-31) > Platform: x86_64-apple-darwin9.8.0/x86___64 (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0 > [4] GenomicRanges_1.6.6 IRanges_1.12.5 > > loaded via a namespace (and not attached): > [1] BSgenome_1.22.0 Biostrings_2.22.0 DBI_0.2-5 > RCurl_1.9-5 > [5] RSQLite_0.11.1 XML_3.9-4 biomaRt_2.10.0 > rtracklayer_1.14.4 > [9] tools_2.14.0 zlibbioc_1.0.0 > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""><__mailto:Bioconductor at r-project.__org > <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > Rhoda Kinsella Ph.D. > Ensembl Production Project Leader, > European Bioinformatics Institute (EMBL-EBI), > Wellcome Trust Genome Campus, > Hinxton > Cambridge CB10 1SD, > UK. 
> > > [[alternative HTML version deleted]] > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""><__mailto:Bioconductor at r-project.__org > <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > <mailto:hpages at="" fhcrc.org=""><mailto:hpages__ at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > Rhoda Kinsella Ph.D. > Ensembl Production Project Leader, > European Bioinformatics Institute (EMBL-EBI), > Wellcome Trust Genome Campus, > Hinxton > Cambridge CB10 1SD, > UK. > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. 
Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
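
The arithmetic in the 7 Feb message quoted above can be replayed offline. What follows is only a minimal sketch, assuming a data frame shaped like the getBM() output shown there. The helper name find_cds_length_mismatches() is hypothetical (it is not part of GenomicFeatures or biomaRt, and it is not the code used by makeTranscriptDbFromBiomart()), and the table is typed in by hand from the two transcripts discussed in the thread rather than fetched from BioMart:

```r
# Hand-typed copy of the exon/UTR rows reported in the thread
# (no network access; this is NOT a live BioMart query).
tab <- data.frame(
  ensembl_transcript_id = c("FBtr0079414", "FBtr0079414",
                            "FBtr0300689", "FBtr0300689"),
  exon_chrom_start = c(7218909, 7218643, 7529, 8193),
  exon_chrom_end   = c(7220029, 7218853, 8116, 9484),
  `5_utr_start`    = c(7219112, NA, 7529, NA),
  `5_utr_end`      = c(7220029, NA, 7679, NA),
  `3_utr_start`    = c(NA, 7218643, NA, 8611),
  `3_utr_end`      = c(NA, 7218853, NA, 9484),
  cds_length       = c(204, 204, 855, 855),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: for each transcript, sum (exon width - UTR widths)
# over its exons and compare the total with the reported cds_length.
# Returns one row per transcript where the two numbers disagree.
find_cds_length_mismatches <- function(tab) {
  # Width of a closed interval; a missing UTR contributes 0 bases.
  width <- function(start, end) ifelse(is.na(start), 0, end - start + 1)

  exon_w <- tab[["exon_chrom_end"]] - tab[["exon_chrom_start"]] + 1
  utr5_w <- width(tab[["5_utr_start"]], tab[["5_utr_end"]])
  utr3_w <- width(tab[["3_utr_start"]], tab[["3_utr_end"]])
  cds_w  <- exon_w - utr5_w - utr3_w        # CDS bases on each exon

  id <- tab[["ensembl_transcript_id"]]
  inferred <- sapply(split(cds_w, id), sum)
  # cds_length is repeated on every exon row; take the first value.
  reported <- sapply(split(tab[["cds_length"]], id), function(x) x[1])

  res <- data.frame(transcript = names(inferred),
                    inferred   = unname(inferred),
                    reported   = unname(reported),
                    stringsAsFactors = FALSE)
  res[res$inferred != res$reported, , drop = FALSE]
}

bad <- find_cds_length_mismatches(tab)
print(bad)
```

On these two transcripts the sketch flags only FBtr0079414 (inferred 203, reported 204); FBtr0300689 comes out at 437 + 418 = 855 and passes. Returning the mismatching rows, rather than stopping at the first one, is in the spirit of the patch described in the thread, which reports the affected transcripts instead of failing outright.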
Hi Hervé,

For your information, Arek from the BioMart project posted this on the
biomart users mailing list. It explains the recent update delays:

Dear All,
I would like to apologize for the delays with the BioMart services
updates that generated a number of complaints on the BioMart users
mailing list. We are now in the process of transferring those services
and BioMart development out of OICR. I am told, that this process will
not be completed until at least the end of the month. In order to
minimize the disruption for the BioMart community I have asked OICR
leadership to perform an urgent update so the services should be
restored shortly.

a

--

Arek Kasprzyk, MD, MSc, PhD
BioMart Project Lead


Regards

Rhoda


On 13 Mar 2012, at 22:09, Hervé Pagès wrote:

> Hi Steffen,
>
> On 03/13/2012 02:37 PM, Steffen Durinck wrote:
>> Hi Herve,
>>
>> To answer your question:
>>
>> "Bioconductor biomaRt package is still accessing Ensembl Genes 65,
>> I wonder why, but this is a different story..."
>>
>> By default biomaRt queries http://www.biomart.org , which hosts a copy
>> of Ensembl. There is a time lag between an Ensembl update and an update
>> of Ensembl on biomart.org
>
> Thanks Steffen for the details. Yes I knew about this lag, we see it at
> each new Ensembl release. I guess the grumbling was more like "why on
> earth every time it takes 2 weeks for the new Ensembl release to
> propagate to http://biomart.org?". Or, "why on earth do we have to wait
> 2 weeks after each new Ensembl release to see our unit tests break in
> the GenomicFeatures package?" ;-)
>
>>
>> An alternative is to query ensembl directly by specifying the host:
>>
>> > library(biomaRt)
>> > listMarts(host="uswest.ensembl.org")
>>                biomart              version
>> 1 ENSEMBL_MART_ENSEMBL     Ensembl Genes 66
>> 2     ENSEMBL_MART_SNP Ensembl Variation 66
>> > mart = useMart("ENSEMBL_MART_ENSEMBL",
>>                  dataset="hsapiens_gene_ensembl",
>>                  host="uswest.ensembl.org")
>
> Thanks for the reminder. I wish they could use the same biomart name:
> why "ensembl" on http://biomart.org and "ENSEMBL_MART_ENSEMBL" on
> http://uswest.ensembl.org. Now I'll stop grumbling...
>
>>
>> Note that the normal ensembl host is www.ensembl.org, but for some
>> reason if you use this on the US west coast, I end up in a redirect
>> page to uswest.ensembl.org . This redirecting is something new and
>> biomaRt won't work currently if you use www.ensembl.org as host when
>> you're based in the US, so use uswest.ensembl.org
>
> Thanks for the extra details.
>
> Cheers,
> H.
>
>>
>> Cheers,
>> Steffen

Rhoda Kinsella Ph.D.
Ensembl Production Project Leader,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.
Hi Rhoda,

Thanks for letting us know! It seems that the new Ensembl release
(Ensembl Genes 66) finally propagated to www.biomart.org yesterday...

... and broke an example in the GenomicFeatures package (because
transcript id ENST00000400840 is gone), but that's OK, some breakage
is expected at each new release, and is generally easy to fix (better
to have some examples that break sometimes than no example at all).

Thanks again,
H.

On 03/19/2012 07:24 AM, Rhoda Kinsella wrote:
> Hi Hervé,
> For your information, Arek from the BioMart project posted this on the
> biomart users mailing list. It explains the recent update delays:
>
> Dear All,
> I would like to apologize for the delays with the BioMart services
> updates that generated a number of complaints on the BioMart users
> mailing list. We are now in the process of transferring those services
> and BioMart development out of OICR. I am told, that this process will
> not be completed until at least the end of the month. In order to
> minimize the disruption for the BioMart community I have asked OICR
> leadership to perform an urgent update so the services should be
> restored shortly.
>
> a
>
> --
>
> Arek Kasprzyk, MD, MSc, PhD
> BioMart Project Lead
>
>
> Regards
>
> Rhoda
>
>
> On 13 Mar 2012, at 22:09, Hervé Pagès wrote:
>
>> Hi Steffen,
>>
>> On 03/13/2012 02:37 PM, Steffen Durinck wrote:
>>> Hi Herve,
>>>
>>> To answer your question:
>>>
>>> "Bioconductor biomaRt package is still accessing Ensembl Genes 65,
>>> I wonder why, but this is a different story..."
>>>
>>> By default biomaRt queries http://www.biomart.org , which hosts a copy
>>> of Ensembl. There is a time lag between an Ensembl update and an update
>>> of Ensembl on biomart.org
>>
>> Thanks Steffen for the details. Yes I knew about this lag, we see it at
>> each new Ensembl release. I guess the grumbling was more like "why on
>> earth every time it takes 2 weeks for the new Ensembl release to
>> propagate to http://biomart.org?". Or, "why on earth do we have to wait
>> 2 weeks after each new Ensembl release to see our unit tests break in
>> the GenomicFeatures package?" ;-)
>>
>>>
>>> An alternative is to query ensembl directly by specifying the host:
>>>
>>> > library(biomaRt)
>>> > listMarts(host="uswest.ensembl.org")
>>>                biomart              version
>>> 1 ENSEMBL_MART_ENSEMBL     Ensembl Genes 66
>>> 2     ENSEMBL_MART_SNP Ensembl Variation 66
>>> > mart = useMart("ENSEMBL_MART_ENSEMBL",
>>>                  dataset="hsapiens_gene_ensembl",
>>>                  host="uswest.ensembl.org")
>>
>> Thanks for the reminder. I wish they could use the same biomart name:
>> why "ensembl" on http://biomart.org and "ENSEMBL_MART_ENSEMBL" on
>> http://uswest.ensembl.org. Now I'll stop grumbling...
>>
>>>
>>> Note that the normal ensembl host is www.ensembl.org, but for some
>>> reason if you use this on the US west coast, I end up in a redirect
>>> page to uswest.ensembl.org . This redirecting is something new and
>>> biomaRt won't work currently if you use www.ensembl.org as host when
>>> you're based in the US, so use uswest.ensembl.org
>>
>> Thanks for the extra details.
>>
>> Cheers,
>> H.
>>
>>>
>>> Cheers,
>>> Steffen
>>>
>>>
>>> 2012/3/13 Hervé Pagès <hpages at fhcrc.org>
>>>
>>> Hi Malcolm, Rhoda,
>>>
>>> Did you hear back from the Ensembl helpdesk about this issue?
>>>
>>> AFAICT the issue is still in Ensembl release 66 (released 10 days
>>> ago).
For example, when querying directly the Ensembl Mart, I get
>>> the following for transcript FBtr0079414 (dmelanogaster):
>>>
>>> Exon Rank in Transcript | Chromosome Name | Strand
>>> 1 | 2L | -1
>>> 2 | 2L | -1
>>>
>>> Exon Chr Start (bp) | Exon Chr End (bp)
>>> 7218909 | 7220029
>>> 7218643 | 7218853
>>>
>>> 5' UTR Start | 5' UTR End | 3' UTR Start | 3' UTR End
>>> 7219112 | 7220029 | |
>>> | | 7218643 | 7218853
>>>
>>> CDS Start | CDS End | CDS Length
>>> 1 | 203 | 204
>>> 204 | 204 | 204
>>>
>>> Note that querying directly the Ensembl Mart thru the web interface
>>> allows me to choose database Ensembl Genes 66 but querying with the
>>> Bioconductor biomaRt package is still accessing Ensembl Genes 65,
>>> I wonder why, but this is a different story...
>>>
>>> So the "CDS Length" column (which, IIUC, is actually supposed to
>>> report the "Total CDS Length") is still incompatible with the
>>> exon/UTR starts and ends. If the exon/UTR starts and ends
>>> are correct then the total CDS length should be 203, not 204.
>>>
>>> But also, it could be that the exon/UTR starts and ends are
>>> incorrect.
>>>
>>> Finally note that there is no CDS region on exon 2 (the 3' UTR
>>> entirely spans exon 2) but the Ensembl Mart reports a CDS region
>>> of length 1 on this exon (CDS Start = CDS End = 204). This is
>>> probably why then the reported CDS Length is 204 (at least it's
>>> consistent with the highest "CDS End" value).
>>>
>>> Would be nice to see this dataset fixed.
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>> On 02/15/2012 06:33 AM, Cook, Malcolm wrote:
>>>
>>> Dear helpdesk at ensemblgenomes.org,
>>>
>>> I am following up on this issue which I understand Rhoda
>>> Kinsella at EBI to have forwarded to you.
>>>
>>> I originally identified and reported the issue, first to the
>>> bioconductor email list where Rhoda picked up on it and replied
>>> as below.
>>>
>>> I am trying to ensure that there is a tracked issue with
>>> ensemblgenomes.org with my name on it - not that it has to be
>>> resolved with a fix, just I'd like to be assured I know as you
>>> resolve it.
>>>
>>> If there is anything further I can provide pertaining to
>>> describing or resolving the issue, please advise.
>>>
>>> Of course the issue may be in fact even further upstream - in
>>> flybase. I've not tried to find the root cause myself.
>>>
>>> Thanks,
>>>
>>> Malcolm Cook
>>>
>>>
>>> From: Rhoda Kinsella <rhoda at ebi.ac.uk>
>>> Date: Wed, 8 Feb 2012 10:27:02 -0600
>>> To: Malcolm Cook <mec at stowers.org>
>>> Cc: Hervé Pagès <hpages at fhcrc.org>, "bioconductor at r-project.org" <bioconductor at r-project.org>
>>> Subject: Re: [Hinxton #251937] RE: [BioC]
>>> GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data
>>> anomaly: for some transcripts, the cds cumulative length
>>> inferred from the exon and UTR info doesn't match the
>>> "cds_length" attribute from BioMart
>>>
>>> Hi Malcolm and Hervé
>>> This appears to be a data issue with the Drosophila core
>>> database which was then propagated into BioMart. I have
>>> forwarded the issue to the Ensembl Genomes project as they
>>> maintain this database and they will respond as soon as possible.
>>> Regards
>>> Rhoda
>>>
>>>
>>> On 7 Feb 2012, at 21:35, Cook, Malcolm wrote:
>>>
>>> Herve, Thanks so much for digging into this.
>>>
>>> Rhoda, I had submitted a ticket as suggested to Ensembl
>>> helpdesk, and have included them as recipients to this message
>>> (after changing the subject to include the issue tracker number).
>>>
>>> Ensembl helpdesk, I expect that Herve's detailed report, below,
>>> provides an example of the reported data anomaly that will help
>>> resolve the underlying issue.
>>>
>>> Cheers,
>>>
>>> ~Malcolm
>>>
>>>
>>> -----Original Message-----
>>> From: Hervé Pagès [mailto:hpages at fhcrc.org]
>>> Sent: Tuesday, February 07, 2012 2:37 PM
>>> To: Rhoda Kinsella; bioconductor at r-project.org
>>> Cc: Cook, Malcolm
>>> Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromBiomart -
>>> BioMart data anomaly: for some transcripts, the cds cumulative
>>> length inferred from the exon and UTR info doesn't match the
>>> "cds_length" attribute from BioMart
>>>
>>> Hi Rhoda, Malcolm, and others,
>>>
>>> So after taking a closer look at this, I can confirm that the
>>> reported "cds_length" looks wrong for some Fly transcripts.
Take for example
>>> the FBtr0079414 transcript (minus strand):
>>>
>>> library(biomaRt)
>>> mart1 <- useMart(biomart="ensembl",
>>>                  dataset="dmelanogaster_gene_ensembl")
>>> attributes <- c("ensembl_transcript_id", "strand",
>>>                 "rank", "exon_chrom_start", "exon_chrom_end",
>>>                 "5_utr_start", "5_utr_end", "3_utr_start", "3_utr_end",
>>>                 "cds_length")
>>> filters <- "ensembl_transcript_id"
>>> values <- "FBtr0079414"
>>> getBM(attributes=attributes, filters=filters, values=values,
>>>       mart=mart1)
>>>   ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
>>> 1           FBtr0079414     -1    1          7218909        7220029     7219112
>>> 2           FBtr0079414     -1    2          7218643        7218853          NA
>>>   5_utr_end 3_utr_start 3_utr_end cds_length
>>> 1   7220029          NA        NA        204
>>> 2        NA     7218643   7218853        204
>>>
>>> 2 exons: The 3' UTR (located on exon 2) spans the entire exon so no
>>> CDS on this exon. The start of the 5' UTR (located on exon 1) is 203
>>> bases upstream of the exon start. But the reported cds_length is 204.
>>> Something looks wrong.
>>>
>>> For other transcripts, e.g. FBtr0300689 (plus strand), things
>>> look OK:
>>>
>>> getBM(attributes=attributes, filters=filters, values=values,
>>>       mart=mart1)
>>>   ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
>>> 1           FBtr0300689      1    1             7529           8116        7529
>>> 2           FBtr0300689      1    2             8193           9484          NA
>>>   5_utr_end 3_utr_start 3_utr_end cds_length
>>> 1      7679          NA        NA        855
>>> 2        NA        8611      9484        855
>>>
>>> 2 exons: The end of the 5' UTR (located on exon 1) is 437 bases
>>> upstream of the exon end. The start of the 3' UTR (located on exon 2)
>>> is 418 bases downstream of the exon start. So the CDS total length is
>>> 437 + 418 = 855, as reported.
>>>
>>> @Malcolm and other makeTranscriptDbFromBiomart() users: I'm about to
>>> commit a patch to this function so that this anomaly in the Ensembl
>>> data causes a warning instead of an error. Also the warning will
>>> display the first 6 affected transcripts. The patch will make it into
>>> GenomicFeatures 1.6.8 (release) and 1.7.14 (devel), which will become
>>> available via biocLite() in the next 24-36 hours.
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> On 02/06/2012 02:18 PM, Hervé Pagès wrote:
>>> Hi Rhoda and others,
>>>
>>> I still need to check that this error issued by internal helper
>>> .extractCdsRangesFromBiomartTable() about "the cds cumulative
>>> length inferred from the exon and UTR not matching the cds_length
>>> attribute from BioMart" is not a FALSE positive.
>>>
>>> I'm planning to patch the code in charge of this sanity check
>>> so it issues a warning instead of an error and it displays
>>> something more useful than just "for some transcripts etc...".
>>> Would be nice to know at least for which transcript.
>>>
>>> I'll keep you informed, thanks!
>>> H.
>>>
>>>
>>> On 02/06/2012 12:53 AM, Rhoda Kinsella wrote:
>>> Hi Malcolm and Marc,
>>> Please submit an Ensembl helpdesk ticket about this issue along
>>> with a detailed example to (helpdesk at ensembl.org) and we will
>>> look into it.
>>> Kind regards
>>> Rhoda
>>>
>>>
>>> On 3 Feb 2012, at 20:32, Cook, Malcolm wrote:
>>>
>>> Hi Marc, and other `library(GenomicFeatures)` users working in fly,
>>>
>>> I just changed Subject to keep alive one of the issues I still have,
>>> namely:
>>>
>>> I get the following error:
>>>
>>> library(GenomicFeatures)
>>> txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
>>>           dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
>>> Download and preprocess the 'transcripts' data frame ... OK
>>> Download and preprocess the 'chrominfo' data frame ... OK
>>> Download and preprocess the 'splicings' data frame ... Error
>>> in .extractCdsRangesFromBiomartTable(bm_table) :
>>> BioMart data anomaly: for some transcripts, the cds cumulative
>>> length inferred from the exon and UTR info doesn't match the
>>> "cds_length" attribute from BioMart
>>>
>>>
>>> Marc, you already observed that:
>>>
>>> the data for cds ranges and total cds length (both from biomaRt) no
>>> longer agree with each other. In other words, the data from the
>>> current drosophila ranges in biomaRt seems to disagree with itself,
>>> and so the code is refusing to make a package out of this data as a
>>> result. To get the 2nd issue fixed probably involves talking to
>>> ensembl about their CDS data for fly to see if we can resolve the
>>> discrepancy. I would be happy to take this to them.
>>>
>>> I still wonder:
>>>
>>> Can you recommend a best way to get a more diagnostic trace from the
>>> attempt at txdb creation so we can correctly report to ensembl team
>>> the errant transcript(s) ?
>>>
>>> I would be happy to take this up with Ensembl team, but, need
>>> details which I don't know how to produce.
>>>
>>>
>>> Finally, on the side, here is a tiny suggestion:
>>>
>>> * change the default for circ_seqs in makeTranscriptDbFromBiomart
>>> to be NULL, instead of any organism (human) specific.
>>>
>>> Regards,
>>>
>>> --Malcolm
>>>
>>>
>>> R version 2.14.0 (2011-10-31)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0
>>> [4] GenomicRanges_1.6.6 IRanges_1.12.5
>>>
>>> loaded via a namespace (and not attached):
>>> [1] BSgenome_1.22.0 Biostrings_2.22.0 DBI_0.2-5 RCurl_1.9-5
>>> [5] RSQLite_0.11.1 XML_3.9-4 biomaRt_2.10.0 rtracklayer_1.14.4
>>> [9] tools_2.14.0 zlibbioc_1.0.0
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> Rhoda Kinsella Ph.D.
>>> Ensembl Production Project Leader,
>>> European Bioinformatics Institute (EMBL-EBI),
>>> Wellcome Trust Genome Campus,
>>> Hinxton
>>> Cambridge CB10 1SD,
>>> UK.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
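For anyone wanting to reproduce the arithmetic discussed above without querying BioMart, here is a minimal base R sketch. It assumes the coordinates quoted in this thread are typed in by hand, uses simplified column names of my own choosing, and assumes each UTR range lies entirely within its exon. It is not the code GenomicFeatures uses internally, which may well differ.

```r
# Coordinates typed in by hand from the getBM() output quoted in this thread
# (no BioMart query is made here). Column names are simplified for the sketch.
bm <- data.frame(
  tx_id      = rep(c("FBtr0079414", "FBtr0300689"), each = 2),
  rank       = c(1, 2, 1, 2),
  exon_start = c(7218909, 7218643, 7529, 8193),
  exon_end   = c(7220029, 7218853, 8116, 9484),
  utr5_start = c(7219112, NA, 7529, NA),
  utr5_end   = c(7220029, NA, 7679, NA),
  utr3_start = c(NA, 7218643, NA, 8611),
  utr3_end   = c(NA, 7218853, NA, 9484),
  cds_length = c(204, 204, 855, 855),
  stringsAsFactors = FALSE
)

# Width of a 1-based closed interval, or 0 when the interval is absent (NA).
width_or_zero <- function(start, end) ifelse(is.na(start), 0, end - start + 1)

# Assumption: every UTR range lies entirely inside its exon, so the CDS part
# of an exon is the exon width minus the widths of its 5' and 3' UTR parts.
cds_width <- (bm$exon_end - bm$exon_start + 1) -
  width_or_zero(bm$utr5_start, bm$utr5_end) -
  width_or_zero(bm$utr3_start, bm$utr3_end)

# Sum per transcript and compare with the reported cds_length attribute.
inferred <- vapply(split(cds_width, bm$tx_id), sum, numeric(1))
reported <- vapply(split(bm$cds_length, bm$tx_id), function(x) x[1], numeric(1))

print(rbind(inferred, reported))
mismatch <- names(inferred)[inferred != reported]
print(mismatch)  # "FBtr0079414"
```

With these inputs the inferred totals are 203 for FBtr0079414 (reported 204) and 855 for FBtr0300689 (reported 855), matching the figures given in the messages above.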
Malcolm Cook
@malcolm-cook-6293
Herve,

I'm following up on this by bringing you into an exchange with the Ensembl member handling dmel. I hope with your help they can completely address the issue.

Thanks,

~Malcolm

> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Tuesday, March 13, 2012 3:32 PM
> To: Cook, Malcolm
> Cc: Rhoda Kinsella; bioconductor at r-project.org
> Subject: Re: [Hinxton #251937] RE: [BioC]
> GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly:
> for some transcripts, the cds cumulative length inferred from the exon and
> UTR info doesn't match the "cds_length" attribute from BioMart
>
> Hi Malcolm, Rhoda,
>
> Did you hear back from the Ensembl helpdesk about this issue?
>
> AFAICT the issue is still in Ensembl release 66 (released 10 days
> ago). For example, when querying directly the Ensembl Mart, I get
> the following for transcript FBtr0079414 (dmelanogaster):
>
> Exon Rank in Transcript | Chromosome Name | Strand
> 1 | 2L | -1
> 2 | 2L | -1
>
> Exon Chr Start (bp) | Exon Chr End (bp)
> 7218909 | 7220029
> 7218643 | 7218853
>
> 5' UTR Start | 5' UTR End | 3' UTR Start | 3' UTR End
> 7219112 | 7220029 | |
> | | 7218643 | 7218853
>
> CDS Start | CDS End | CDS Length
> 1 | 203 | 204
> 204 | 204 | 204
>
> Note that querying directly the Ensembl Mart thru the web interface
> allows me to choose database Ensembl Genes 66 but querying with the
> Bioconductor biomaRt package is still accessing Ensembl Genes 65,
> I wonder why, but this is a different story...
>
> So the "CDS Length" column (which, IIUC, is actually supposed to
> report the "Total CDS Length") is still incompatible with the
> exon/UTR starts and ends. If the exon/UTR starts and ends
> are correct then the total CDS length should be 203, not 204.
>
> But also, it could be that the exon/UTR starts and ends are
> incorrect.
>
> Finally note that there is no CDS region on exon 2 (the 3' UTR
> entirely spans exon 2) but the Ensembl Mart reports a CDS region
> of length 1 on this exon (CDS Start = CDS End = 204). This is
> probably why then the reported CDS Length is 204 (at least it's
> consistent with the highest "CDS End" value).
>
> Would be nice to see this dataset fixed.
>
> Thanks,
> H.
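The change Hervé announced earlier in this thread (a warning instead of an error, naming the first 6 affected transcripts) could look roughly like the sketch below. To be clear, this is not the GenomicFeatures source: `check_cds_lengths()` and its arguments are hypothetical names made up for illustration, and the two input vectors are filled in by hand with the totals worked out in the messages above.

```r
# Hypothetical helper, NOT part of GenomicFeatures.
# 'inferred' and 'reported' are named numeric vectors keyed by transcript id.
check_cds_lengths <- function(inferred, reported, max_shown = 6L) {
  stopifnot(identical(names(inferred), names(reported)))
  bad <- names(inferred)[inferred != reported]
  if (length(bad) > 0L) {
    shown <- head(bad, max_shown)
    # Warn (rather than stop) and name at most 'max_shown' transcripts.
    warning("BioMart data anomaly: for ", length(bad), " transcript(s), ",
            "the cds cumulative length inferred from the exon and UTR info ",
            "doesn't match the \"cds_length\" attribute from BioMart: ",
            paste(shown, collapse = ", "),
            if (length(bad) > max_shown) ", ..." else "",
            call. = FALSE)
  }
  # Return all affected ids invisibly so callers can inspect them.
  invisible(bad)
}

# Totals taken from the two transcripts discussed in this thread.
inferred <- c(FBtr0079414 = 203, FBtr0300689 = 855)
reported <- c(FBtr0079414 = 204, FBtr0300689 = 855)

# Emits a warning and still returns, so processing can continue.
bad <- check_cds_lengths(inferred, reported)
print(bad)
```

Returning the affected ids also addresses the earlier request for something more diagnostic than "for some transcripts": the caller gets the full list, while the warning text stays short.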
@rhoda-kinsella-3200
Hi Hervé and Malcolm, I have contacted the Ensembl genomes team who produce this database and have asked them to respond to you with an update. Regards Rhoda On 13 Mar 2012, at 20:31, Hervé Pagès wrote: > Hi Malcolm, Rhoda, > > Did you hear back from the Ensembl helpdesk about this issue? > > AFAICT the issue is still in Ensembl release 66 (released 10 days > ago). For example, when querying directly the Ensembl Mart, I get > the following for transcript FBtr0079414 (dmelanogaster): > > Exon Rank in Transcript | Chromosome Name | Strand > 1 | 2L | -1 > 2 | 2L | -1 > > Exon Chr Start (bp) | Exon Chr End (bp) > 7218909 | 7220029 > 7218643 | 7218853 > > 5' UTR Start | 5' UTR End | 3' UTR Start | 3' UTR End > 7219112 | 7220029 | | > | | 7218643 | 7218853 > > CDS Start | CDS End | CDS Length > 1 | 203 | 204 > 204 | 204 | 204 > > Note that querying directly the Ensembl Mart thru the web interface > allows me to choose database Ensembl Genes 66 but querying with the > Bioconductor biomaRt package is still accessing Ensembl Genes 65, > I wonder why, but this is a different story... > > So the "CDS Length" column (which, IIUC, is actually supposed to > report the "Total CDS Length") is still incompatible with the > exon/UTR starts and ends. If the exon/UTR starts and ends > are correct then the total CDS length should be 203, not 204. > > But also, it could be that the exon/UTR starts and ends are > incorrect. > > Finally note that there is no CDS region on exon 2 (the 3' UTR > entirely spans exon 2) but the Ensembl Mart reports a CDS region > of length 1 on this exon (CDS Start = CDS End = 204). This is > probably why then the reported CDS Length is 204 (at least it's > consistent with the highest "CDS End" value). > > Would be nice to see this dataset fixed. > > Thanks, > H. > > > On 02/15/2012 06:33 AM, Cook, Malcolm wrote: >> Dear helpdesk@ensemblgenomes.org, >> >> I am following up on this issue which I understand Rhoda Kinsella >> at EBI to have forwarded to you. 
>> >> I originally identified and reported the issue, first to the >> bioconductor email list where Rhoda picked up on it and replied as >> below. >> >> I am trying to ensure that there is a tracked issue with >> ensemblgenomes.org with my name on it – not that it has to be >> resolved with a fix, just I'd like to be assured I know as you >> resolve it. >> >> If there is anything further I can provide pertaining to describing >> or resolving the issue, please advise. >> >> Of course the issue may be in fact even further upstream – in >> flybase. I've not tried to find the root cause myself. >> >> Thanks, >> >> Malcolm Cook >> >> >> From: Rhoda Kinsella<rhoda@ebi.ac.uk<mailto:rhoda@ebi.ac.uk>> >> Date: Wed, 8 Feb 2012 10:27:02 -0600 >> To: Malcolm Cook<mec@stowers.org<mailto:mec@stowers.org>> >> Cc: Hervé Pagès<hpages@fhcrc.org<mailto:hpages@fhcrc.org>>, "bioconductor@r-project.org >> <mailto:bioconductor@r-project.org>"<bioconductor@r-project.org<mai lto:bioconductor@r-project.org="">> >> >> Subject: Re: [Hinxton #251937] RE: [BioC] >> GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data >> anomaly: for some transcripts, the cds cumulative length inferred >> from the exon and UTR info doesn't match the "cds_length" attribute >> from BioMart >> >> Hi Malcolm and Hervé >> This appears to be a data issue with the Drosophila core database >> which was then propagated into BioMart. I have forwarded the issue >> to the Ensembl Genomes project as they maintain this database and >> they will respond as soon as possible. >> Regards >> Rhoda >> >> >> On 7 Feb 2012, at 21:35, Cook, Malcolm wrote: >> >> Herve, Thanks so much for digging into this. >> >> Rhonda, I had submitted a ticket as suggested to Ensembl helpdesk, >> and have included them as recipients to this message (after >> changing the subject to include the issue tracker number). 
>> >> Ensembl helpdesk, I expect that Herve's detailed report, below, >> provides an example of the reported data anomaly that will help >> resolve the underlying issue. >> >> Cheers, >> >> ~Malcolm >> >> >> -----Original Message----- >> From: Hervé Pagès [mailto:hpages@fhcrc.org] >> Sent: Tuesday, February 07, 2012 2:37 PM >> To: Rhoda Kinsella; bioconductor@r-project.org<mailto:bioconductor@r-project.org>> > >> Cc: Cook, Malcolm >> Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromBiomart - >> BioMart data anomaly: for some transcripts, the cds cumulative length >> inferred from the exon and UTR info doesn't match the "cds_length" >> attribute from BioMart >> >> Hi Rhoda, Malcolm, and others, >> >> So after taking a closer look at this, I can confirm that the >> reported >> "cds_length" looks wrong for some Fly transcripts. Take for example >> the FBtr0079414 transcript (minus strand): >> >> library(biomaRt) >> mart1<- useMart(biomart="ensembl", >> dataset="dmelanogaster_gene_ensembl") >> attributes<- c("ensembl_transcript_id", "strand", >> + "rank", "exon_chrom_start", "exon_chrom_end", >> + "5_utr_start", "5_utr_end", "3_utr_start", >> "3_utr_end", >> + "cds_length") >> filters<- "ensembl_transcript_id" >> values<- "FBtr0079414" >> getBM(attributes=attributes, filters=filters, values=values, >> mart=mart1) >> ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end >> 5_utr_start >> 1 FBtr0079414 -1 1 7218909 7220029 >> 7219112 >> 2 FBtr0079414 -1 2 7218643 7218853 >> NA >> 5_utr_end 3_utr_start 3_utr_end cds_length >> 1 7220029 NA NA 204 >> 2 NA 7218643 7218853 204 >> >> 2 exons: The 3' UTR (located on exon 2) spans the entire exon so no >> CDS on this exon. The start of the 5' UTR (located on exon 1) is 203 >> bases upstream of the exon start. But the reported cds_length is 204. >> Something looks wrong. >> >> For other transcripts, e.g. 
FBtr0300689 (plus strand), things look >> OK: >> >> getBM(attributes=attributes, filters=filters, values=values, >> mart=mart1) >> ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end >> 5_utr_start >> 1 FBtr0300689 1 1 7529 8116 >> 7529 >> 2 FBtr0300689 1 2 8193 9484 >> NA >> 5_utr_end 3_utr_start 3_utr_end cds_length >> 1 7679 NA NA 855 >> 2 NA 8611 9484 855 >> >> 2 exons: The end of the 5' UTR (located on exon 1) is 437 bases >> upstream of the exon end. The start of the 3' UTR (located on exon 2) >> is 418 bases downstream of the exon start. So the CDS total length is >> 437 + 418 = 855, as reported. >> >> @Malcolm and other makeTranscriptDbFromBiomart() users: I'm about to >> commit a patch to this function so that this anomaly in the Ensembl >> data causes a warning instead of an error. Also the warning will >> display the first 6 affected transcripts. The patch will make it into >> GenomicFeatures 1.6.8 (release) and 1.7.14 (devel), which will become >> available via biocLite() in the next 24-36 hours. >> >> Cheers, >> H. >> >> >> On 02/06/2012 02:18 PM, Hervé Pagès wrote: >> Hi Rhoda and others, >> >> I still need to check that this error issued by internal helper >> .extractCdsRangesFromBiomartTable() about "the cds cumulative >> length inferred from the exon and UTR not matching the cds_length >> attribute from BioMart" is not a FALSE positive. >> >> I'm planning to patch the code in charge of this sanity check >> so it issues a warning instead of an error and it displays >> something more useful than just "for some transcripts etc...". >> Would be nice to know at least for which transcript. >> >> I'll keep you informed, thanks! >> H. >> >> >> On 02/06/2012 12:53 AM, Rhoda Kinsella wrote: >> Hi Malcolm and Marc, >> Please submit an Ensembl helpdesk ticket about this issue along >> with a >> detailed example to >> (helpdesk@ensembl.org<mailto:helpdesk@ensembl.org>) and we will >> look into it. 
>> Kind regards >> Rhoda >> >> >> On 3 Feb 2012, at 20:32, Cook, Malcolm wrote: >> >> Hi Marc, and other `library(GenomicFeatures)` users working in fly, >> >> I just changed Subject to keep alive one of the issues I still have, >> namely: >> >> I get the following error: >> >> library(GenomicFeatures) >> txdb<-makeTranscriptDbFromBiomart(biomart="ensembl", >> dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)) >> Download and preprocess the 'transcripts' data frame ... OK >> Download and preprocess the 'chrominfo' data frame ... OK >> Download and preprocess the 'splicings' data frame ... Error >> in .extractCdsRangesFromBiomartTable(bm_table) : >> BioMart data anomaly: for some transcripts, the cds cumulative >> length inferred from the exon and UTR info doesn't match the >> "cds_length" attribute from BioMart >> >> >> Marc, you already observed that: >> >> the data for cds ranges and total cds length (both from biomaRt) no >> longer agree with each other. In other words, the data from the >> current >> drosophila ranges in biomaRt seems to disagree with itself, and >> so the >> code is refusing to make a package out of this data as a result. >> To get the 2nd issue fixed probably involves talking to ensembl >> about >> their CDS data for fly to see if we can resolve the discrepancy. >> I would be happy to take this to them. >> >> I still wonder: >> >> Can you recommend a best way to get a more diagnostic trace from the >> attempt at txdb creation so we can correctly report to ensembl team >> the >> errant transcript(s) ? >> >> I would be happy to take this up with Ensembl team, but, need >> details which I don't know how to produce. >> >> >> Finally, one the side, here is a tiny suggestion: >> >> * change the default for circ_seqs in makeTranscriptDbFromBiomart >> to be NULL, instead of any organism (human) specific. 
>>
>> Regards,
>>
>> --Malcolm
>>
>>
>> R version 2.14.0 (2011-10-31)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0
>> [4] GenomicRanges_1.6.6 IRanges_1.12.5
>>
>> loaded via a namespace (and not attached):
>> [1] BSgenome_1.22.0 Biostrings_2.22.0 DBI_0.2-5 RCurl_1.9-5
>> [5] RSQLite_0.11.1 XML_3.9-4 biomaRt_2.10.0 rtracklayer_1.14.4
>> [9] tools_2.14.0 zlibbioc_1.0.0
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> Rhoda Kinsella Ph.D.
>> Ensembl Production Project Leader,
>> European Bioinformatics Institute (EMBL-EBI),
>> Wellcome Trust Genome Campus,
>> Hinxton
>> Cambridge CB10 1SD,
>> UK.
>>
>>
>> [[alternative HTML version deleted]]
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages@fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages@fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319

Rhoda Kinsella Ph.D.
Ensembl Production Project Leader,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.

[[alternative HTML version deleted]]
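
For readers who want to reproduce the arithmetic discussed in this thread, here is a minimal base-R sketch of the consistency check: per exon, CDS width = exon width minus any 5' UTR and 3' UTR width on that exon, summed per transcript and compared with the reported cds_length. It is not the actual code of .extractCdsRangesFromBiomartTable(). The column names (tx_id, exon_start, utr5_start, ...) and the helper range_width() are hypothetical stand-ins for the BioMart attributes, and the coordinates are the two transcripts quoted in this thread (FBtr0300689 and FBtr0079414) typed in by hand rather than fetched with getBM().

```r
# Hand-entered rows for the two transcripts quoted in the thread.
# Column names are illustrative, not the real BioMart attribute names.
splicings <- data.frame(
  tx_id      = c("FBtr0300689", "FBtr0300689", "FBtr0079414", "FBtr0079414"),
  rank       = c(1L, 2L, 1L, 2L),
  exon_start = c(7529L, 8193L, 7218909L, 7218643L),
  exon_end   = c(8116L, 9484L, 7220029L, 7218853L),
  utr5_start = c(7529L, NA, 7219112L, NA),
  utr5_end   = c(7679L, NA, 7220029L, NA),
  utr3_start = c(NA, 8611L, NA, 7218643L),
  utr3_end   = c(NA, 9484L, NA, 7218853L),
  cds_length = c(855L, 855L, 204L, 204L),
  stringsAsFactors = FALSE
)

# Width of a closed interval; a missing (NA) range counts as width 0.
range_width <- function(start, end) {
  ifelse(is.na(start) | is.na(end), 0L, end - start + 1L)
}

# CDS bases on each exon = exon width minus UTR widths on that exon.
cds_per_exon <- with(splicings,
  range_width(exon_start, exon_end) -
    range_width(utr5_start, utr5_end) -
    range_width(utr3_start, utr3_end))

# Sum per transcript and compare with the reported total CDS length.
inferred <- vapply(split(cds_per_exon, splicings$tx_id), sum, integer(1))
reported <- vapply(split(splicings$cds_length, splicings$tx_id),
                   function(x) x[1], integer(1))
mismatched <- names(inferred)[inferred != reported]

# Warn (rather than stop) and name at most the first 6 affected transcripts.
if (length(mismatched) > 0L) {
  warning("inferred CDS length differs from cds_length for: ",
          paste(head(mismatched, 6), collapse = ", "))
}
```

With these inputs the inferred length for FBtr0300689 is 437 + 418 = 855, matching cds_length, while FBtr0079414 comes out at 203 against a reported 204, so only the latter is flagged. Subtracting widths works the same way on both strands because it never depends on which end of the exon the UTR sits on.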