Herve, Thanks so much for digging into this.
Rhoda, I had submitted a ticket as suggested to Ensembl helpdesk, and
have included them as recipients to this message (after changing the
subject to include the issue tracker number).
Ensembl helpdesk, I expect that Herve's detailed report, below,
provides an example of the reported data anomaly that will help
resolve the underlying issue.
Cheers,
~Malcolm
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Tuesday, February 07, 2012 2:37 PM
> To: Rhoda Kinsella; bioconductor at r-project.org
> Cc: Cook, Malcolm
> Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromBiomart -
> BioMart data anomaly: for some transcripts, the cds cumulative length
> inferred from the exon and UTR info doesn't match the "cds_length"
> attribute from BioMart
>
> Hi Rhoda, Malcolm, and others,
>
> So after taking a closer look at this, I can confirm that the reported
> "cds_length" looks wrong for some Fly transcripts. Take for example
> the FBtr0079414 transcript (minus strand):
>
> > library(biomaRt)
> > mart1 <- useMart(biomart="ensembl",
> +                  dataset="dmelanogaster_gene_ensembl")
> > attributes <- c("ensembl_transcript_id", "strand",
> +                 "rank", "exon_chrom_start", "exon_chrom_end",
> +                 "5_utr_start", "5_utr_end", "3_utr_start", "3_utr_end",
> +                 "cds_length")
> > filters <- "ensembl_transcript_id"
> > values <- "FBtr0079414"
> > getBM(attributes=attributes, filters=filters, values=values, mart=mart1)
>   ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
> 1           FBtr0079414     -1    1          7218909        7220029     7219112
> 2           FBtr0079414     -1    2          7218643        7218853          NA
>   5_utr_end 3_utr_start 3_utr_end cds_length
> 1   7220029          NA        NA        204
> 2        NA     7218643   7218853        204
>
> 2 exons: The 3' UTR (located on exon 2) spans the entire exon, so
> there is no CDS on this exon. The start of the 5' UTR (located on
> exon 1) is 203 bases from the exon start (7219112 - 7218909), so the
> CDS length inferred from the exon and UTR info is 203. But the
> reported cds_length is 204. Something looks wrong.
>
> For other transcripts, e.g. FBtr0300689 (plus strand), things look OK:
>
> > values <- "FBtr0300689"
> > getBM(attributes=attributes, filters=filters, values=values, mart=mart1)
>   ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
> 1           FBtr0300689      1    1             7529           8116        7529
> 2           FBtr0300689      1    2             8193           9484          NA
>   5_utr_end 3_utr_start 3_utr_end cds_length
> 1      7679          NA        NA        855
> 2        NA        8611      9484        855
>
> 2 exons: The end of the 5' UTR (located on exon 1) is 437 bases
> upstream of the exon end. The start of the 3' UTR (located on exon 2)
> is 418 bases downstream of the exon start. So the CDS total length is
> 437 + 418 = 855, as reported.
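
The arithmetic for both transcripts can be reproduced offline. This is a minimal sketch in base R, assuming the coordinates are hard-coded from the two getBM() outputs above. The column names utr5_*/utr3_* and the helper width_or_zero() are hypothetical names chosen for illustration; this is not the code used inside GenomicFeatures.

```r
# Exon and UTR coordinates copied from the two getBM() outputs above.
# The UTR columns are renamed (utr5_*, utr3_*) to get valid R names.
bm <- data.frame(
  ensembl_transcript_id = c("FBtr0079414", "FBtr0079414",
                            "FBtr0300689", "FBtr0300689"),
  exon_chrom_start = c(7218909, 7218643, 7529, 8193),
  exon_chrom_end   = c(7220029, 7218853, 8116, 9484),
  utr5_start = c(7219112, NA, 7529, NA),
  utr5_end   = c(7220029, NA, 7679, NA),
  utr3_start = c(NA, 7218643, NA, 8611),
  utr3_end   = c(NA, 7218853, NA, 9484),
  cds_length = c(204, 204, 855, 855),
  stringsAsFactors = FALSE
)

# Hypothetical helper: width of a range, or 0 when the range is absent (NA).
width_or_zero <- function(start, end) ifelse(is.na(start), 0, end - start + 1)

# Per-exon CDS width = exon width minus any 5' and 3' UTR on that exon.
# Widths in genomic coordinates do not depend on the strand.
exon_cds_width <- with(bm,
  (exon_chrom_end - exon_chrom_start + 1) -
    width_or_zero(utr5_start, utr5_end) -
    width_or_zero(utr3_start, utr3_end))

# Sum per transcript and compare with the reported cds_length.
inferred <- tapply(exon_cds_width, bm$ensembl_transcript_id, sum)
reported <- tapply(bm$cds_length, bm$ensembl_transcript_id, unique)

check <- data.frame(transcript = names(inferred),
                    inferred   = as.vector(inferred),
                    reported   = as.vector(reported),
                    stringsAsFactors = FALSE)
check$match <- check$inferred == check$reported
print(check)
```

With these inputs the inferred length is 203 for FBtr0079414 (reported 204, a mismatch) and 855 for FBtr0300689 (reported 855, a match).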
>
> @Malcolm and other makeTranscriptDbFromBiomart() users: I'm about to
> commit a patch to this function so that this anomaly in the Ensembl
> data causes a warning instead of an error. Also the warning will
> display the first 6 affected transcripts. The patch will make it into
> GenomicFeatures 1.6.8 (release) and 1.7.14 (devel), which will become
> available via biocLite() in the next 24-36 hours.
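
The behaviour described above (a warning rather than an error, naming the first 6 affected transcripts) could look roughly like the following. This is only a sketch under stated assumptions: warn_on_cds_mismatch() is a hypothetical function, its inputs are assumed to be named numeric vectors keyed by transcript id, and the message text is made up. It is not the actual patch to GenomicFeatures.

```r
# Hypothetical stand-in for the sanity check; NOT the GenomicFeatures code.
# 'inferred' and 'reported' are named numeric vectors of CDS lengths,
# with transcript ids as names.
warn_on_cds_mismatch <- function(inferred, reported) {
  ids <- names(inferred)
  bad <- ids[inferred != reported[ids]]
  if (length(bad) > 0L) {
    shown  <- head(bad, 6L)                       # at most 6 ids in the message
    suffix <- if (length(bad) > 6L) ", ..." else ""
    warning("BioMart data anomaly: for ", length(bad), " transcript(s), ",
            "the inferred cds length doesn't match \"cds_length\": ",
            paste(shown, collapse = ", "), suffix,
            call. = FALSE)
  }
  invisible(bad)                                  # processing can continue
}

# Values taken from the two transcripts discussed above.
inferred <- c(FBtr0079414 = 203, FBtr0300689 = 855)
reported <- c(FBtr0079414 = 204, FBtr0300689 = 855)
bad <- suppressWarnings(warn_on_cds_mismatch(inferred, reported))
print(bad)
```

Returning the affected ids invisibly lets a caller keep going and still collect the full list to include in a report.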
>
> Cheers,
> H.
>
>
> On 02/06/2012 02:18 PM, Hervé Pagès wrote:
> > Hi Rhoda and others,
> >
> > I still need to check that this error issued by internal helper
> > .extractCdsRangesFromBiomartTable() about "the cds cumulative
> > length inferred from the exon and UTR not matching the cds_length
> > attribute from BioMart" is not a FALSE positive.
> >
> > I'm planning to patch the code in charge of this sanity check
> > so it issues a warning instead of an error and it displays
> > something more useful than just "for some transcripts etc...".
> > Would be nice to know at least for which transcript.
> >
> > I'll keep you informed, thanks!
> > H.
> >
> >
> > On 02/06/2012 12:53 AM, Rhoda Kinsella wrote:
> >> Hi Malcolm and Marc,
> >> Please submit an Ensembl helpdesk ticket about this issue, along with
> >> a detailed example, to helpdesk at ensembl.org and we will look into it.
> >> Kind regards
> >> Rhoda
> >>
> >>
> >> On 3 Feb 2012, at 20:32, Cook, Malcolm wrote:
> >>
> >>> Hi Marc, and other `library(GenomicFeatures)` users working in fly,
> >>>
> >>> I just changed Subject to keep alive one of the issues I still have,
> >>> namely:
> >>>
> >>> I get the following error:
> >>>
> >>>> library(GenomicFeatures)
> >>>> txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
> >>>> dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
> >>> Download and preprocess the 'transcripts' data frame ... OK
> >>> Download and preprocess the 'chrominfo' data frame ... OK
> >>> Download and preprocess the 'splicings' data frame ... Error
> >>> in .extractCdsRangesFromBiomartTable(bm_table) :
> >>> BioMart data anomaly: for some transcripts, the cds cumulative
> >>> length inferred from the exon and UTR info doesn't match the
> >>> "cds_length" attribute from BioMart
> >>>
> >>>
> >>> Marc, you already observed that:
> >>>
> >>>>>> the data for cds ranges and total cds length (both from biomaRt) no
> >>>>>> longer agree with each other. In other words, the data from the
> >>>>>> current drosophila ranges in biomaRt seems to disagree with itself,
> >>>>>> and so the code is refusing to make a package out of this data as a
> >>>>>> result. To get the 2nd issue fixed probably involves talking to
> >>>>>> ensembl about their CDS data for fly to see if we can resolve the
> >>>>>> discrepancy.
> >>>>> I would be happy to take this to them.
> >>>
> >>> I still wonder:
> >>>
> >>>> Can you recommend a best way to get a more diagnostic trace from the
> >>>> attempt at txdb creation so we can correctly report to ensembl team
> >>>> the errant transcript(s)?
> >>>
> >>> I would be happy to take this up with Ensembl team, but, need
> >>> details which I don't know how to produce.
> >>>
> >>>
> >>> Finally, on the side, here is a tiny suggestion:
> >>>
> >>> * change the default for circ_seqs in makeTranscriptDbFromBiomart
> >>> to be NULL, instead of an organism-specific (human) value.
> >>>
> >>> Regards,
> >>>
> >>> --Malcolm
> >>>
> >>>
> >>> R version 2.14.0 (2011-10-31)
> >>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
> >>>
> >>> locale:
> >>> [1] C
> >>>
> >>> attached base packages:
> >>> [1] stats graphics grDevices utils datasets methods base
> >>>
> >>> other attached packages:
> >>> [1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0
> >>> [4] GenomicRanges_1.6.6 IRanges_1.12.5
> >>>
> >>> loaded via a namespace (and not attached):
> >>> [1] BSgenome_1.22.0    Biostrings_2.22.0  DBI_0.2-5          RCurl_1.9-5
> >>> [5] RSQLite_0.11.1     XML_3.9-4          biomaRt_2.10.0     rtracklayer_1.14.4
> >>> [9] tools_2.14.0       zlibbioc_1.0.0
> >>>>
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >> Rhoda Kinsella Ph.D.
> >> Ensembl Production Project Leader,
> >> European Bioinformatics Institute (EMBL-EBI),
> >> Wellcome Trust Genome Campus,
> >> Hinxton
> >> Cambridge CB10 1SD,
> >> UK.
> >>
> >>
> >
> >
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319