Dear Ensembl (genomes?) Help
Can I please get an update on where this issue stands?
The last update I heard was from Karyn Megy, who wrote:
I should be able to correct it for the next EnsemblGenome release
(EG13, planned for mid/end-March), otherwise it'll be for EG14 (in
May).
Can I get a confirmation that Ensembl does indeed have the plan to
address these data problems and whether the May timeframe is likely?
Thanks! If there is anything I can do to assist further please
advise.....
~Malcolm
From: Rhoda Kinsella [mailto:rhoda@ebi.ac.uk]
Sent: Monday, March 19, 2012 9:17 AM
To: Hervé Pagès
Cc: Cook, Malcolm; bioconductor@r-project.org
Subject: Re: [Hinxton #251937] RE: [BioC]
GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly:
for some transcripts, the cds cumulative length inferred from the exon
and UTR info doesn't match the "cds_length" attribute from BioMart
Hi Hervé and Malcolm,
I have contacted the Ensembl genomes team who produce this database
and have asked them to respond to you with an update.
Regards
Rhoda
On 13 Mar 2012, at 20:31, Hervé Pagès wrote:
Hi Malcolm, Rhoda,
Did you hear back from the Ensembl helpdesk about this issue?
AFAICT the issue is still in Ensembl release 66 (released 10 days
ago). For example, when querying directly the Ensembl Mart, I get
the following for transcript FBtr0079414 (dmelanogaster):
Exon Rank | Chromosome Name | Strand | Exon Chr Start (bp) | Exon Chr End (bp)
1         | 2L              | -1     | 7218909             | 7220029
2         | 2L              | -1     | 7218643             | 7218853
Exon Rank | 5' UTR Start | 5' UTR End | 3' UTR Start | 3' UTR End
1         | 7219112      | 7220029    |              |
2         |              |            | 7218643      | 7218853
Exon Rank | CDS Start | CDS End | CDS Length
1         | 1         | 203     | 204
2         | 204       | 204     | 204
Note that querying directly the Ensembl Mart thru the web interface
allows me to choose database Ensembl Genes 66 but querying with the
Bioconductor biomaRt package is still accessing Ensembl Genes 65,
I wonder why, but this is a different story...
So the "CDS Length" column (which, IIUC, is actually supposed to
report the "Total CDS Length") is still incompatible with the
exon/UTR starts and ends. If the exon/UTR starts and ends
are correct then the total CDS length should be 203, not 204.
But also, it could be that the exon/UTR starts and ends are
incorrect.
Finally note that there is no CDS region on exon 2 (the 3' UTR
entirely spans exon 2) but the Ensembl Mart reports a CDS region
of length 1 on this exon (CDS Start = CDS End = 204). This is
probably why then the reported CDS Length is 204 (at least it's
consistent with the highest "CDS End" value).
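The exon 2 oddity can be flagged mechanically. The base R sketch below is only an illustration: the data frame is typed in by hand from the web query result above, and the column names and the covers() helper are made up for this example (they are not BioMart attribute names). It assumes 1-based inclusive coordinates.

```r
# Coordinates copied by hand from the Ensembl Mart result for FBtr0079414.
exons <- data.frame(
  rank       = c(1, 2),
  exon_start = c(7218909, 7218643),
  exon_end   = c(7220029, 7218853),
  utr5_start = c(7219112, NA),
  utr5_end   = c(7220029, NA),
  utr3_start = c(NA, 7218643),
  utr3_end   = c(NA, 7218853),
  cds_start  = c(1, 204),
  cds_end    = c(203, 204)
)

# Hypothetical helper: TRUE where a UTR range covers the exon end to end.
# A missing (NA) UTR range never covers anything.
covers <- function(utr_start, utr_end, exon_start, exon_end) {
  !is.na(utr_start) & utr_start == exon_start & utr_end == exon_end
}

utr_only <- with(exons,
  covers(utr5_start, utr5_end, exon_start, exon_end) |
  covers(utr3_start, utr3_end, exon_start, exon_end))

# An exon that is entirely UTR should not carry CDS coordinates.
has_cds <- !is.na(exons$cds_start)
suspicious <- exons$rank[utr_only & has_cds]
print(suspicious)
```

On this input the only exon flagged is rank 2, the one that is fully covered by the 3' UTR yet still has CDS Start = CDS End = 204.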
Would be nice to see this dataset fixed.
Thanks,
H.
On 02/15/2012 06:33 AM, Cook, Malcolm wrote:
Dear helpdesk@ensemblgenomes.org,
I am following up on this issue which I understand Rhoda Kinsella at
EBI to have forwarded to you.
I originally identified and reported the issue, first to the
bioconductor email list where Rhoda picked up on it and replied as
below.
I am trying to ensure that there is a tracked issue with
ensemblgenomes.org with my name on it. It does not have to be resolved
with a fix; I would just like to be kept informed of how you resolve it.
If there is anything further I can provide pertaining to describing or
resolving the issue, please advise.
Of course the issue may be in fact even further upstream - in flybase.
I've not tried to find the root cause myself.
Thanks,
Malcolm Cook
From: Rhoda Kinsella <rhoda@ebi.ac.uk>
Date: Wed, 8 Feb 2012 10:27:02 -0600
To: Malcolm Cook <mec@stowers.org>
Cc: Hervé Pagès <hpages@fhcrc.org>, bioconductor@r-project.org
Subject: Re: [Hinxton #251937] RE: [BioC]
GenomicFeatures::makeTranscriptDbFromBiomart - BioMart data anomaly:
for some transcripts, the cds cumulative length inferred from the exon
and UTR info doesn't match the "cds_length" attribute from BioMart
Hi Malcolm and Hervé
This appears to be a data issue with the Drosophila core database
which was then propagated into BioMart. I have forwarded the issue to
the Ensembl Genomes project as they maintain this database and they
will respond as soon as possible.
Regards
Rhoda
On 7 Feb 2012, at 21:35, Cook, Malcolm wrote:
Herve, Thanks so much for digging into this.
Rhoda, I had submitted a ticket as suggested to Ensembl helpdesk, and
have included them as recipients to this message (after changing the
subject to include the issue tracker number).
Ensembl helpdesk, I expect that Herve's detailed report, below,
provides an example of the reported data anomaly that will help
resolve the underlying issue.
Cheers,
~Malcolm
-----Original Message-----
From: Hervé Pagès [mailto:hpages@fhcrc.org]
Sent: Tuesday, February 07, 2012 2:37 PM
To: Rhoda Kinsella; bioconductor@r-project.org
Cc: Cook, Malcolm
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromBiomart -
BioMart data anomaly: for some transcripts, the cds cumulative length
inferred from the exon and UTR info doesn't match the "cds_length"
attribute from BioMart
Hi Rhoda, Malcolm, and others,
So after taking a closer look at this, I can confirm that the reported
"cds_length" looks wrong for some Fly transcripts. Take for example
the FBtr0079414 transcript (minus strand):
library(biomaRt)
# Connect to the Ensembl Mart, fly dataset
mart1 <- useMart(biomart="ensembl",
                 dataset="dmelanogaster_gene_ensembl")
# Exon and UTR coordinates plus the reported total CDS length
attributes <- c("ensembl_transcript_id", "strand",
                "rank", "exon_chrom_start", "exon_chrom_end",
                "5_utr_start", "5_utr_end", "3_utr_start", "3_utr_end",
                "cds_length")
filters <- "ensembl_transcript_id"
values <- "FBtr0079414"
getBM(attributes=attributes, filters=filters, values=values,
      mart=mart1)
  ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
1           FBtr0079414     -1    1          7218909        7220029     7219112
2           FBtr0079414     -1    2          7218643        7218853          NA
  5_utr_end 3_utr_start 3_utr_end cds_length
1   7220029          NA        NA        204
2        NA     7218643   7218853        204
2 exons: The 3' UTR (located on exon 2) spans the entire exon so no
CDS on this exon. The start of the 5' UTR (located on exon 1) is 203
bases upstream of the exon start. But the reported cds_length is 204.
Something looks wrong.
For other transcripts, e.g. FBtr0300689 (plus strand), things look OK:
values <- "FBtr0300689"
getBM(attributes=attributes, filters=filters, values=values,
      mart=mart1)
  ensembl_transcript_id strand rank exon_chrom_start exon_chrom_end 5_utr_start
1           FBtr0300689      1    1             7529           8116        7529
2           FBtr0300689      1    2             8193           9484          NA
  5_utr_end 3_utr_start 3_utr_end cds_length
1      7679          NA        NA        855
2        NA        8611      9484        855
2 exons: The end of the 5' UTR (located on exon 1) is 437 bases
upstream of the exon end. The start of the 3' UTR (located on exon 2)
is 418 bases downstream of the exon start. So the CDS total length is
437 + 418 = 855, as reported.
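The arithmetic for both transcripts can be reproduced offline with a small base R sketch. The data frame is typed in by hand from the two getBM() results above (with column names shortened for the example), and inferred_cds_length() is a hypothetical helper written for this illustration, not the check GenomicFeatures performs internally. It assumes 1-based inclusive coordinates and that the CDS on an exon is whatever the UTRs leave uncovered.

```r
# Exon and UTR coordinates copied by hand from the getBM() output above.
bm <- data.frame(
  tx_id      = rep(c("FBtr0079414", "FBtr0300689"), each = 2),
  exon_start = c(7218909, 7218643, 7529, 8193),
  exon_end   = c(7220029, 7218853, 8116, 9484),
  utr5_start = c(7219112, NA, 7529, NA),
  utr5_end   = c(7220029, NA, 7679, NA),
  utr3_start = c(NA, 7218643, NA, 8611),
  utr3_end   = c(NA, 7218853, NA, 9484),
  cds_length = c(204, 204, 855, 855)
)

# Width of an inclusive range; a missing (NA) range counts as width 0.
range_width <- function(start, end) {
  w <- end - start + 1
  w[is.na(w)] <- 0
  w
}

# Hypothetical helper: per-transcript CDS length = exon bases minus UTR bases.
inferred_cds_length <- function(df) {
  cds_per_exon <- range_width(df$exon_start, df$exon_end) -
    range_width(df$utr5_start, df$utr5_end) -
    range_width(df$utr3_start, df$utr3_end)
  tapply(cds_per_exon, df$tx_id, sum)
}

inferred <- inferred_cds_length(bm)
reported <- tapply(bm$cds_length, bm$tx_id, unique)

# Side-by-side comparison, one row per transcript.
comparison <- data.frame(inferred = as.vector(inferred),
                         reported = as.vector(reported),
                         row.names = names(inferred))
print(comparison)
```

For FBtr0079414 the inferred length is 203 against a reported 204; for FBtr0300689 both are 855.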
@Malcolm and other makeTranscriptDbFromBiomart() users: I'm about to
commit a patch to this function so that this anomaly in the Ensembl
data causes a warning instead of an error. Also the warning will
display the first 6 affected transcripts. The patch will make it into
GenomicFeatures 1.6.8 (release) and 1.7.14 (devel), which will become
available via biocLite() in the next 24-36 hours.
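As a rough illustration of the behaviour described here (a warning that names the first few affected transcripts rather than a bare error), here is a base R sketch. check_cds_lengths() and its arguments are hypothetical names invented for this example; this is not the actual GenomicFeatures patch, and the message wording is only modelled on the original error.

```r
# Hypothetical check: 'inferred' and 'reported' are named numeric vectors of
# per-transcript CDS lengths, in the same order.
check_cds_lengths <- function(inferred, reported, max_shown = 6L) {
  bad <- names(inferred)[inferred != reported]
  if (length(bad) > 0L) {
    shown <- head(bad, max_shown)
    warning("BioMart data anomaly: for ", length(bad), " transcript(s), ",
            "the inferred cds length doesn't match \"cds_length\": ",
            paste(shown, collapse = ", "),
            if (length(bad) > max_shown) ", ..." else "",
            call. = FALSE)
  }
  # Return the affected transcript ids so the caller can inspect them.
  invisible(bad)
}

# Lengths taken from the two example transcripts in this message.
inferred <- c(FBtr0079414 = 203, FBtr0300689 = 855)
reported <- c(FBtr0079414 = 204, FBtr0300689 = 855)

# Catch the warning so the script keeps running and the message is visible.
bad <- withCallingHandlers(
  check_cds_lengths(inferred, reported),
  warning = function(w) {
    message("warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(bad)
```

Because the check only warns, processing can continue and the returned ids can be passed on to the data provider.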
Cheers,
H.
On 02/06/2012 02:18 PM, Hervé Pagès wrote:
Hi Rhoda and others,
I still need to check that this error issued by internal helper
.extractCdsRangesFromBiomartTable() about "the cds cumulative
length inferred from the exon and UTR not matching the cds_length
attribute from BioMart" is not a FALSE positive.
I'm planning to patch the code in charge of this sanity check
so it issues a warning instead of an error and it displays
something more useful than just "for some transcripts etc...".
Would be nice to know at least for which transcript.
I'll keep you informed, thanks!
H.
On 02/06/2012 12:53 AM, Rhoda Kinsella wrote:
Hi Malcolm and Marc,
Please submit an Ensembl helpdesk ticket about this issue along with a
detailed example to helpdesk@ensembl.org and we will look into it.
Kind regards
Rhoda
On 3 Feb 2012, at 20:32, Cook, Malcolm wrote:
Hi Marc, and other `library(GenomicFeatures)` users working in fly,
I just changed Subject to keep alive one of the issues I still have,
namely:
I get the following error:
library(GenomicFeatures)
txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
    dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... Error
in .extractCdsRangesFromBiomartTable(bm_table) :
BioMart data anomaly: for some transcripts, the cds cumulative
length inferred from the exon and UTR info doesn't match the
"cds_length" attribute from BioMart
Marc, you already observed that:
the data for cds ranges and total cds length (both from biomaRt) no
longer agree with each other. In other words, the data from the
current drosophila ranges in biomaRt seems to disagree with itself,
and so the code is refusing to make a package out of this data as a
result.
To get the 2nd issue fixed probably involves talking to ensembl about
their CDS data for fly to see if we can resolve the discrepancy.
I would be happy to take this to them.
I still wonder:
Can you recommend a best way to get a more diagnostic trace from the
attempt at txdb creation so we can correctly report to ensembl team
the errant transcript(s)?
I would be happy to take this up with Ensembl team, but, need
details which I don't know how to produce.
Finally, on the side, here is a tiny suggestion:
* change the default for circ_seqs in makeTranscriptDbFromBiomart
to be NULL, instead of any organism (human) specific.
Regards,
--Malcolm
R version 2.14.0 (2011-10-31)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicFeatures_1.6.7 AnnotationDbi_1.16.11 Biobase_2.14.0
[4] GenomicRanges_1.6.6 IRanges_1.12.5
loaded via a namespace (and not attached):
[1] BSgenome_1.22.0    Biostrings_2.22.0  DBI_0.2-5          RCurl_1.9-5
[5] RSQLite_0.11.1     XML_3.9-4          biomaRt_2.10.0     rtracklayer_1.14.4
[9] tools_2.14.0       zlibbioc_1.0.0
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
Rhoda Kinsella Ph.D.
Ensembl Production Project Leader,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319