Question: makeTranscriptDbFromBiomart failure from Data Anomaly
0
gravatar for Chris Seidel
4.1 years ago by
Chris Seidel50
United States
Chris Seidel50 wrote:

I get an error while trying to create a TranscriptDb object using the makeTranscriptDbFromBiomart() function from the GenomicFeatures library. The error message is below (truncated to show just the first transcript). The last time I ran this code was with Ensembl74, whereas today it is failing with Ensembl78. Any tips? Is there anything I can do, or it simply a problem with data at biomaRt?

 

> library(GenomicFeatures)
> txdb <- makeTranscriptDbFromBiomart(host="ensembl.org"
+               ,biomart ="ENSEMBL_MART_ENSEMBL"
+               ,dataset = "dmelanogaster_gene_ensembl")
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'splicings' data frame ... Error in .stopWithBioMartDataAnomalyReport(bm_table, idx, id_prefix, msg) :
  BioMart data anomaly: in the following transcripts,
  located on the minus strand, the 3' UTRs don't start
  where their corresponding exon starts.
  (Showing only the first 6 out of 15131 transcripts.)
  1. Transcript FBtr0082757:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id 5_utr_start
     1     -1    1         13415139       13415286  FBtr0082757-E1    13415139
     2     -1    2         13410791       13411615  FBtr0082757-E2    13411603
     3     -1    3         13409543       13410413  FBtr0082757-E3          NA
       5_utr_end 3_utr_start 3_utr_end cds_length
     1  13415286          NA        NA       1593
     2  13411615          NA        NA       1593
     3        NA    13409543  13409632       1593

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomicFeatures_1.16.3 AnnotationDbi_1.26.1   Biobase_2.24.0
[4] GenomicRanges_1.16.4   GenomeInfoDb_1.0.2     IRanges_1.22.10
[7] BiocGenerics_0.10.0

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.8
 [4] BiocParallel_0.6.1      biomaRt_2.20.0          Biostrings_2.32.1
 [7] bitops_1.0-6            brew_1.0-6              BSgenome_1.32.0
[10] checkmate_1.5.1         codetools_0.2-11        DBI_0.3.1
[13] digest_0.6.6            fail_1.2                foreach_1.4.2
[16] GenomicAlignments_1.0.6 iterators_1.0.7         Rcpp_0.11.5
[19] RCurl_1.95-4.3          Rsamtools_1.16.1        RSQLite_1.0.0
[22] rtracklayer_1.24.2      sendmailR_1.2-1         stats4_3.1.0
[25] stringr_0.6.2           tools_3.1.0             XML_3.98-1.1
[28] XVector_0.4.0           zlibbioc_1.10.0

 

biomart genomicfeatures • 1.1k views
ADD COMMENTlink modified 4.1 years ago by Hervé Pagès ♦♦ 13k • written 4.1 years ago by Chris Seidel50
Answer: makeTranscriptDbFromBiomart failure from Data Anomaly
2
gravatar for Hervé Pagès
4.1 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Chris,

--------------------------------------------------------------------------------------------------------

Edit (03/27/2015): I was indeed able to reproduce the issue by specifying host="ensembl.org" and biomart="ENSEMBL_MART_ENSEMBL" like in the original post. So please disregard my answer below and sorry for the confusion. A patch was applied to GenomicFeatures (1.18.5 in BioC release, and 1.19.35 in BioC devel) to address the issue. The new behavior is that makeTranscriptDbFromBiomart() now drops these problematic transcripts with a warning instead of failing (note that makeTranscriptDbFromBiomart() was renamed makeTxDbFromBiomart() in BioC devel).

--------------------------------------------------------------------------------------------------------

You seem to be using BioC 2.14, which is not the current version (current is 3.0). With BioC 3.0, everything looks OK to me:

library(GenomicFeatures)
txdb <- makeTranscriptDbFromBiomart(dataset="dmelanogaster_gene_ensembl")
txdb
# TxDb object:
# | Db type: TxDb
# | Supporting package: GenomicFeatures
# | Data source: BioMart
# | Organism: Drosophila melanogaster
# | Resource URL: www.biomart.org:80
# | BioMart database: ensembl
# | BioMart database version: ENSEMBL GENES 78 (SANGER UK)
# | BioMart dataset: dmelanogaster_gene_ensembl
# | BioMart dataset description: Drosophila melanogaster genes (BDGP5)
# | BioMart dataset version: BDGP5
# | Full dataset: yes
# | miRBase build ID: NA
# | transcript_nrow: 29173
# | exon_nrow: 80215
# | cds_nrow: 62135
# | Db created by: GenomicFeatures package from Bioconductor
# | Creation time: 2015-03-19 11:24:38 -0700 (Thu, 19 Mar 2015)
# | GenomicFeatures version at creation time: 1.18.3
# | RSQLite version at creation time: 1.0.0
# | DBSCHEMAVERSION: 1.0

Checking transcript FBtr0082757:

exonsBy(txdb, by="tx", use.names=TRUE)$FBtr0082757
# GRanges object with 3 ranges and 3 metadata columns:
#       seqnames             ranges strand |   exon_id   exon_name exon_rank
#          <Rle>          <IRanges>  <Rle> | <integer> <character> <integer>
#   [1]       3R [9240861, 9241006]      - |     59041    CG9792:1         1
#   [2]       3R [9236513, 9237337]      - |     59040    CG9792:2         2
#   [3]       3R [9235264, 9236135]      - |     59039    CG9792:3         3
#   -------
#   seqinfo: 15 sequences from an unspecified genome

cdsBy(txdb, by="tx", use.names=TRUE)$FBtr0082757
# GRanges object with 2 ranges and 3 metadata columns:
#       seqnames             ranges strand |    cds_id    cds_name exon_rank
#          <Rle>          <IRanges>  <Rle> | <integer> <character> <integer>
#   [1]       3R [9236513, 9237324]      - |     46142        <NA>         2
#   [2]       3R [9235355, 9236135]      - |     46141        <NA>         3
#   -------
#   seqinfo: 15 sequences from an unspecified genome

threeUTRsByTranscript(txdb, use.names=TRUE)$FBtr0082757
# GRanges object with 1 range and 3 metadata columns:
#       seqnames             ranges strand |   exon_id   exon_name exon_rank
#          <Rle>          <IRanges>  <Rle> | <integer> <character> <integer>
#   [1]       3R [9235264, 9235354]      - |     59039    CG9792:3         3
#   -------
#   seqinfo: 15 sequences from an unspecified genome

Please update to BioC 3.0 as we don't support old versions of Bioconductor.

Cheers,

H.

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomicFeatures_1.18.3 AnnotationDbi_1.28.2   Biobase_2.26.0        
[4] GenomicRanges_1.18.4   GenomeInfoDb_1.2.4     IRanges_2.0.1         
[7] S4Vectors_0.4.0        BiocGenerics_0.12.1   

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.6           BBmisc_1.9             
 [4] BiocParallel_1.0.3      biomaRt_2.22.0          Biostrings_2.34.1      
 [7] bitops_1.0-6            brew_1.0-6              checkmate_1.5.1        
[10] codetools_0.2-11        DBI_0.3.1               digest_0.6.8           
[13] fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.2
[16] iterators_1.0.7         RCurl_1.95-4.5          Rsamtools_1.18.3       
[19] RSQLite_1.0.0           rtracklayer_1.26.2      sendmailR_1.2-1        
[22] stringr_0.6.2           tools_3.1.2             XML_3.98-1.1           
[25] XVector_0.6.0           zlibbioc_1.12.0        

 

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by Hervé Pagès ♦♦ 13k

Thanks very much. I'll update. And next time I'll check versions first.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Chris Seidel50
Answer: makeTranscriptDbFromBiomart failure from Data Anomaly
0
gravatar for Thomas Maurel
4.1 years ago by
Thomas Maurel770
United Kingdom
Thomas Maurel770 wrote:
Dear Chris, We’ve actually released ensembl 79 last week so you are looking at our new release since your host is pointing at ensembl.org <http: ensembl.org=""/>. The Drosophila assembly was updated from BDGP5 to BDGP6 and gene set to version 6.02 (FB2014_05) (more information on the Ensembl declaration page: http://www.ensembl.org/Drosophila_melanogaster/Info/WhatsNew?db=core#change_1793 <http: www.ensembl.org="" drosophila_melanogaster="" info="" whatsnew?db="core#change_1793">. I had a look at your query and the result looks sound to me, the issue might be coming from the makeTranscriptDbFromBiomart package. In the Ensembl databases we store all the genomic information on forward strand orientation. If we look at the following transcript “FBtr0290037” on the forward strand on the ensembl website: http://www.ensembl.org/Drosophila_melanogaster/Transcript/Exons?db=core;g=FBgn0038135;r=3R:13294748-13298288;t=FBtr0290037 <http: www.ensembl.org="" drosophila_melanogaster="" transcript="" exons?db="core;g=FBgn0038135;r=3R:13294748-13298288;t=FBtr0290037"> and compare the result with Biomart, you can see that both informations are matching: > ensembl_79 <- useMart(biomart=“ENSEMBL_MART_ENSEMBL", host="ensembl.org", path="/biomart/martservice", dataset="dmelanogaster_gene_ensembl") > fruitfly_FBtr0290037 <- getBM(attributes=c(‘chromosome_name','exon_chrom_start','exon_chrom_end','ensembl_exon_id','5_utr_start','5_utr_end','3_utr_start','3_utr_end','cds_start','cds_end','strand','ensembl_transcript_id'),filters = 'ensembl_transcript_id', values = 'FBtr0290037', mart = ensembl_79) > fruitfly_FBtr0290037 chromosome_name exon_chrom_start exon_chrom_end ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_start cds_end strand ensembl_transcript_id 1 3R 13294748 13296098 FBtr0290037-E1 13294748 13294789 NA NA 1 1309 1 FBtr0290037 2 3R 13296170 13296267 FBtr0290037-E2 NA NA NA NA 1310 1407 1 FBtr0290037 3 3R 13296327 13296644 FBtr0290037-E3 NA NA NA NA 1408 1725 1 FBtr0290037 4 3R 13296709 13296827 FBtr0290037-E4 NA NA NA NA 1726 1844 1 FBtr0290037 5 3R 13296893 13296978 FBtr0290037-E5 NA NA NA NA 1845 1930 1 FBtr0290037 6 3R 13297058 13297463 FBtr0290037-E6 NA NA NA NA 1931 2336 1 FBtr0290037 7 3R 13297522 13298288 FBtr0290037-E7 NA NA 13298171 13298288 2337 2985 1 FBtr0290037 In this example, 5_utr_start match the exon_chrom_start (13294748) and 3_utr_end match the exon_chrom_end (13298288). You can also see the same information on the Exons sequence page of the Ensembl website: http://www.ensembl.org/Drosophila_melanogaster/Transcript/Exons?db=core;g=FBgn0038135;r=3R:13294748-13298288;t=FBtr0290037 <http: www.ensembl.org="" drosophila_melanogaster="" transcript="" exons?db="core;g=FBgn0038135;r=3R:13294748-13298288;t=FBtr0290037"> Now if we look at your example: > ensembl_79 <- useMart(biomart=“ENSEMBL_MART_ENSEMBL", host="ensembl.org", path="/biomart/martservice", dataset="dmelanogaster_gene_ensembl") > fruitfly_FBtr0082757 <- getBM(attributes=c(‘chromosome_name','exon_chrom_start','exon_chrom_end','ensembl_exon_id','5_utr_start','5_utr_end','3_utr_start','3_utr_end','strand','ensembl_transcript_id'),filters = 'ensembl_transcript_id', values = 'FBtr0082757', mart = ensembl_79) > fruitfly_FBtr0082757 chromosome_name exon_chrom_start exon_chrom_end ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end strand ensembl_transcript_id 1 3R 13415139 13415286 FBtr0082757-E1 13415139 13415286 NA NA -1 FBtr0082757 2 3R 13410791 13411615 FBtr0082757-E2 13411603 13411615 NA NA -1 FBtr0082757 3 3R 13409543 13410413 FBtr0082757-E3 NA NA 13409543 13409632 -1 FBtr0082757 Because this Transcript is on the reverse strand, information in the biomart start column will actually mean the end of the feature and the same for information in the end column: Biomart 5_utr_end match the exon_chrom_end (13415286), this actually mean that the 5’ UTR start match the exon chromosome start (FBtr0082757-E1) as you can see on the exons sequence page of the Ensembl website: http://www.ensembl.org/Drosophila_melanogaster/Transcript/Exons?db=core;g=FBgn0041711;r=3R:13409543-13415286;t=FBtr0082757 <http: www.ensembl.org="" drosophila_melanogaster="" transcript="" exons?db="core;g=FBgn0041711;r=3R:13409543-13415286;t=FBtr0082757"> Biomart 3_utr_start match the exon_chrom_start (13409543), this actually mean that the 3’ UTR end match the exon chromosome end (FBtr0082757-E3) as you can see on the exons sequence page of the Ensembl website: http://www.ensembl.org/Drosophila_melanogaster/Transcript/Exons?db=core;g=FBgn0041711;r=3R:13409543-13415286;t=FBtr0082757 <http: www.ensembl.org="" drosophila_melanogaster="" transcript="" exons?db="core;g=FBgn0041711;r=3R:13409543-13415286;t=FBtr0082757"> My feeling is that the “TranscriptDbFromBiomart” package need to be updated since 3’UTR start don’t have to match the exon start, only 3’UTR end need to match the exon end. Hope this helps, Regards, Thomas > On 18 Mar 2015, at 17:57, Chris Seidel [bioc] <noreply@bioconductor.org> wrote: > > Activity on a post you are following on support.bioconductor.org <https: support.bioconductor.org=""/> > User Chris Seidel <https: support.bioconductor.org="" u="" 5840=""/> wrote Question: makeTranscriptDbFromBiomart failure from Data Anomaly <https: support.bioconductor.org="" p="" 65796=""/>: > > > I get an error while trying to create a TranscriptDb object using the makeTranscriptDbFromBiomart() function from the GenomicFeatures library. The error message is below (truncated to show just the first transcript). The last time I ran this code was with Ensembl74, whereas today it is failing with Ensembl78. Any tips? Is there anything I can do, or it simply a problem with data at biomaRt? > > > > library(GenomicFeatures) > > txdb <- makeTranscriptDbFromBiomart(host="ensembl.org" > + ,biomart ="ENSEMBL_MART_ENSEMBL" > + ,dataset = "dmelanogaster_gene_ensembl") > Download and preprocess the 'transcripts' data frame ... OK > Download and preprocess the 'splicings' data frame ... Error in .stopWithBioMartDataAnomalyReport(bm_table, idx, id_prefix, msg) : > BioMart data anomaly: in the following transcripts, > located on the minus strand, the 3' UTRs don't start > where their corresponding exon starts. > (Showing only the first 6 out of 15131 transcripts.) > 1. Transcript FBtr0082757: > strand rank exon_chrom_start exon_chrom_end ensembl_exon_id 5_utr_start > 1 -1 1 13415139 13415286 FBtr0082757-E1 13415139 > 2 -1 2 13410791 13411615 FBtr0082757-E2 13411603 > 3 -1 3 13409543 13410413 FBtr0082757-E3 NA > 5_utr_end 3_utr_start 3_utr_end cds_length > 1 13415286 NA NA 1593 > 2 13411615 NA NA 1593 > 3 NA 13409543 13409632 1593 > > > sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] GenomicFeatures_1.16.3 AnnotationDbi_1.26.1 Biobase_2.24.0 > [4] GenomicRanges_1.16.4 GenomeInfoDb_1.0.2 IRanges_1.22.10 > [7] BiocGenerics_0.10.0 > > loaded via a namespace (and not attached): > [1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8 > [4] BiocParallel_0.6.1 biomaRt_2.20.0 Biostrings_2.32.1 > [7] bitops_1.0-6 brew_1.0-6 BSgenome_1.32.0 > [10] checkmate_1.5.1 codetools_0.2-11 DBI_0.3.1 > [13] digest_0.6.6 fail_1.2 foreach_1.4.2 > [16] GenomicAlignments_1.0.6 iterators_1.0.7 Rcpp_0.11.5 > [19] RCurl_1.95-4.3 Rsamtools_1.16.1 RSQLite_1.0.0 > [22] rtracklayer_1.24.2 sendmailR_1.2-1 stats4_3.1.0 > [25] stringr_0.6.2 tools_3.1.0 XML_3.98-1.1 > [28] XVector_0.4.0 zlibbioc_1.10.0 > > > You may reply via email or visit makeTranscriptDbFromBiomart failure from Data Anomaly > -- Thomas Maurel Bioinformatician - Ensembl Production Team European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD United Kingdom
ADD COMMENTlink written 4.1 years ago by Thomas Maurel770

Hi Thomas,

Just to clarify, here is what the FBtr0082757 transcript looks like:

                  exon 3               exon 2            exon 1
exons:   <<<<<<<<<<<<<<<     <<<<<<<<<<<<<<<<     <<<<<<<<<<<<<

CDS:     --------<<<<<<<     <<<<<<<<--------     -------------

UTRs:    <<<<<<<<-------     --------<<<<<<<<     <<<<<<<<<<<<<
           3' utr                      5' utr            5' utr

Because the transcript is on the minus strand, the start (i.e. the left coordinate) of the 3' UTR matches the start of the exon, and the end (i.e. the right coordinate) of the 5' UTR matches the end of the exon. This is true even when the UTR spans more than 1 exon (like it's the case here for the 5' UTR). In this case the ranges of the individual 3' UTR parts must have the same start as the range of the exon that they belong to, and the ranges of the individual 5' UTR parts must have the same end as the range of the exon that they belong to. For transcripts that are on the plus strand, this is the other way around.

This is one of the many sanity checks that makeTranscriptDbFromBiomart() performs on the data before it stores it into the TxDb SQLite database. In the case reported by the OP, the mystery to me is that makeTranscriptDbFromBiomart() is complaining even though the start of the 3' UTR matches the start of the exon (13409543 for both of them), as it should be. Unfortunately I'm not in a position to troubleshoot this because we cannot afford supporting old versions of BioC (if I find the bug and fix it, it won't propagate anyway because old BioC versions are frozen).

Hope this helps,

H.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Hervé Pagès ♦♦ 13k

Yes but what appears upon cursory inspection to be the identical problem is still present in R 3.1.3 with BioConductor version 3.0.   

Witness:

> txdb <- makeTranscriptDbFromBiomart(host="ensembl.org"
    ,biomart ="ENSEMBL_MART_ENSEMBL"
  ,dataset = "dmelanogaster_gene_ensembl")
txdb <- makeTranscriptDbFromBiomart(host="ensembl.org"
+     ,biomart ="ENSEMBL_MART_ENSEMBL"
+   ,dataset = "dmelanogaster_gene_ensembl")
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'splicings' data frame ... Error in .stopWithBioMartDataAnomalyReport(bm_result, idx, id_prefix,  : 
  BioMart data anomaly: in the following transcripts, 
  located on the minus strand, the 3' UTRs don't start
  where their corresponding exon starts.
  (Showing only the first 6 out of 15131 transcripts.)
  1. Transcript FBtr0082757:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
     1     -1    1         13415139       13415286  FBtr0082757-E1
     2     -1    2         13410791       13411615  FBtr0082757-E2
     3     -1    3         13409543       13410413  FBtr0082757-E3
       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end
     1                   NA                 NA    13415139  13415286
     2             13410791           13411602    13411603  13411615
     3             13409633           13410413          NA        NA
       3_utr_start 3_utr_end cds_start cds_end cds_length
     1          NA        NA        NA      NA       1593
     2          NA        NA         1     812       1593
     3    13409543  13409632       813    1593       1593
  2. Transcript FBtr0080129:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
     1     -1    1         10686391       10686989  FBtr0080129-E1
       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end
     1             10686462           10686947    10686948  10686989
       3_utr_start 3_utr_end cds_start cds_end cds_length
     1    10686391  10686461         1     486        486
  3. Transcript FBtr0336850:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
     1     -1    1         10686124       10687066  FBtr0336850-E1
       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end
     1             10686462           10686947    10686948  10687066
       3_utr_start 3_utr_end cds_start cds_end cds_length
     1    10686124  10686461         1     486        486
  4. Transcript FBtr0113346:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
     1     -1    1         11691610       11691799  FBtr0113346-E1
     2     -1    2         11691067       11691551  FBtr0113346-E2
     3     -1    3         11690651       11691005  FBtr0113346-E3
     4     -1    4         11690386       11690596  FBtr0113346-E4
       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end
     1             11691610           11691799          NA        NA
     2             11691067           11691551          NA        NA
     3             11690651           11691005          NA        NA
     4             11690445           11690596          NA        NA
       3_utr_start 3_utr_end cds_start cds_end cds_length
     1          NA        NA         1     190       1182
     2          NA        NA       191     675       1182
     3          NA        NA       676    1030       1182
     4    11690386  11690444      1031    1182       1182
  5. Transcript FBtr0082908:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
     1     -1    1         14131864       14132072  FBtr0082908-E1
     2     -1    2         14130680       14131210  FBtr0082908-E2
       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end
     1             14131864           14131872    14131873  14132072
     2             14130764           14131210          NA        NA
       3_utr_start 3_utr_end cds_start cds_end cds_length
     1          NA        NA         1       9        456
     2    14130680  14130763        10     456        456
  6. Transcript FBtr0084333:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
     1     -1    1         22701369       22702084  FBtr0084333-E1
     2     -1    2         22701104       22701167  FBtr0084333-E2
     3     -1    3         22700582       22701050  FBtr0084333-E3
       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end
     1             22701369           22701479    22701480  22702084
     2             22701104           22701167          NA        NA
     3             22700758           22701050          NA        NA
       3_utr_start 3_utr_end cds_start cds_end cds_length
     1          NA        NA         1     111        468
     2          NA        NA       112     175        468
     3    22700582  22700757       176     468      
> sessionInfo()
sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomicFeatures_1.18.3 AnnotationDbi_1.28.1   Biobase_2.26.0        
[4] GenomicRanges_1.18.4   GenomeInfoDb_1.2.4     IRanges_2.0.1         
[7] S4Vectors_0.4.0        BiocGenerics_0.12.1    BiocInstaller_1.16.1  

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.9             
 [4] BiocParallel_1.0.3      biomaRt_2.22.0          Biostrings_2.34.1      
 [7] bitops_1.0-6            brew_1.0-6              checkmate_1.5.1        
[10] codetools_0.2-10        DBI_0.3.1               digest_0.6.8           
[13] fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.2
[16] iterators_1.0.7         RCurl_1.95-4.5          Rsamtools_1.18.3       
[19] RSQLite_1.0.0           rtracklayer_1.26.2      sendmailR_1.2-1        
[22] stringr_0.6.2           tools_3.1.3             XML_3.98-1.1           
[25] XVector_0.6.0           zlibbioc_1.12.0        
> source("http://bioconductor.org/biocLite.R")
Bioconductor version 3.0 (BiocInstaller 1.16.1), ?biocLite for help

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Malcolm Cook1.5k

That's right Malcolm!

1) Now I can reproduce this too. I forgot to specify host="ensembl.org" like you and the OP did in order to get the new ensembl release (79). So I was actually querying the ensembl mart at www.biomart.org which is still hosting the previous release of ensembl (78). I should have read Thomas post more carefully.

2) Another problem is that there was an issue with the error reporting code in makeTxDbFromBiomart(): when some transcripts failed to pass the sanity checks, the error message was displaying the wrong transcripts. More precisely, many "good" transcripts (like FBtr0082757) were mistakenly added to the set of "bad" transcripts and included in the error message. This is now fixed in GenomicFeatures release (1.18.4) and devel (1.19.33).

3) With GenomicFeatures 1.18.4:

library(GenomicFeatures)
txdb <- makeTranscriptDbFromBiomart(host="ensembl.org",
                                biomart="ENSEMBL_MART_ENSEMBL",
                                dataset="dmelanogaster_gene_ensembl")
# Download and preprocess the 'transcripts' data frame ... OK
# Download and preprocess the 'splicings' data frame ... Error in .stopWithBioMartDataAnomalyReport(bm_result, idx[bad_idx2], id_prefix,  : 
#  BioMart data anomaly: in the following transcripts, 
#  located on the minus strand, the start of some 3' UTRs 
#  (3_utr_start) doesn't match the start of the exon 
#  (exon_chrom_start).
#  (Showing only the first 6 out of 9 transcripts.)
#  1. Transcript FBtr0084081:
#       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
#     1     -1    1         21368380       21369062  FBtr0084081-E2
#     2     -1    2         21377288       21377399  FBtr0084081-E1
#     3     -1    3         21376819       21377076  FBtr0084081-E3
#     4     -1    4         21376602       21376741  FBtr0084081-E4
#     5     -1    5         21375060       21375912  FBtr0084081-E5
#       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end
#     1             21368871           21369062    21368380  21368870
#     2                   NA                 NA          NA        NA
#     3             21376819           21377035          NA        NA
#     4             21376602           21376741          NA        NA
#     5             21375060           21375912          NA        NA
#       3_utr_start 3_utr_end cds_start cds_end cds_length
#     1          NA        NA         1     192        521
#     2    21377288  21377399       193     304        521
#     3    21377036  21377076       305     521        521
#     4          NA        NA        NA      NA        521
#     5          NA        NA        NA      NA        521
# <output truncated>

4) Looking at the exons sequence page for FBtr0084081 on the Ensembl website confirms the problem:

http://www.ensembl.org/Drosophila_melanogaster/Transcript/Exons?db=core;g=FBgn0002781;r=3R:21368380-21377399;t=FBtr0084081

Chunks of translated sequences alternate with chunks of UTRs! Also the ranking of the exons is suspect.

5) It's also puzzling that FBtr0084081 is not in

ftp://ftp.ensembl.org/pub/current_gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.79.gtf.gz

Shouldn't this file be in sync with what's available on the mart at ensembl.org?

6) The exon sequences for FBtr0084081 on the Ensembl website don't match the transcript sequence on the FlyBase website (expand the Sequence section to see it):

http://flybase.org/reports/FBtr0084081.html

7) Last but not least. If you concatenate the Ensembl exon sequences in the order 1, 5, 4, 3, 2, and after reverse complementing sequences 2 to 5, that gives exactly the FlyBase transcript sequence. After doing that shuffling of the Ensembl exons, then everything makes sense again: now the exons are ranked by ascending genomic position (which suggests that they are in fact on the + strand and not the - strand) and the UTRs are now flanking the translated sequences (and not found in the middle of them).

Conclusion: something went wrong when the FlyBase annotations were imported into the Ensembl db.

H.

ADD REPLYlink written 4.1 years ago by Hervé Pagès ♦♦ 13k

Dear Hervé,

Thanks a lot for the thorough investigation.

Regarding item number 5, we intentionally excluded the Fruit fly gene FBgn0002781 from our GTF dumps as it was making our sanity checks failing. FBgn0002781 is a trans-spliced gene and doesn't really fit the GTF format definition. We will update our README files for our next release to warn users that we don't dump trans-spliced genes for the moment (there is only one in Fruit fly at the moment).

Regarding items 4,6,7, I've contacted my colleague from Ensembl genomes, James Allen who did the Fruit fly import in Ensembl 79. Please find his explanation below:

Point 4) is correct, the display is confusing because it's (understandably) not been written to deal with trans-spliced genes. However; trans-splicing is rare (or, at least, the annotation of trans-splicing is rare...) and this gene is well-known (cf https://www.google.co.uk/search?q=mdg4+trans+spliced&oq=mdg4+trans+spliced). So if someone is interested in this specific gene it's likely that they'll realise why the display is backwards.

Point 6), FlyBase show the transcript sequence having done the appropriate flipping and reversing that the user describes. We show the exon sequence before that's done, so yes, the sequences are different but I don't think that's an error.

Point 7) certainly describes the way to get sequence from the transpliced transcript; but to flip the exon-strandedness would be to misrepresent the FlyBase data. 

Hope this helps,

Best Regards,

Thomas

 

ADD REPLYlink written 4.1 years ago by Thomas Maurel770

Hi Thomas,

Thanks for checking with James and getting back to us. It's true that *gene* FBgn0002781 contains trans-splicing events (most exons come from the minus strand but some of them come from the plus strand). However, if we focus on *transcript* FBtr0084081, there is no trans-splicing event in it: its 5 exons come from the plus strand and they are ranked by ascending genomic position. A very normal situation.

So I wouldn't say that the display is confusing because it's not been written to deal with trans-spliced genes (note that the display is only showing the FBtr0084081 transcript in the first place, not the associated gene). It's confusing simply because the data in the db is incorrect: the exons are on the wrong strand and ranked incorrectly. Note that correct exon ranking and strand information is crucial to programmatically extract the sequence of the transcript from the reference sequence of the chromosome. It becomes critical when the transcript is a protein-coding transcript (like FBtr0084081) and people want to predict the consequences of small variations in the reference sequence.

Maybe the underlying problem is that the Ensembl db schema itself doesn't support trans-spliced genes (i.e. it can only represent genes where all exons are on the same chromosome and strand), I don't know. If that's the case, then maybe FBgn0002781 should be excluded (like it was excluded from the GTF dump). Otherwise, I hope that the representation of transcript FBtr0084081 can be fixed.

Thanks,

H.

ADD REPLYlink written 4.1 years ago by Hervé Pagès ♦♦ 13k

Dear Hervé,

Regarding FBtr0084081, I've got more comments from James. The FlyBase record for the transcript (http://flybase.org/reports/FBtr0084081.html) clearly states that this is trans-spliced in the comments, with the SO term specific to transcripts: SO:0000479. In addition the FlyBase gff that was loaded from the following file: ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.02_FB2014_05/gff/dmel-all-no-analysis-r6.02.gff.gz. Unzipping and grepping for FBtr0084081, you get the rows below. These indicate that the transcript is on the -ve strand, except for one exon on the +ve strand. (Note that there's a labelling error in the gff, where 3' UTR is incorrectly labelled as 5' UTR on the +ve exon. But we don't import UTR information in Ensembl)

3R    FlyBase    mRNA    21368380    21377399    .    .    .    ID=FBtr0084081;Name=mod(mdg4)-RV;Parent=FBgn0002781;Alias=CG32491-RV,62.3;Dbxref=FlyBase_Annotation_IDs:CG32491-RV,REFSEQ:NM_176526;score_text=Weakly Supported;score=3
3R    FlyBase    exon    21368380    21369062    .    +    .    Parent=FBtr0084081;parent_type=mRNA
3R    FlyBase    five_prime_UTR    21368380    21368867    .    +    .    Parent=FBtr0084081;parent_type=mRNA
3R    FlyBase    CDS    21368868    21369062    .    +    2    Parent=FBtr0084081;parent_type=mRNA
3R    FlyBase    protein    21368871    21377032    .    .    .    ID=FBpp0083480;Name=mod(mdg4)-PV;Derives_from=FBtr0084081;Alias=CG32491-PV;Dbxref=FlyBase_Annotation_IDs:CG32491-PV,REFSEQ:NP_788703,GB_protein:AAO41585;derived_isoelectric_point=5.25;derived_molecular_weight=62298.7
3R    FlyBase    CDS    21375060    21375912    .    -    0    Parent=FBtr0084079,FBtr0084080,FBtr0084081,FBtr0084082,FBtr0084083,FBtr0084084,FBtr0084085,FBtr0307759,FBtr0307760;parent_type=mRNA
3R    FlyBase    exon    21375060    21375912    .    -    .    Parent=FBtr0084066,FBtr0084067,FBtr0084068,FBtr0084069,FBtr0084070,FBtr0084071,FBtr0084072,FBtr0084073,FBtr0084074,FBtr0084075,FBtr0084077,FBtr0084078,FBtr0084079,FBtr0084080,FBtr0084081,FBtr0084082,FBtr0084083,FBtr0084084,FBtr0084085,FBtr0084060,FBtr0084061,FBtr0084062,FBtr0084063,FBtr0084064,FBtr0084065,FBtr0114359,FBtr0114360,FBtr0114361,FBtr0307759,FBtr0307760,FBtr0343765;parent_type=mRNA
3R    FlyBase    intron    21375913    21376601    .    -    .    Parent=FBtr0084066,FBtr0084067,FBtr0084068,FBtr0084069,FBtr0084070,FBtr0084071,FBtr0084072,FBtr0084073,FBtr0084074,FBtr0084075,FBtr0084077,FBtr0084078,FBtr0084079,FBtr0084080,FBtr0084081,FBtr0084082,FBtr0084083,FBtr0084084,FBtr0084085,FBtr0084060,FBtr0084061,FBtr0084062,FBtr0084063,FBtr0084064,FBtr0084065,FBtr0114359,FBtr0114360,FBtr0114361,FBtr0307759,FBtr0307760,FBtr0343765;parent_type=mRNA
3R    FlyBase    CDS    21376602    21376741    .    -    2    Parent=FBtr0084079,FBtr0084080,FBtr0084081,FBtr0084082,FBtr0084083,FBtr0084084,FBtr0084085,FBtr0307759,FBtr0307760;parent_type=mRNA
3R    FlyBase    exon    21376602    21376741    .    -    .    Parent=FBtr0084066,FBtr0084067,FBtr0084068,FBtr0084069,FBtr0084070,FBtr0084071,FBtr0084072,FBtr0084073,FBtr0084074,FBtr0084075,FBtr0084077,FBtr0084078,FBtr0084079,FBtr0084080,FBtr0084081,FBtr0084082,FBtr0084083,FBtr0084084,FBtr0084085,FBtr0084060,FBtr0084061,FBtr0084062,FBtr0084063,FBtr0084064,FBtr0084065,FBtr0114359,FBtr0114360,FBtr0114361,FBtr0307759,FBtr0307760,FBtr0343765;parent_type=mRNA
3R    FlyBase    intron    21376742    21376818    .    -    .    Parent=FBtr0084066,FBtr0084067,FBtr0084068,FBtr0084069,FBtr0084070,FBtr0084071,FBtr0084072,FBtr0084073,FBtr0084074,FBtr0084075,FBtr0084077,FBtr0084078,FBtr0084079,FBtr0084080,FBtr0084081,FBtr0084082,FBtr0084083,FBtr0084084,FBtr0084085,FBtr0084060,FBtr0084061,FBtr0084062,FBtr0084063,FBtr0084064,FBtr0084065,FBtr0114359,FBtr0114360,FBtr0114361,FBtr0307759,FBtr0307760,FBtr0343765;parent_type=mRNA
3R    FlyBase    exon    21376819    21377076    .    -    .    Parent=FBtr0084066,FBtr0084067,FBtr0084068,FBtr0084069,FBtr0084070,FBtr0084071,FBtr0084072,FBtr0084073,FBtr0084074,FBtr0084075,FBtr0084077,FBtr0084078,FBtr0084079,FBtr0084080,FBtr0084081,FBtr0084082,FBtr0084083,FBtr0084084,FBtr0084085,FBtr0084060,FBtr0084061,FBtr0084062,FBtr0084063,FBtr0084064,FBtr0084065,FBtr0114359,FBtr0114360,FBtr0114361,FBtr0307759,FBtr0307760,FBtr0343765;parent_type=mRNA
3R    FlyBase    CDS    21376819    21377032    .    -    0    Parent=FBtr0084079,FBtr0084080,FBtr0084081,FBtr0084082,FBtr0084083,FBtr0084084,FBtr0084085,FBtr0307759,FBtr0307760;parent_type=mRNA
3R    FlyBase    five_prime_UTR    21377033    21377076    .    -    .    Parent=FBtr0084079,FBtr0084080,FBtr0084081,FBtr0084082,FBtr0084083,FBtr0084084,FBtr0084085,FBtr0307759,FBtr0307760;parent_type=mRNA
3R    FlyBase    intron    21377077    21377287    .    -    .    Parent=FBtr0084079,FBtr0084081,FBtr0084082,FBtr0084083,FBtr0084084,FBtr0084085,FBtr0307759,FBtr0307760;parent_type=mRNA
3R    FlyBase    exon    21377288    21377399    .    -    .    Parent=FBtr0084079,FBtr0084081,FBtr0084082,FBtr0084083,FBtr0084084,FBtr0084085,FBtr0307759,FBtr0307760;parent_type=mRNA
3R    FlyBase    five_prime_UTR    21377288    21377399    .    -    .    Parent=FBtr0084079,FBtr0084081,FBtr0084082,FBtr0084083,FBtr0084084,FBtr0084085,FBtr0307759,FBtr0307760;parent_type=mRNA

 

Regarding the trans-spliced genes, I totally agree with you, the Ensembl website display was not written to deal with trans-spliced genes. The Ensembl db schema  do support the trans-spliced genes as you can see below data for FBtr0084081, the strand information is stored on the gene, transcript and exon level:

mysql -h ensembldb.ensembl.org -u anonymous -P 3306 drosophila_melanogaster_core_79_602 -e "select t.seq_region_start, t.seq_region_end, t.seq_region_strand,e.exon_id, e.seq_region_start, e.seq_region_end, e.seq_region_strand,e.phase, e.end_phase, e.stable_id from exon_transcript et, exon e,transcript t where e.exon_id = et.exon_id and t.transcript_id =et.transcript_id and t.stable_id = 'FBtr0084081'  order by e.stable_id"

+------------------+----------------+-------------------+---------+------------------+----------------+-------------------+-------+-----------+----------------+

| seq_region_start | seq_region_end | seq_region_strand | exon_id | seq_region_start | seq_region_end | seq_region_strand | phase | end_phase | stable_id      |

+------------------+----------------+-------------------+---------+------------------+----------------+-------------------+-------+-----------+----------------+

|         21368380 |       21377399 |                 1 |   47802 |         21377288 |       21377399 |                -1 |    -1 |        -1 | FBtr0084081-E1 |

|         21368380 |       21377399 |                 1 |   47801 |         21368380 |       21369062 |                 1 |    -1 |         0 | FBtr0084081-E2 |

|         21368380 |       21377399 |                 1 |   47803 |         21376819 |       21377076 |                -1 |     0 |        -1 | FBtr0084081-E3 |

|         21368380 |       21377399 |                 1 |   47804 |         21376602 |       21376741 |                -1 |     0 |         2 | FBtr0084081-E4 |

|         21368380 |       21377399 |                 1 |   47805 |         21375060 |       21375912 |                -1 |     2 |         0 | FBtr0084081-E5 |

+—————————+----------------+-------------------+---------+------------------+----------------+-------------------+-------+-----------+----------------+

In Biomart, it's an other story since the ensembl mart is gene centric and we only display strand on the Ensembl Gene level since in 99.9% of the case the strand information on gene level is the same on the Transcript and Exon level. 

> ensembl_79 <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="ensembl.org", path="/biomart/martservice", dataset="dmelanogaster_gene_ensembl")


> fruitfly_FBtr0084081 <- getBM(attributes=c('chromosome_name','exon_chrom_start','exon_chrom_end','ensembl_exon_id','5_utr_start','5_utr_end','3_utr_start','3_utr_end','cds_start','cds_end','strand','ensembl_transcript_id'),filters = 'ensembl_transcript_id', values = 'FBtr0084081', mart = ensembl_79)
> fruitfly_FBtr0084081
  chromosome_name exon_chrom_start exon_chrom_end ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_start cds_end strand
1              3R         21368380       21369062  FBtr0084081-E2    21368380  21368870          NA        NA         1     192     -1
2              3R         21377288       21377399  FBtr0084081-E1          NA        NA    21377288  21377399       193     304     -1
3              3R         21376819       21377076  FBtr0084081-E3          NA        NA    21377036  21377076       305     521     -1
4              3R         21376602       21376741  FBtr0084081-E4          NA        NA          NA        NA        NA      NA     -1
5              3R         21375060       21375912  FBtr0084081-E5          NA        NA          NA        NA        NA      NA     -1
  ensembl_transcript_id
1           FBtr0084081
2           FBtr0084081
3           FBtr0084081
4           FBtr0084081
5           FBtr0084081

We will discuss internally what we should do with this gene and will try to improve the display in future releases.

Thank you for your feedback,

Best Regards,

Thomas

 

ADD REPLYlink written 4.1 years ago by Thomas Maurel770

Hi Thomas,

Thanks again for the useful feedback.

I believe that at the heart of the problem is a representation of the FBtr0084081 transcript by the FlyBase folks that is confusing everybody. Even though they report that 4 of the 5 exons that constitute this transcript are located on the minus strand, this contradicts the transcript sequence they display on the FlyBase website:

http://flybase.org/reports/FBtr0084081.html

It could be that setting the strand to minus for these 4 exons reflects the biology, in the sense that maybe transcription really happens on the minus strand (whatever that means), but it's the reverse complement of the exon sequences that end up in the transcript. So everything happens as if the exons were coming from the plus strand. In other words, the transcript sequence is made of concatenation of regions that belong to the plus strand, not the minus strand.

From a data representation point of view, this is very misleading and there is of course little chance that the Ensembl display tool can be tweaked to display the correct exon sequences ("correct" in the sense that the concatenation of the exon sequences should be identical to the transcript sequence).

Also the ranking of the exons is crucial. The concatenation of the exon sequences needs to be done in a particular order. Again, the ranking of the 5 exons provided by Ensembl doesn't seem right. But maybe the situation is hopeless because the FlyBase GFF file itself you guys used to import the data (dmel-all-no-analysis-r6.02.gff.gz) doesn't seem to contain exon ranking information. Even more confusin is that the FlyBase chado-xml file

ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.02_FB2014_05/chado-xml/chado_dmel_gene_models.xml.gz

does contain exon ranking information and the rank for the 5 exons in transcript FBtr0084081 is FBtr0084081-E1, FBtr0084081-E3, FBtr0084081-E4, FBtr0084081-E5, FBtr0084081-E2. This is not the same as the Ensembl ranking but doesn't make sense either because it contradicts the order in which the exons are actually concatenated to form the transcript sequence.

Anyway, back to the original issue. Since the Ensembl mart is gene centric (thanks for clarifying this), then there is no way that a transcript coming from a different strand than the strand of the gene can be represented correctly. Problem is that the user has no way to know that something is wrong with the representation of the transcript provided by the mart, unless s/he runs its own sanity checks. So that's what makeTxDbFromBiomart() does. And in the absence of any short term resolution to address the data mis-representation issues on FlyBase and the Ensembl mart, I will just modify makeTxDbFromBiomart() to drop the transcripts (with a warning) that don't pass the sanity checks, instead of failing.

Thanks again for your feedback,

H.

ADD REPLYlink written 4.1 years ago by Hervé Pagès ♦♦ 13k
Dear Hervé, Again, thank you for your thorough investigation and explanation. I’ve fed back your comments to James. Thank you for updating makeTxDbFromBiomart(), I think it make sense that this Transcript won’t pass the sanity checks instead of failing. Regards, Thomas > On 23 Mar 2015, at 18:28, Hervé Pagès [bioc] <noreply@bioconductor.org> wrote: > > Activity on a post you are following on support.bioconductor.org <https: support.bioconductor.org=""/> > User Hervé Pagès <https: support.bioconductor.org="" u="" 1542=""/> wrote Comment: makeTranscriptDbFromBiomart failure from Data Anomaly <https: support.bioconductor.org="" p="" 65796="" #65944="">: > > > Hi Thomas, > > Thanks again for the useful feedback. > > I believe that at the heart of the problem is a representation of the FBtr0084081 transcript by the FlyBase folks that is confusing everybody. Even though they report that 4 of the 5 exons that constitute this transcript are located on the minus strand, this contradicts the transcript sequence they display on the FlyBase website: > > http://flybase.org/reports/FBtr0084081.html <http: flybase.org="" reports="" fbtr0084081.html=""> > It could be that setting the strand to minus for these 4 exons reflects the biology, in the sense that maybe transcription really happens on the minus strand (whatever that means), but it's the reverse complement of the exon sequences that end up in the transcript. So everything happens as if the exons were coming from the plus strand. In other words, the transcript sequence is made of concatenation of regions that belong to the plus strand, not the minus strand. > > From a data representation point of view, this is very misleading and there is of course little chance that the Ensembl display tool can be tweaked to display the correct exon sequences ("correct" in the sense that the concatenation of the exon sequences should be identical to the transcript sequence). > > Also the ranking of the exons is crucial. The concatenation of the exon sequences needs to be done in a particular order. Again, the ranking of the 5 exons provided by Ensembl doesn't seem right. But maybe the situation is hopeless because the FlyBase GFF file itself you guys used to import the data (dmel-all-no-analysis-r6.02.gff.gz) doesn't seem to contain exon ranking information. Even more confusin is that the FlyBase chado-xml file > > ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.02_FB2014_05/chado-xml/chado_dmel_gene_models.xml.gz <ftp: ftp.flybase.net="" genomes="" drosophila_melanogaster="" dmel_r6.02_fb2014_05="" chado-xml="" chado_dmel_gene_models.xml.gz=""> > does contain exon ranking information and the rank for the 5 exons in transcript FBtr0084081 is FBtr0084081-E1, FBtr0084081-E3, FBtr0084081-E4, FBtr0084081-E5, FBtr0084081-E2. This is not the same as the Ensembl ranking but doesn't make sense either because it contradicts the order in which the exons are actually concatenated to form the transcript sequence. > > Anyway, back to the original issue. Since the Ensembl mart is gene centric (thanks for clarifying this), then there is no way that a transcript coming from a different strand than the strand of the gene can be represented correctly. Problem is that the user has no way to know that something is wrong with the representation of the transcript provided by the mart, unless s/he runs its own sanity checks. So that's what makeTxDbFromBiomart() does. And in the absence of any short term resolution to address the data mis-representation issues on FlyBase and the Ensembl mart, I will just modify makeTxDbFromBiomart() to drop the transcripts (with a warning) that don't pass the sanity checks, instead of failing. > > Thanks again for your feedback, > > H. > > > You may reply via email or visit C: makeTranscriptDbFromBiomart failure from Data Anomaly > -- Thomas Maurel Bioinformatician - Ensembl Production Team European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD United Kingdom
ADD REPLYlink written 4.1 years ago by Thomas Maurel770

Hi Thomas, Chris,

A patch was applied to GenomicFeatures (1.18.5 in BioC release, and 1.19.35 in BioC devel) to address the issue. I also edited my original answer above to rectify and provide some details.

Cheers,

H.

ADD REPLYlink written 4.1 years ago by Hervé Pagès ♦♦ 13k

Thanks for the investigation, and the explanation!

ADD REPLYlink written 4.1 years ago by Chris Seidel50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 163 users visited in the last hour