makeTranscriptDbFromGFF Unexpected transcript duplicates
4
0
Entering edit mode
@miss-agnieszka-aleksandra-golicz-6531
Last seen 18 months ago
Australia

Hello,

I have been trying to use makeTranscriptDbFromGFF() to read my gff file but I get the following error:

extracting transcript information
Error in .prepareGFF3TXS(data, useGenesAsTranscripts) :
  Unexpected transcript duplicates
In addition: Warning message:
In .local(con, format, text, ...) :
  gff-version directive indicates version is   3, not 3

I have read that this was a problem when he file does not have transcript features. My file does and I have checked all the transcript IDs are unique. This error did not happen in Bioconductor 2.14 but appeared when I upgraded to 3.0. useGenesAsTranscripts=TRUE does not help.

The file I am using seems legitimate, its has been validated by genometools gt gff3, which is pretty strict. I am including my R code, session info and the two genes from my gff3 (if I use just them everything is fine, but using the full file causes a problem, although all genes are formatted in the same manner, I am afraid in order to reproduce the error I would have to share the entire gff3 file).

txdb <- makeTranscriptDbFromGFF("annotation.gff3")

> sessionInfo()
R version 3.1.0 RC (2014-04-05 r65382)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] VariantAnnotation_1.10.5 Rsamtools_1.16.1         Biostrings_2.32.1
 [4] XVector_0.4.0            GenomicFeatures_1.16.3   AnnotationDbi_1.26.1
 [7] Biobase_2.22.0           GenomicRanges_1.16.4     GenomeInfoDb_1.0.2
[10] IRanges_1.22.10          BiocGenerics_0.10.0

loaded via a namespace (and not attached):
 [1] biomaRt_2.18.0     bitops_1.0-6       BSgenome_1.30.0    DBI_0.3.1
 [5] RCurl_1.95-4.3     RSQLite_0.11.4     rtracklayer_1.22.7 stats4_3.1.0
 [9] tools_3.1.0        XML_3.98-1.1       zlibbioc_1.8.0

head -n 100 annotation.gff3

##gff-version   3

C1      JCVI    gene    1449    1625    .       -       .       ID=Bo1g001000;Note=cyclic nucleotide gated channel
C1      JCVI    mRNA    1449    1625    .       -       .       ID=Bo1g001000.1;Parent=Bo1g001000;Note=cyclic nucleotide gated channel
C1      JCVI    exon    1449    1625    .       -       .       ID=Bo1g001000.1_exon_1;Parent=Bo1g001000.1
C1      JCVI    CDS     1449    1625    .       -       0       ID=Bo1g001000.1_cds_1;Parent=Bo1g001000.1
###
C1      JCVI    gene    1711    2833    .       +       .       ID=Bo1g001010;Note=Amino acid transporter family protein
C1      JCVI    mRNA    1711    2833    .       +       .       ID=Bo1g001010.1;Parent=Bo1g001010;Note=Amino acid transporter family protein
C1      JCVI    exon    1711    1799    .       +       .       ID=Bo1g001010.1_exon_1;Parent=Bo1g001010.1
C1      JCVI    CDS     1711    1799    .       +       0       ID=Bo1g001010.1_cds_1;Parent=Bo1g001010.1
C1      JCVI    exon    1937    2288    .       +       .       ID=Bo1g001010.1_exon_2;Parent=Bo1g001010.1
C1      JCVI    CDS     1937    2288    .       +       1       ID=Bo1g001010.1_cds_2;Parent=Bo1g001010.1
C1      JCVI    exon    2366    2833    .       +       .       ID=Bo1g001010.1_exon_3;Parent=Bo1g001010.1
C1      JCVI    CDS     2366    2833    .       +       0       ID=Bo1g001010.1_cds_3;Parent=Bo1g001010.1

genomicfeatures • 2.4k views
0
Entering edit mode
Keith Hughitt ▴ 180
@keith-hughitt-6740
Last seen 9 months ago
United States

Hello,

It looks like you are using old versions of R/GenomicFeatures. My first thought would be to update R to 3.1.2, and then update bioconductor:

source("http://bioconductor.org/biocLite.R")
biocLite()

Also, are the gene IDs unique as well? E.g.

Do you get the same results for:

    grep '      gene    ' annotation.gff | grep -o "ID=[^;]*;" | sort | wc -l

and

    grep '      gene    ' annotation.gff | grep -o "ID=[^;]*;" | sort | uniq| wc -l

(Use cntrl-v + 'tab' to insert tabs before and after "gene" in the regex.)

 

ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 18 hours ago
Seattle, WA, United States

Hi Agnieszka,

In BioC 3.1 (the current devel version, will be released in April), the GenomicFeatures package has a new function: makeTxDbFromGRanges(). This is still a work-in-progress (the man page will be added soon). Can you please try it? You'll need to install R-3.2 (current devel version of R) first, then:

source("http://bioconductor.org/biocLite.R")
biocLite(GenomicFeatures)

Then import your file first as a GRanges object with:

library(rtracklayer)
gr <- import("path/to/your/file")

And finally do:

makeTxDbFromGRanges(gr)

If that still doesn't work, or if you cannot get R-3.2, then it would be great if you could give us access to your file so we can troubleshoot.

The plan is to switch the current implementation of makeTxDbFromGFF() to make it use makeTxDbFromGRanges() internally in the near future.

Thanks,

H.

ADD COMMENT
0
Entering edit mode
I love the new makeTxDbFromGRanges function. It makes things much more transparent and easier to debug. Thanks for adding this useful feature. Thomas
ADD REPLY
0
Entering edit mode

Thanks for the encouragements Thomas, always nice to hear!  H.

ADD REPLY
0
Entering edit mode
@tjeerd-dijkstra-7350
Last seen 9.3 years ago
Netherlands

Hi Agnieszka,

 

I had a similar problem recently and my solution was much more mundane. makeTranscriptDbFromGFF() is apparently very strict in the gff-version pragma. Thus, it want only a single space between the pragma and the number as in "##gff-version 3". However, the  sequence ontology web site where the format is defined is mute on the number of spaces between  the pragma and the number and my GeneDB gff files have three spaces.

In summary, just edit the gff file and remove some spaces did the trick for me. Hope this helps for you too,

Tjeerd

ADD COMMENT
0
Entering edit mode

Tjeerd are you saying that you were getting an error in addition to the warning and that the error also went away when you got rid of the warning?

Just to clarify:

  1. What is picky here is rtracklayer::import() which makeTranscriptDbFromGFF() relies on for importing the GFF file as a GRanges object.
  2. This pickiness is only causing the warning that Agnieszka gets. I would be surprised if this was also the cause of the error she gets thru some obscure domino effect (rtracklayer::import() gives me the exact same GRanges object independently of the number of spaces I put between the pragma and the version number).

Cheers,

H.

ADD REPLY
0
Entering edit mode

 

Hi Herve,

I double checked and you are right the number of spaces is only causing a warning without any further effect. Sorry for the noise.

Tjeerd

ADD REPLY
0
Entering edit mode
@miss-agnieszka-aleksandra-golicz-6531
Last seen 18 months ago
Australia

Hello,

Thank you very much for your responses. In my case it seems that upgrade to R/3.1.2 and fresh installation of bioconductor solved the issue. I have upgraded on my personal windows 8 machine, I am still waiting for the upgrade on the server I was using (don't have root permissions). I will report back if I encounter further problems. So far I have this:

> txdb <- makeTranscriptDbFromGFF("C:\\Users\\Aga\\Documents\\bioinf\\brassica\\thesis\\snpeff\\variantannotation.gff3\\variantannotation.gff3")
extracting transcript information
Extracting gene IDs
extracting transcript information
Processing splicing information for gff3 file.
Deducing exon rank from relative coordinates provided
Prepare the 'metadata' data frame ... metadata: OK
Now generating chrominfo from available sequence names. No chromosome length information is available.
Warning messages:
1: In .local(con, format, text, ...) :
  gff-version directive indicates version is   3, not 3
2: In .deduceExonRankings(exs, format = "gff") :
  Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
3: In matchCircularity(chroms, circ_seqs) :
  None of the strings in your circ_seqs argument match your seqnames.
4: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated") 
5: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated") 
6: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated") 
7: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated") 
8: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated") 
9: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")

Agnieszka

Login before adding your answer.

Traffic: 527 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6