I would like to make a transcript-based annotation file (TxDb) for Arabidopsis, based on the recent Araport11 genome release. I am using the gff3 file (Araport11_GFF3_genes_transposons.201606.gff, 22 June 2016), available from here.
However, this fails because of an error:
Error in makeTxDbFromGRanges(araport):
some exons are linked to transcripts not found in the file.
While the error message is crystal clear, and I realize the error originates from an apparent mistake in the gff3 file (which has to be corrected by the people at the Arabidopsis Biological Resource Center), I wondered whether it somehow would be possible to have these exons and transcripts identified and returned. This would better enable troubleshooting.
Thanks,
Guido
> library("rtracklayer")
> library("GenomicFeatures")
>
>
> araport <- import.gff3("Araport11_GFF3_genes_transposons.201606.gff", format="gff3")
>
> araport
GRanges object with 789890 ranges and 21 metadata columns:
seqnames ranges strand | source type score phase
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
[1] Chr1 [3631, 5899] + | Araport11 gene <NA> <NA>
[2] Chr1 [3631, 5899] + | Araport11 mRNA <NA> <NA>
[3] Chr1 [3631, 3759] + | Araport11 five_prime_UTR <NA> <NA>
[4] Chr1 [3631, 3913] + | Araport11 exon <NA> <NA>
[5] Chr1 [3760, 3913] + | Araport11 CDS <NA> 0
... ... ... ... . ... ... ... ...
[789886] ChrM [366086, 366700] - | Araport11 gene <NA> <NA>
[789887] ChrM [366086, 366700] - | Araport11 mRNA <NA> <NA>
[789888] ChrM [366086, 366700] - | Araport11 CDS <NA> 0
[789889] ChrM [366086, 366700] - | Araport11 exon <NA> <NA>
[789890] ChrM [366086, 366700] - | Araport11 protein <NA> <NA>
ID Name Note symbol
<character> <character> <CharacterList> <character>
[1] AT1G01010 AT1G01010 NAC domain containing protein 1 NAC001
[2] AT1G01010.1 AT1G01010.1 NAC domain containing protein 1 NAC001
[3] AT1G01010:five_prime_UTR:1 NAC001:five_prime_UTR:1 <NA>
[4] AT1G01010:exon:1 NAC001:exon:1 <NA>
[5] AT1G01010:CDS:1 NAC001:CDS:1 <NA>
... ... ... ... ...
[789886] ATMG01410 ATMG01410 open reading frame 204 ORF204
[789887] ATMG01410.1 ATMG01410.1 open reading frame 204 ORF204
[789888] ATMG01410:CDS:1 ORF204:CDS:1 <NA>
[789889] ATMG01410:exon:1 ORF204:exon:1 <NA>
[789890] ATMG01410.1-Protein ATMG01410.1 <NA>
Alias full_name
<CharacterList> <character>
[1] ANAC001,NAC domain containing protein 1 NAC domain containing protein 1
[2] ANAC001,NAC domain containing protein 1 NAC domain containing protein 1
[3] <NA>
[4] <NA>
[5] <NA>
... ... ...
[789886] open reading frame 204
[789887] open reading frame 204
[789888] <NA>
[789889] <NA>
[789890] <NA>
Dbxref locus_type Parent conf_class
<CharacterList> <character> <CharacterList> <character>
[1] PMID:11118137,PMID:12820902,PMID:15029955,... protein_coding <NA>
[2] PMID:11118137,gene:2200934,UniProt:Q0WV96 <NA> AT1G01010 2
[3] <NA> AT1G01010.1 <NA>
[4] <NA> AT1G01010.1 <NA>
[5] <NA> AT1G01010.1 <NA>
... ... ... ... ...
[789886] locus:504954624 protein_coding <NA>
[789887] gene:1009022691 <NA> ATMG01410 1
[789888] <NA> ATMG01410.1 <NA>
[789889] <NA> ATMG01410.1 <NA>
[789890] <NA> <NA>
conf_rating Derives_from curator_summary description index nochangenat-description
<character> <character> <character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA> <NA> <NA>
[2] **** <NA> <NA> <NA> <NA> <NA>
[3] <NA> <NA> <NA> <NA> <NA> <NA>
[4] <NA> <NA> <NA> <NA> <NA> <NA>
[5] <NA> <NA> <NA> <NA> <NA> <NA>
... ... ... ... ... ... ...
[789886] <NA> <NA> <NA> <NA> <NA> <NA>
[789887] ***** <NA> <NA> <NA> 1 <NA>
[789888] <NA> <NA> <NA> <NA> <NA> <NA>
[789889] <NA> <NA> <NA> <NA> <NA> <NA>
[789890] <NA> ATMG01410.1 <NA> <NA> <NA> <NA>
-------
seqinfo: 7 sequences from an unspecified genome; no seqlengths
>
> txdb <- makeTxDbFromGRanges(araport)
Error in makeTxDbFromGRanges(araport) :
some exons are linked to transcripts not found in the file
>
> sessionInfo()
R version 3.3.1 Patched (2016-06-28 r70853)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicFeatures_1.24.3 AnnotationDbi_1.34.3 Biobase_2.32.0 rtracklayer_1.32.1
[5] GenomicRanges_1.24.2 GenomeInfoDb_1.8.2 IRanges_2.6.1 S4Vectors_0.10.1
[9] BiocGenerics_0.18.0
loaded via a namespace (and not attached):
[1] XML_3.98-1.4 Rsamtools_1.24.0 Biostrings_2.40.2
[4] bitops_1.0-6 GenomicAlignments_1.8.3 DBI_0.4-1
[7] RSQLite_1.0.0 zlibbioc_1.18.0 XVector_0.12.0
[10] BiocParallel_1.6.2 tools_3.3.1 biomaRt_2.28.0
[13] RCurl_1.95-4.8 SummarizedExperiment_1.2.3
>

Thanks Herve, working nicely now!