I would like to make a transcript-based annotation file (TxDb) for Arabidopsis, based on the recent Araport11 genome release. I am using the gff3 file (Araport11_GFF3_genes_transposons.201606.gff, 22 June 2016), available from here.
However, makeTxDbFromGRanges() fails with the following error:
Error in makeTxDbFromGRanges(araport):
some exons are linked to transcripts not found in the file.
While the error message is crystal clear, and I realize the error originates from an apparent mistake in the gff3 file (which has to be corrected by the people at the Arabidopsis Biological Resource Center), I wondered whether it would somehow be possible to have these exons and transcripts identified and returned. This would make troubleshooting much easier.
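To illustrate the kind of diagnostic I have in mind, here is a minimal sketch in base R. Note that `exonParentTypes` is a made-up helper, not an existing GenomicFeatures or rtracklayer function, and the toy vectors only mimic the ID, type and Parent columns of the imported object. It only checks whether each exon's Parent occurs in the ID column and, if so, what feature type that parent has; I do not know whether this is the same test makeTxDbFromGRanges() applies internally.

```r
# Hypothetical helper, not part of GenomicFeatures or rtracklayer.
# For every exon, look up the feature type of each Parent it lists.
# A parent_type of NA means that Parent ID does not occur in the ID column.
exonParentTypes <- function(id, type, parent) {
    is_exon   <- type == "exon"
    n_parents <- lengths(parent[is_exon])
    parent_id <- unlist(parent[is_exon], use.names = FALSE)
    if (is.null(parent_id)) parent_id <- character(0)  # no exon parents at all
    data.frame(
        exon_id     = rep(id[is_exon], n_parents),  # one row per exon/parent pair
        parent_id   = parent_id,
        parent_type = type[match(parent_id, id)],
        stringsAsFactors = FALSE
    )
}

# Toy data (assumed, for illustration only).
id     <- c("G1", "G1.1", "G1:exon:1", "G2:exon:1")
type   <- c("gene", "mRNA", "exon", "exon")
parent <- list(character(0), "G1", "G1.1", "G2.1")  # "G2.1" is never defined

res <- exonParentTypes(id, type, parent)
res[is.na(res$parent_type), ]            # exons whose parent is missing
table(res$parent_type, useNA = "ifany")  # feature types that exons hang off
```

On the real object, something like `exonParentTypes(araport$ID, as.character(araport$type), as.list(araport$Parent))` should presumably work, although I have not verified this on the full file. The table of parent types might also reveal exons attached to feature types that are present in the file but not treated as transcripts.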
Thanks,
Guido
> library("rtracklayer")
> library("GenomicFeatures")
>
>
> araport <- import.gff3("Araport11_GFF3_genes_transposons.201606.gff", format="gff3")
>
> araport
GRanges object with 789890 ranges and 21 metadata columns:
           seqnames           ranges strand |    source           type     score     phase
              <Rle>        <IRanges>  <Rle> |  <factor>       <factor> <numeric> <integer>
       [1]     Chr1     [3631, 5899]      + | Araport11           gene      <NA>      <NA>
       [2]     Chr1     [3631, 5899]      + | Araport11           mRNA      <NA>      <NA>
       [3]     Chr1     [3631, 3759]      + | Araport11 five_prime_UTR      <NA>      <NA>
       [4]     Chr1     [3631, 3913]      + | Araport11           exon      <NA>      <NA>
       [5]     Chr1     [3760, 3913]      + | Araport11            CDS      <NA>         0
       ...      ...              ...    ... .       ...            ...       ...       ...
  [789886]     ChrM [366086, 366700]      - | Araport11           gene      <NA>      <NA>
  [789887]     ChrM [366086, 366700]      - | Araport11           mRNA      <NA>      <NA>
  [789888]     ChrM [366086, 366700]      - | Araport11            CDS      <NA>         0
  [789889]     ChrM [366086, 366700]      - | Araport11           exon      <NA>      <NA>
  [789890]     ChrM [366086, 366700]      - | Araport11        protein      <NA>      <NA>
                                   ID                    Name                            Note      symbol
                          <character>             <character>                 <CharacterList> <character>
       [1]                  AT1G01010               AT1G01010 NAC domain containing protein 1      NAC001
       [2]                AT1G01010.1             AT1G01010.1 NAC domain containing protein 1      NAC001
       [3] AT1G01010:five_prime_UTR:1 NAC001:five_prime_UTR:1                                        <NA>
       [4]           AT1G01010:exon:1           NAC001:exon:1                                        <NA>
       [5]            AT1G01010:CDS:1            NAC001:CDS:1                                        <NA>
       ...                        ...                     ...                             ...         ...
  [789886]                  ATMG01410               ATMG01410          open reading frame 204      ORF204
  [789887]                ATMG01410.1             ATMG01410.1          open reading frame 204      ORF204
  [789888]            ATMG01410:CDS:1            ORF204:CDS:1                                        <NA>
  [789889]           ATMG01410:exon:1           ORF204:exon:1                                        <NA>
  [789890]        ATMG01410.1-Protein             ATMG01410.1                                        <NA>
                                             Alias                       full_name
                                   <CharacterList>                     <character>
       [1] ANAC001,NAC domain containing protein 1 NAC domain containing protein 1
       [2] ANAC001,NAC domain containing protein 1 NAC domain containing protein 1
       [3]                                                                    <NA>
       [4]                                                                    <NA>
       [5]                                                                    <NA>
       ...                                     ...                             ...
  [789886]                                                  open reading frame 204
  [789887]                                                  open reading frame 204
  [789888]                                                                    <NA>
  [789889]                                                                    <NA>
  [789890]                                                                    <NA>
                                                  Dbxref     locus_type          Parent  conf_class
                                         <CharacterList>    <character> <CharacterList> <character>
       [1] PMID:11118137,PMID:12820902,PMID:15029955,... protein_coding                        <NA>
       [2]     PMID:11118137,gene:2200934,UniProt:Q0WV96           <NA>       AT1G01010           2
       [3]                                                         <NA>     AT1G01010.1        <NA>
       [4]                                                         <NA>     AT1G01010.1        <NA>
       [5]                                                         <NA>     AT1G01010.1        <NA>
       ...                                           ...            ...             ...         ...
  [789886]                               locus:504954624 protein_coding                        <NA>
  [789887]                               gene:1009022691           <NA>       ATMG01410           1
  [789888]                                                         <NA>     ATMG01410.1        <NA>
  [789889]                                                         <NA>     ATMG01410.1        <NA>
  [789890]                                                         <NA>                        <NA>
           conf_rating Derives_from curator_summary description       index nochangenat-description
           <character>  <character>     <character> <character> <character>             <character>
       [1]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       [2]        ****         <NA>            <NA>        <NA>        <NA>                    <NA>
       [3]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       [4]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       [5]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       ...         ...          ...             ...         ...         ...                     ...
  [789886]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
  [789887]       *****         <NA>            <NA>        <NA>           1                    <NA>
  [789888]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
  [789889]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
  [789890]        <NA>  ATMG01410.1            <NA>        <NA>        <NA>                    <NA>
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths
>
> txdb <- makeTxDbFromGRanges(araport)
Error in makeTxDbFromGRanges(araport) :
  some exons are linked to transcripts not found in the file
>
> sessionInfo()
R version 3.3.1 Patched (2016-06-28 r70853)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    
attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] GenomicFeatures_1.24.3 AnnotationDbi_1.34.3   Biobase_2.32.0         rtracklayer_1.32.1    
[5] GenomicRanges_1.24.2   GenomeInfoDb_1.8.2     IRanges_2.6.1          S4Vectors_0.10.1      
[9] BiocGenerics_0.18.0   
loaded via a namespace (and not attached):
 [1] XML_3.98-1.4               Rsamtools_1.24.0           Biostrings_2.40.2         
 [4] bitops_1.0-6               GenomicAlignments_1.8.3    DBI_0.4-1                 
 [7] RSQLite_1.0.0              zlibbioc_1.18.0            XVector_0.12.0            
[10] BiocParallel_1.6.2         tools_3.3.1                biomaRt_2.28.0            
[13] RCurl_1.95-4.8             SummarizedExperiment_1.2.3
>
Thanks Herve, working nicely now!