Search
Question: troubleshooting makeTxDbFromGRanges: some exons are linked to transcripts not found in the file
0
gravatar for Guido Hooiveld
16 months ago by
Guido Hooiveld2.2k
Wageningen University, Wageningen, the Netherlands
Guido Hooiveld2.2k wrote:

I would like to make a transcript-based annotation file (TxDb) for Arabidopsis, based on the recent Araport11 genome release. I am using the gff3 file (Araport11_GFF3_genes_transposons.201606.gff, 22 June 2016), available from here.

However, this fails because of an error:

Error in makeTxDbFromGRanges(araport):

some exons are linked to transcripts not found in the file.

While the error message is crystal clear, and I realize the error originates from an apparent mistake in the gff3 file (which has to be corrected by the people at the Arabidopsis Biological Resource Center), I wondered whether it somehow would be possible to have these exons and transcripts identified and returned. This would better enable troubleshooting.

Thanks,

Guido

 

> library("rtracklayer")
> library("GenomicFeatures")
>
>
> araport <- import.gff3("Araport11_GFF3_genes_transposons.201606.gff", format="gff3")
>
> araport
GRanges object with 789890 ranges and 21 metadata columns:
           seqnames           ranges strand |    source           type     score     phase
              <Rle>        <IRanges>  <Rle> |  <factor>       <factor> <numeric> <integer>
       [1]     Chr1     [3631, 5899]      + | Araport11           gene      <NA>      <NA>
       [2]     Chr1     [3631, 5899]      + | Araport11           mRNA      <NA>      <NA>
       [3]     Chr1     [3631, 3759]      + | Araport11 five_prime_UTR      <NA>      <NA>
       [4]     Chr1     [3631, 3913]      + | Araport11           exon      <NA>      <NA>
       [5]     Chr1     [3760, 3913]      + | Araport11            CDS      <NA>         0
       ...      ...              ...    ... .       ...            ...       ...       ...
  [789886]     ChrM [366086, 366700]      - | Araport11           gene      <NA>      <NA>
  [789887]     ChrM [366086, 366700]      - | Araport11           mRNA      <NA>      <NA>
  [789888]     ChrM [366086, 366700]      - | Araport11            CDS      <NA>         0
  [789889]     ChrM [366086, 366700]      - | Araport11           exon      <NA>      <NA>
  [789890]     ChrM [366086, 366700]      - | Araport11        protein      <NA>      <NA>
                                   ID                    Name                            Note      symbol
                          <character>             <character>                 <CharacterList> <character>
       [1]                  AT1G01010               AT1G01010 NAC domain containing protein 1      NAC001
       [2]                AT1G01010.1             AT1G01010.1 NAC domain containing protein 1      NAC001
       [3] AT1G01010:five_prime_UTR:1 NAC001:five_prime_UTR:1                                        <NA>
       [4]           AT1G01010:exon:1           NAC001:exon:1                                        <NA>
       [5]            AT1G01010:CDS:1            NAC001:CDS:1                                        <NA>
       ...                        ...                     ...                             ...         ...
  [789886]                  ATMG01410               ATMG01410          open reading frame 204      ORF204
  [789887]                ATMG01410.1             ATMG01410.1          open reading frame 204      ORF204
  [789888]            ATMG01410:CDS:1            ORF204:CDS:1                                        <NA>
  [789889]           ATMG01410:exon:1           ORF204:exon:1                                        <NA>
  [789890]        ATMG01410.1-Protein             ATMG01410.1                                        <NA>
                                             Alias                       full_name
                                   <CharacterList>                     <character>
       [1] ANAC001,NAC domain containing protein 1 NAC domain containing protein 1
       [2] ANAC001,NAC domain containing protein 1 NAC domain containing protein 1
       [3]                                                                    <NA>
       [4]                                                                    <NA>
       [5]                                                                    <NA>
       ...                                     ...                             ...
  [789886]                                                  open reading frame 204
  [789887]                                                  open reading frame 204
  [789888]                                                                    <NA>
  [789889]                                                                    <NA>
  [789890]                                                                    <NA>
                                                                                                                                                                                                                                                                                                                                                                                                          Dbxref     locus_type          Parent  conf_class
                                         <CharacterList>    <character> <CharacterList> <character>
       [1] PMID:11118137,PMID:12820902,PMID:15029955,... protein_coding                        <NA>
       [2]     PMID:11118137,gene:2200934,UniProt:Q0WV96           <NA>       AT1G01010           2
       [3]                                                         <NA>     AT1G01010.1        <NA>
       [4]                                                         <NA>     AT1G01010.1        <NA>
       [5]                                                         <NA>     AT1G01010.1        <NA>
       ...                                           ...            ...             ...         ...
  [789886]                               locus:504954624 protein_coding                        <NA>
  [789887]                               gene:1009022691           <NA>       ATMG01410           1
  [789888]                                                         <NA>     ATMG01410.1        <NA>
  [789889]                                                         <NA>     ATMG01410.1        <NA>
  [789890]                                                         <NA>                        <NA>
           conf_rating Derives_from curator_summary description       index nochangenat-description
           <character>  <character>     <character> <character> <character>             <character>
       [1]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       [2]        ****         <NA>            <NA>        <NA>        <NA>                    <NA>
       [3]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       [4]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       [5]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       ...         ...          ...             ...         ...         ...                     ...
  [789886]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
  [789887]       *****         <NA>            <NA>        <NA>           1                    <NA>
  [789888]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
  [789889]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
  [789890]        <NA>  ATMG01410.1            <NA>        <NA>        <NA>                    <NA>
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths
>
> txdb <- makeTxDbFromGRanges(araport)
Error in makeTxDbFromGRanges(araport) :
  some exons are linked to transcripts not found in the file
>
> sessionInfo()
R version 3.3.1 Patched (2016-06-28 r70853)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicFeatures_1.24.3 AnnotationDbi_1.34.3   Biobase_2.32.0         rtracklayer_1.32.1    
[5] GenomicRanges_1.24.2   GenomeInfoDb_1.8.2     IRanges_2.6.1          S4Vectors_0.10.1      
[9] BiocGenerics_0.18.0   

loaded via a namespace (and not attached):
 [1] XML_3.98-1.4               Rsamtools_1.24.0           Biostrings_2.40.2         
 [4] bitops_1.0-6               GenomicAlignments_1.8.3    DBI_0.4-1                 
 [7] RSQLite_1.0.0              zlibbioc_1.18.0            XVector_0.12.0            
[10] BiocParallel_1.6.2         tools_3.3.1                biomaRt_2.28.0            
[13] RCurl_1.95-4.8             SummarizedExperiment_1.2.3
>
ADD COMMENTlink modified 16 months ago by Hervé Pagès ♦♦ 13k • written 16 months ago by Guido Hooiveld2.2k
1
gravatar for Hervé Pagès
16 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Guido,

You make a good point that error reporting could be improved in makeTxDbFromGRanges(). In that case the error was caused by makeTxDbFromGRanges() not recognizing features of type lnc_RNA, antisense_lncRNA, transcript_region, and pseudogenic_tRNA, as transcripts. I fixed this in GenomicFeatures 1.24.4 (release) and 1.25.15 (devel).

With GenomicFeatures 1.24.4:

library(rtracklayer)
library(GenomicFeatures)
araport <- import("~/Downloads/Araport11_GFF3_genes_transposons.201606.gff")
txdb <- makeTxDbFromGRanges(araport)
txdb
# TxDb object:
## Db type: TxDb
## Supporting package: GenomicFeatures
## Genome: NA
## transcript_nrow: 60204
## exon_nrow: 203352
## cds_nrow: 221781
## Db created by: GenomicFeatures package from Bioconductor
## Creation time: 2016-07-07 10:31:12 -0700 (Thu, 07 Jul 2016)
## GenomicFeatures version at creation time: 1.24.4
## RSQLite version at creation time: 1.0.0
## DBSCHEMAVERSION: 1.1

Both versions should become available via biocLite() in the next couple of days.

Cheers,

H.

 

ADD COMMENTlink written 16 months ago by Hervé Pagès ♦♦ 13k

Thanks Herve, working nicely now!

ADD REPLYlink written 16 months ago by Guido Hooiveld2.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 302 users visited in the last hour