Question

makeTxDbFromGFF Error: subscript contains NAs

0

Entering edit mode

marc_bes • 0

@marc_bes-13544

Last seen 7.5 years ago

Hello all,

I am trying to create an TxDB from a GFF3 file.

> gtfFile <- "../../Emihu1_best_genes_altered.gff"

> txdb <- makeTxDbFromGFF(gtfFile, format="auto")

However it gives me the following error:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error: subscript contains NAs


> traceback()
18: stop(wmsg(...), call. = FALSE)
17: .subscript_error("subscript contains NAs")
16: NSBS(i, x, exact = exact, strict.upper.bound = !allow.append,
        allow.NAs = allow.NAs)
15: NSBS(i, x, exact = exact, strict.upper.bound = !allow.append,
        allow.NAs = allow.NAs)
14: normalizeSingleBracketSubscript(i, x, as.NSBS = TRUE)
13: extractROWS(x, i)
12: extractROWS(x, i)
11: .nextMethod(x, i)
10: eval(call, callEnv)
9: eval(call, callEnv)
8: callNextMethod(x, i)
7: Parent[exon_with_gene_parent_IDX]
6: Parent[exon_with_gene_parent_IDX]
5: unlist(Parent[exon_with_gene_parent_IDX], use.names = FALSE)
4: ID[gene_IDX] %in% unlist(Parent[exon_with_gene_parent_IDX], use.names = FALSE)
3: .get_gene_as_tx_IDX(gene_IDX, ID, exon_with_gene_parent_IDX,
       Parent)
2: makeTxDbFromGRanges(gr, metadata = metadata)
1: makeTxDbFromGFF(gtfFile, format = "gff3")

My GFF files looks as follows (with an additional column for "attributes"):

> read.gff(gtffile, na.strings = c(".", "?"))
                seqid source        type   start     end score strand phase
1          scaffold_1    JGI        exon    2702    2844    NA      -  <NA>
2          scaffold_1    JGI        exon    2995    3157    NA      -  <NA>
3          scaffold_1    JGI         CDS    3042    3157    NA      -     0
4          scaffold_1    JGI  stop_codon    3042    3044    NA      -     0
5          scaffold_1    JGI        exon    3042    3176    NA      -  <NA>
6          scaffold_1    JGI         CDS    3042    3176    NA      -     2
7          scaffold_1    JGI  stop_codon    3042    3044    NA      -     0
8          scaffold_1    JGI        exon    3442    3678    NA      -  <NA>

Is the error coming from the NAs in the "score" column?

If so, are there any solutions to prevent this?

Thanks in advance.

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] ape_4.1                    ensembldb_2.0.3
 [3] AnnotationFilter_1.0.0     DESeq2_1.16.1
 [5] BiocInstaller_1.26.0       BiocParallel_1.10.1
 [7] GenomicAlignments_1.12.1   SummarizedExperiment_1.6.3
 [9] DelayedArray_0.2.7         matrixStats_0.52.2
[11] GenomicFeatures_1.28.4     AnnotationDbi_1.38.1
[13] Biobase_2.36.2             Rsamtools_1.28.0
[15] Biostrings_2.44.1          XVector_0.16.0
[17] GenomicRanges_1.28.3       GenomeInfoDb_1.12.2
[19] IRanges_2.10.2             S4Vectors_0.14.3
[21] BiocGenerics_0.22.0

loaded via a namespace (and not attached):
 [1] httr_1.2.1                    bit64_0.9-7
 [3] AnnotationHub_2.8.2           splines_3.4.0
 [5] shiny_1.0.3                   Formula_1.2-1
 [7] interactiveDisplayBase_1.14.0 latticeExtra_0.6-28
 [9] blob_1.1.0                    GenomeInfoDbData_0.99.0
[11] yaml_2.1.14                   RSQLite_2.0
[13] backports_1.1.0               lattice_0.20-35
[15] digest_0.6.12                 RColorBrewer_1.1-2
[17] checkmate_1.8.2               colorspace_1.3-2
[19] httpuv_1.3.3                  htmltools_0.3.6
[21] Matrix_1.2-10                 plyr_1.8.4
[23] pkgconfig_2.0.1               XML_3.98-1.7
[25] biomaRt_2.32.1                genefilter_1.58.1
[27] zlibbioc_1.22.0               xtable_1.8-2
[29] scales_0.4.1                  htmlTable_1.9
[31] tibble_1.3.3                  annotate_1.54.0
[33] ggplot2_2.2.1                 nnet_7.3-12
[35] lazyeval_0.2.0                mime_0.5
[37] survival_2.41-3               magrittr_1.5
[39] memoise_1.1.0                 nlme_3.1-131
[41] foreign_0.8-68                tools_3.4.0
[43] data.table_1.10.4             stringr_1.2.0
[45] munsell_0.4.3                 locfit_1.5-9.1
[47] cluster_2.0.6                 compiler_3.4.0
[49] rlang_0.1.1                   grid_3.4.0
[51] RCurl_1.95-4.8                htmlwidgets_0.8
[53] tcltk_3.4.0                   bitops_1.0-6
[55] base64enc_0.1-3               gtable_0.2.0
[57] curl_2.6                      DBI_0.6-1
[59] R6_2.2.2                      gridExtra_2.2.1
[61] knitr_1.16                    rtracklayer_1.36.4
[63] bit_1.1-12                    Hmisc_4.0-3
[65] ProtGenerics_1.8.0            stringi_1.1.5
[67] Rcpp_0.12.11                  geneplotter_1.54.0
[69] rpart_4.1-11                  acepack_1.4.1

genomicfeatures granges maketxdbfromgff • 4.1k views

ADD COMMENT • link updated 7.3 years ago by Hervé Pagès 16k • written 7.5 years ago by marc_bes • 0

0

Entering edit mode

It is hard for me to try to debug this without the file that is causing problems. Could you please email me off list (lori.shepherd@roswellpark.org) to figure out a solution.

ADD REPLY • link 7.5 years ago shepherl 4.1k

score 0 · Answer 1 · 2017-07-31

It is not coming from the scores being NA, but from non-standard GFF format.

You can get partial results by importing 'by hand'

gr2 <- import("Emihu1_best_genes.gff", colnames=c("name", "transcriptId", "exonNumber", "type"))

and re-naming columns

names(mcols(gr2))[1:3] = c("gene_id", "transcript_id", "exon_id")

and finally creating a partial TxDb

txdb = makeTxDbFromGRanges(gr2)

This still is incomplete.

score 0 · Answer 2 · 2017-08-30

Hi Marc,

Thanks for sending us the Emihu1_best_genes.gff file. I had a chance to look at it. Please note that our software only supports valid GTF or GFF3 files but Emihu1_best_genes.gff does not look like any of that.

Here is a slightly improved hack. Unlike the hack we provided earlier, this one imports the CDS information:

library(GenomicFeatures)
library(rtracklayer)

gr <- import("Emihu1_best_genes.gff")
gr <- gr[!(mcols(gr)$type %in% c("start_codon", "stop_codon"))]

## Create gene_id and transcript_id metadata columns:

transcriptId <- mcols(gr)$transcriptId
proteinId <- mcols(gr)$proteinId
transcript_id <- ifelse(is.na(transcriptId), proteinId, transcriptId)
## Sanity check:
stopifnot(identical(runLength(Rle(transcript_id)),
                    runLength(Rle(mcols(gr)$name))),
          !anyDuplicated(runValue(Rle(transcript_id))),
          !anyDuplicated(runValue(Rle(mcols(gr)$name))))
mcols(gr)$gene_id <- mcols(gr)$name
mcols(gr)$transcript_id <- transcript_id

txdb <- makeTxDbFromGRanges(gr)

However, like with the previous hack, this will also leave some transcripts behind (413 out of 39126). These transcripts left behind have exons that overlap. For example transcript 351861 has the following exons and CDS:

      start    end strand type transcriptId proteinId exonNumber
     526090 526741      + exon       351861      <NA>       <NA>
     526299 526741      +  CDS         <NA>    351861          1
     526622 526744      + exon       351861      <NA>       <NA>

As you can see, the 2 exons for this transcript overlap! So, strictly speaking, one can't say that one exon is located downstream of the other. This makes it unclear what the exact structure of the transcript is (i.e. in which order the 2 exons should be put together to form the transcript). Instead of trying to guess, makeTxDbFromGRanges() drops these transcripts that have ambiguous exon ranking. Note that the exonNumber column here could give a clue of what the exon ranking is but this column is unconventional so makeTxDbFromGRanges() is not making any use of it. (Also note that the column is ranking the CDS instead of the exons: this will allow to guess exon ranking most of the time but not always.) FWIW I'm not aware of any conventional way of storing the exon ranking in a GTF file (which is one of the reasons why GTF is inferior to GFF3). For GFF3 though, what seems to be the convention is to encode the rank in the exon IDs e.g. 351861:1, 351861:2, etc... makeTxDbFromGRanges() knows how to make use of that.

All this to say that the "real true solution" would be to fix the software that was used to generate the Emihu1_best_genes.gff file so it produces a file that is compliant with the GFF3 specs and with the established convention for storing exon ranking.

Best,

H.