makeTxDbFromGFF Error: subscript contains NAs or out-of-bounds indices - after package update
1
0
Entering edit mode
@ben-mansfeld-10812
Last seen 7.8 years ago
Michigan State University

Hello all, 

I appreciate any help in advance. This has been driving me crazy for the last 24hrs...

I am trying to extract exon lengths from my GFF3 file. 

txdb <- makeTxDbFromGFF("../../Chinese Long Genome/cucumber_v2.gff3",format="gff3")
exBytx<-exonsBy(txdb, by="tx", use.names=T)
txlengthData<-sum(width(reduce(exBytx)))

This worked under a previous version of BioC but after updating R and BioC  now I'm getting this:

txdb <- makeTxDbFromGFF("../../Chinese Long Genome/cucumber_v2.gff3",format="gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error: subscript contains NAs or out-of-bounds indices

> traceback()
18: stop(wmsg(...), call. = FALSE)
17: .subscript_error("subscript contains NAs or out-of-bounds indices")
16: NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append)
15: NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append)
14: normalizeSingleBracketSubscript(i, x, as.NSBS = TRUE)
13: extractROWS(x, i)
12: extractROWS(x, i)
11: .nextMethod(x, i)
10: eval(expr, envir, enclos)
9: eval(call, callEnv)
8: callNextMethod(x, i)
7: Parent[exon_with_gene_parent_IDX]
6: Parent[exon_with_gene_parent_IDX]
5: unlist(Parent[exon_with_gene_parent_IDX], use.names = FALSE)
4: ID[gene_IDX] %in% unlist(Parent[exon_with_gene_parent_IDX], use.names = FALSE)
3: .get_gene_as_tx_IDX(gene_IDX, ID, exon_with_gene_parent_IDX, 
       Parent)
2: makeTxDbFromGRanges(gr, metadata = metadata)
1: makeTxDbFromGFF("../../Chinese Long Genome/cucumber_v2.gff3", 
       format = "gff3")

Running the sample a.gff file that comes with the package works. 

What changed between the versions that now doesn't allow me to import the GFF?

My GFF stayed the same. I even downloaded it again from my db. I subset one gene from the GFF and that doesn't work either. Something about the way our GFF is formatted probably doesn't talk well with makeTxDb. But what? and why suddenly not?

Thanks again in advance.

 

Previous sessionInfo that worked:

sessionInfo()

R version 3.2.3 (2015-12-10)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows >= 8 x64 (build 9200)


locale:

[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252

[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252   


attached base packages:

[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base    


other attached packages:

 [1] goseq_1.22.0              BiocInstaller_1.20.3      rtracklayer_1.28.10       GenomicFeatures_1.20.6  

 [5] AnnotationDbi_1.30.1      Biobase_2.28.0            RSQLite_1.0.0             DBI_0.4-1               

 [9] geneLenDataBase_1.4.0     BiasedUrn_1.07            pheatmap_1.0.8            ggplot2_2.1.0           

[13] DESeq2_1.8.2              RcppArmadillo_0.6.700.6.0 Rcpp_0.12.4               GenomicRanges_1.20.8    

[17] GenomeInfoDb_1.4.3        IRanges_2.2.9             S4Vectors_0.6.6           BiocGenerics_0.14.0     

[21] NCmisc_1.1.4    

Current Session info doesn't work:

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rtracklayer_1.32.0     GenomicFeatures_1.24.2 AnnotationDbi_1.34.3   Biobase_2.32.0         GenomicRanges_1.24.1   GenomeInfoDb_1.8.2     IRanges_2.6.0         
 [8] S4Vectors_0.10.1       BiocGenerics_0.18.0    goseq_1.24.0           geneLenDataBase_1.8.0  BiasedUrn_1.07        

loaded via a namespace (and not attached):
 [1] XVector_0.12.0             zlibbioc_1.18.0            GenomicAlignments_1.8.1    BiocParallel_1.6.2         lattice_0.20-33            tools_3.3.0               
 [7] grid_3.3.0                 SummarizedExperiment_1.2.2 nlme_3.1-128               mgcv_1.8-12                DBI_0.4-1                  Matrix_1.2-6              
[13] bitops_1.0-6               RCurl_1.95-4.8             biomaRt_2.28.0             RSQLite_1.0.0              GO.db_3.3.0                Biostrings_2.40.1         
[19] Rsamtools_1.24.0           XML_3.98-1.4
genomicfeatures • 2.7k views
ADD COMMENT
0
Entering edit mode

It would help to see your GFF file.

ADD REPLY
0
Entering edit mode

Thanks for the reply.

Sorry about that. Here are a few lines from the top:

Chr1    .    gene    2952    4284    .    -    .    ID=Csa1G000010; Name=Csa1G000010
Chr1    .    mRNA    2952    4284    .    -    .    ID=Csa1M000010.1; Parent=Csa1G000010; Name=Csa1M000010.1
Chr1    .    five_prime_utr    4200    4284    .    -    .    ID=Csa1M000010.1.utr5p1; Parent=Csa1M000010.1
Chr1    .    exon    4131    4284    .    -    .    ID=Csa1M000010.1.exon1; Parent=Csa1M000010.1
Chr1    .    CDS    4131    4199    .    -    0    ID=Csa1M000010.1.cds1; Parent=Csa1M000010.1
Chr1    .    exon    3837    4040    .    -    .    ID=Csa1M000010.1.exon2; Parent=Csa1M000010.1
Chr1    .    CDS    3837    4040    .    -    0    ID=Csa1M000010.1.cds2; Parent=Csa1M000010.1
Chr1    .    exon    2952    3377    .    -    .    ID=Csa1M000010.1.exon3; Parent=Csa1M000010.1
Chr1    .    CDS    3324    3377    .    -    0    ID=Csa1M000010.1.cds3; Parent=Csa1M000010.1
Chr1    .    three_prime_utr    2952    3323    .    -    .    ID=Csa1M000010.1.utr3p1; Parent=Csa1M000010.1


ADD REPLY
0
Entering edit mode

At first glance it appears that the format of the file is correct, but somehow the Parent field is not being parsed. The GFF parsing was rewritten in C over the last development cycle, but not by me, so I'll defer to the author.

ADD REPLY
3
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Hi Ben,,

Your GFF3 file has spaces between the tag=value pairs and the preceding semicolon. The new parser in rtracklayer was considering that this space is part of the tag (so the tags would be " Name" and " Parent"), which was causing problems.

This is fixed in rtracklayer 1.32.1 (release) and 1.33.5 (devel). Both versions should become available via biocLite() in the next 48 hours or so.

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Hervé,

Thanks for addressing this. I couldn't figure it out for the life of me.

I appreciate all the work you devs do!

-Ben

ADD REPLY

Login before adding your answer.

Traffic: 672 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6