rtracklayer v1.30.0 can no longer import gff files
Jenny Drnevich ★ 2.0k


I just upgraded to R 3.2.2 / BioC 3.2 / rtracklayer 1.30.0, and code that imports NCBI's GFF3 files now throws an error, although it worked fine with R 3.2.1 / BioC 3.1 / rtracklayer 1.28.6. Below is example code for both versions, each trying to import NCBI's mouse ref_GRCm38.p3_top_level.gff3.gz, downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/M_musculus/GFF/. The new rtracklayer fails with the error "cannnot determine seqnames column unambiguously". I looked through the help page ?import.gff3 in both versions but don't see any changes. Is this a new bug? My workaround is to save the GRanges object from the old version as an .RData file and then load it into the new R/BioC, which seems to work fine. Any help in getting the new rtracklayer to read GFF files would be appreciated!



R 3.2.1 / BioC 3.1 / rtracklayer 1.28.6:

R version 3.2.1 (2015-06-18) -- "World-Famous Astronaut"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

#lines removed ...

> .libPaths()
[1] "C:/Users/drnevich/Documents/R/win-library/3.2"
[2] "C:/Program Files/R/R-3.2.1/library"           
> #Change to point to old BioC3.1 packages I saved...
> .libPaths(new = "C:/Users/drnevich/Documents/R/win-library/3.2_BioC3.1")
> library(rtracklayer)
Loading required package: GenomicRanges
Loading required package: BiocGenerics
Loading required package: parallel
#lines removed...
> setwd("D:/Statistics/Freund/Fire_sept2015/ReSeq/")
> #mouse GFF downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/M_musculus/GFF/
> gff0 <- import("ref_GRCm38.p3_top_level.gff3.gz")
> #no errors!
> save(gff0, file = "ref_GRCm38.p3_top_level.gff3.RData")
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
[1] rtracklayer_1.28.6   GenomicRanges_1.20.5 GenomeInfoDb_1.4.1   IRanges_2.2.5       
[5] S4Vectors_0.6.2      BiocGenerics_0.14.0 

loaded via a namespace (and not attached):
 [1] XML_3.98-1.3            Rsamtools_1.20.4        Biostrings_2.36.1      
 [4] bitops_1.0-6            GenomicAlignments_1.4.1 futile.options_1.0.0   
 [7] zlibbioc_1.14.0         XVector_0.8.0           futile.logger_1.4.1    
[10] lambda.r_1.1.7          BiocParallel_1.2.11     tools_3.2.1            
[13] RCurl_1.95-4.7 


R 3.2.2 / BioC 3.2 / rtracklayer 1.30.0:

R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
#lines removed...

> .libPaths()
[1] "C:/Users/drnevich/Documents/R/win-library/3.2"
[2] "C:/Program Files/R/R-3.2.2/library"           
> #keep to use the new BioC 3.2 packages
> library(rtracklayer)
Loading required package: GenomicRanges
Loading required package: BiocGenerics
Loading required package: parallel
#lines removed...
> setwd("D:/Statistics/Freund/Fire_sept2015/ReSeq/")
> #mouse GFF downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/M_musculus/GFF/
> gff0 <- import("ref_GRCm38.p3_top_level.gff3.gz")
Error in .find_seqnames_col(df_colnames0, seqnames.field0, prefix) : 
  cannnot determine seqnames column unambiguously
> #Load in the RData file output from R 3.2.1:
> load("ref_GRCm38.p3_top_level.gff3.RData")
> #Check to see if I can use it: 
> table(gff0$type)

    C_gene_segment         cDNA_match                CDS     D_gene_segment 
                32               8710             937664                 24 
            D_loop               exon               gene     J_gene_segment 
                 1            1170729              48835                156 
             match               mRNA              ncRNA primary_transcript 
              7271              78013              24746               1283 
            region               rRNA   sequence_variant         transcript 
               195                 35                  6               7067 
              tRNA     V_gene_segment 
               437                613 

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
[1] rtracklayer_1.30.0   GenomicRanges_1.22.0
[3] GenomeInfoDb_1.6.0   IRanges_2.4.0       
[5] S4Vectors_0.8.0      BiocGenerics_0.16.0 

loaded via a namespace (and not attached):
 [1] XML_3.98-1.3               Rsamtools_1.22.0          
 [3] Biostrings_2.38.0          GenomicAlignments_1.6.0   
 [5] bitops_1.0-6               futile.options_1.0.0      
 [7] zlibbioc_1.16.0            XVector_0.10.0            
 [9] futile.logger_1.4.1        lambda.r_1.1.7            
[11] BiocParallel_1.4.0         tools_3.2.2               
[13] Biobase_2.30.0             RCurl_1.95-4.7            
[15] SummarizedExperiment_1.0.0
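
The workaround described above (import under the old installation, save the object, load it under the new one) could be wrapped so that a script tries a fresh import first and falls back to the saved copy only when the import fails. This is a minimal sketch, not something from rtracklayer itself: `import_with_fallback` and its arguments are hypothetical names, and the importer is passed in as a function so the sketch does not depend on any particular package version.

```r
# Hypothetical helper: try a fresh import, fall back to a saved .RData copy.
# `importer` is any function taking a file path; in practice it would be
# rtracklayer::import, but that is an assumption of this sketch.
import_with_fallback <- function(gff_path, rdata_path, importer) {
  tryCatch(
    importer(gff_path),
    error = function(e) {
      message("Import failed (", conditionMessage(e), "); loading ", rdata_path)
      env <- new.env()
      # load() returns the names of the objects it restored
      obj_names <- load(rdata_path, envir = env)
      get(obj_names[1], envir = env)
    }
  )
}
```

With the files from this post, the call would look like `gff0 <- import_with_fallback("ref_GRCm38.p3_top_level.gff3.gz", "ref_GRCm38.p3_top_level.gff3.RData", rtracklayer::import)`. The sketch assumes the .RData file holds a single object, and rtracklayer (or GenomicRanges) still needs to be loaded for the restored GRanges to behave properly.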


rtracklayer import.gff3 bug

Hi Jenny,

This is a regression I introduced in import.gff() when I re-implemented it in BioC 3.2. I just fixed it in rtracklayer 1.30.1, which should become available tomorrow (Oct 22nd) via biocLite(). Thanks for the catch and sorry for the inconvenience.
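
To check whether a given installation is affected, the installed version can be compared against the fixed one. The sketch below is base R only; `has_gff_regression` is a hypothetical name, and the affected range is an assumption drawn from this thread (1.28.6 worked, 1.30.0 failed, 1.30.1 carries the fix), so it does not try to say anything about other versions.

```r
# Hypothetical check: TRUE if the version falls in the range this thread
# reports as broken (1.30.0 up to, but not including, the fixed 1.30.1).
has_gff_regression <- function(installed) {
  v <- package_version(installed)
  v >= "1.30.0" && v < "1.30.1"
}
```

For a live session, something like `has_gff_regression(packageVersion("rtracklayer"))` should work; if it returns `TRUE`, updating through `biocLite()` once 1.30.1 is available would be the next step.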



Ok – it finally came through late Thursday night, in time for my workshop last Friday. Thank you!!
Glad you got this in time for your workshop!  H.


