Reading a Large File in Chunks
Haiying.Kong ▴ 110
@haiyingkong-9254
Germany

  I need to read a large text file, about 77 GB in size with about 50 million lines.

  I can use read.table() with the skip and nrows parameters to read the file in chunks, but then the skipped lines are counted again from the start of the file each time I read a chunk.

  Is there a function that keeps something like a pointer into the file, so that each new chunk starts directly at the right line?

@martin-morgan-1513
United States

I happened to be in a package directory

> dir()
 [1] "DESCRIPTION"      "inst"             "LICENSE"          "man"             
 [5] "NAMESPACE"        "NEWS"             "R"                "Rsamtools.mk.win"
 [9] "src"              "tests"            "vignettes"       

I opened a connection

> fl = file("DESCRIPTION", "r")

and read the file in chunks (these would be millions of rows for a real use case)

> readLines(fl, 3)
[1] "Package: Rsamtools"                                                             
[2] "Type: Package"                                                                  
[3] "Title: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import"
> readLines(fl, 3)
[1] "Version: 1.29.0"                                                              
[2] "Author: Martin Morgan, Herv\\'e Pag\\`es, Valerie Obenchain, Nathaniel Hayden"
[3] "Maintainer: Bioconductor Package Maintainer"                                  

and then closed it

> close(fl)

Connections can be used in many places, e.g., read.csv() and probably rtracklayer::import(). GenomicFiles::reduceByYield() and friends enable this for standard bioinformatics files; see the vignette.
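The same idea can be written as a loop. This is only a minimal sketch on a throwaway temporary file; the file contents, the chunk size of 4, and the names path, con and n.total are made up for illustration.

```r
# Toy input: a temporary file with 10 lines (hypothetical data)
path <- tempfile()
writeLines(sprintf("line %d", 1:10), path)

# Open the connection once; it remembers its position between reads
con <- file(path, "r")
n.total <- 0
chunk.sizes <- integer(0)
repeat {
  chunk <- readLines(con, n = 4)     # next (up to) 4 lines
  if (length(chunk) == 0) break      # nothing left to read
  n.total <- n.total + length(chunk)
  chunk.sizes <- c(chunk.sizes, length(chunk))
}
close(con)
```

Because the connection stays open, each readLines() call continues where the previous one stopped, so nothing is re-read or re-counted.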

 


Thank you so much for your prompt help  :)

I was about to switch to other work, because I thought it would take at least a day before anyone replied with helpful information.


*********************************************************************************************************************

My mistake. read.table takes care of the problem.

nrows:

integer: the maximum number of rows to read in.

*********************************************************************************************************************

It seems I can only use readLines, even though it has the disadvantages of having no comment.char argument and returning each row as a single collapsed string, because read.table gives an error message when no lines are left.

*********************************************************************************************************************

 

What would be the easiest way to count the number of remaining rows?

fl = file(maf.file, "r")
repeat  {
  aster = read.table(fl, header=FALSE, stringsAsFactors=FALSE, nrows=trunk.size, quote="", sep="\t")
  }
close(fl)

  nrows would have to be min(n.remaining.rows, trunk.size).

 


Do something like

aster = read.table(...)
if (nrow(aster) == 0)
    break

developed more here.


If there are no more lines left, I get an error message.

I could use try(read.table....), but there is a risk that I would treat errors caused by the file format as the error for no more lines left.

The file is a MAF file, so it is not very well structured.


The link in my previous comment uses tryCatch() and tests for specific text in the error message; this isn't robust (e.g., because error messages are translated into the user's locale) but is OK for user scripts.
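For illustration, a minimal sketch of that pattern follows. It is not the code from the linked post; the helper read.chunk(), the toy file, and the chunk size are hypothetical, and the test assumes the English message text "no lines available in input".

```r
# Toy tab-separated file standing in for the real input (hypothetical data)
path <- tempfile()
writeLines(c("a\t1", "b\t2", "c\t3", "d\t4", "e\t5"), path)
chunk.size <- 2

# Hypothetical helper: NULL at end of input, every other error is re-raised
read.chunk <- function(con, n) {
  tryCatch(
    read.table(con, header = FALSE, stringsAsFactors = FALSE,
               nrows = n, quote = "", sep = "\t"),
    error = function(e) {
      # Assumes English messages; a translated locale needs a different test
      if (grepl("no lines available", conditionMessage(e), fixed = TRUE))
        NULL
      else
        stop(e)
    })
}

con <- file(path, "r")
n.rows <- 0
repeat {
  chunk <- read.chunk(con, chunk.size)
  if (is.null(chunk)) break          # end of file reached
  n.rows <- n.rows + nrow(chunk)     # process the chunk here
}
close(con)
```

Errors that come from a malformed line (for example, a wrong number of fields) do not match the tested text, so they still stop the script instead of being mistaken for end of file.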


Even with readLines, it was not as slow as I thought. This works fine  :)

Thank you very much for your help.

 

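# Accumulate the selected columns from every chunk
apple = c()
fl = file(maf.file, "r")

repeat  {
  # Read the next trunk.size lines; the connection keeps its position
  aster = readLines(fl, trunk.size)
  if (length(aster) == 0)  break    # no lines left

  # Drop comment lines and the header line
  idx.del.line = c(grep("^#", aster), grep("Hugo_Symbol", aster))
  if (length(idx.del.line) > 0)  aster = aster[-idx.del.line]
  if (length(aster) > 0)    {
    # Split each line into fields and keep the columns in idx.col
    aster = strsplit(aster, "\t")
    acorn = c()
    for (j in idx.col)    {
      acorn = cbind(acorn, sapply(aster, function(x)  unlist(x)[j]))
      }
    apple = rbind(apple, acorn)
    }
  }
close(fl)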

I guess the two inner loops (for j and sapply) could be replaced with acorn = t(sapply(aster, `[`, idx.col)) or similar; sapply() returns one column per line when idx.col has more than one element, hence the t().

The 'copy-and-append' pattern apple = c() ... apple = rbind(apple, acorn) is inefficient when there are a large number of iterations. It is better to 'pre-allocate and fill': apple = vector("list", 1000) ... apple[[i]] = acorn, perhaps with do.call(rbind, apple) at the end, and perhaps growing the result every 1000 iterations.
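A rough sketch of 'pre-allocate and fill' applied to the loop above, run on a tiny made-up file. The toy data, the values of trunk.size and idx.col, and the names chunks and fields are assumptions for illustration; matrix(..., byrow = TRUE) is used in place of sapply() so that a single-column idx.col also gives one row per line.

```r
# Toy input standing in for the MAF file (hypothetical data)
maf.file <- tempfile(fileext = ".maf")
writeLines(c("#version 2.4",
             "Hugo_Symbol\tChromosome\tStart_Position",
             "TP53\t17\t7577120",
             "KRAS\t12\t25398284",
             "BRAF\t7\t140453136"), maf.file)
trunk.size <- 2
idx.col <- c(1, 3)

chunks <- vector("list", 1000)       # pre-allocated container
i <- 0
fl <- file(maf.file, "r")
repeat {
  aster <- readLines(fl, trunk.size)
  if (length(aster) == 0) break
  # Drop comment lines and the header line
  aster <- aster[!grepl("^#", aster) & !grepl("Hugo_Symbol", aster)]
  if (length(aster) == 0) next
  fields <- strsplit(aster, "\t", fixed = TRUE)
  # One row per line, one column per selected field
  acorn <- matrix(unlist(lapply(fields, `[`, idx.col)),
                  ncol = length(idx.col), byrow = TRUE)
  i <- i + 1
  # Grow the container in blocks of 1000 rather than one element at a time
  if (i > length(chunks)) length(chunks) <- length(chunks) + 1000
  chunks[[i]] <- acorn
}
close(fl)

# Bind all filled slots once at the end
apple <- do.call(rbind, chunks[seq_len(i)])
```

The single rbind at the end copies the data once, whereas rbind inside the loop copies everything accumulated so far on every iteration.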


That looks much better.

Thank you so much  :)
