Question

Reading a Large File in Chunks

Haiying.Kong ▴ 110

@haiyingkong-9254

Last seen 5.0 years ago

Germany

I need to read a large text file, about 77 GB in size, with about 50 million lines.

I can use the read.table function and set the skip and nrows parameters to read the file in chunks. But then the lines are counted again from the start of the file every time I read a chunk.

Is there a function that keeps something like a pointer, so that each new chunk starts directly at the right line?

R • 1.2k views

ADD COMMENT • link updated 7.0 years ago by Martin Morgan 25k • written 7.0 years ago by Haiying.Kong ▴ 110

Martin Morgan · Answer 1 · 2017-05-11

Martin Morgan 25k

@martin-morgan-1513

Last seen 12 days ago

United States

I happened to be in a package directory

> dir()
 [1] "DESCRIPTION"      "inst"             "LICENSE"          "man"             
 [5] "NAMESPACE"        "NEWS"             "R"                "Rsamtools.mk.win"
 [9] "src"              "tests"            "vignettes"

I opened a connection

> fl = file("DESCRIPTION", "r")

and read the file in chunks (these would be millions of rows for a real use case)

> readLines(fl, 3)
[1] "Package: Rsamtools"                                                             
[2] "Type: Package"                                                                  
[3] "Title: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import"
> readLines(fl, 3)
[1] "Version: 1.29.0"                                                              
[2] "Author: Martin Morgan, Herv\\'e Pag\\`es, Valerie Obenchain, Nathaniel Hayden"
[3] "Maintainer: Bioconductor Package Maintainer"

and then closed it

> close(fl)

Connections can be used in many places, e.g., read.csv() and probably rtracklayer::import(). GenomicFiles::reduceByYield() and friends enable this for standard bioinformatics files; see the vignette.
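
As a minimal sketch of the same idea with read.table() instead of readLines(): an open connection keeps its read position, so each call continues where the previous one stopped. The demo file, chunk_size, and the chunk1 to chunk3 names are made up for illustration and are not from the original post.

```r
# Hypothetical demo file: 10 tab-separated rows, 2 columns, no header.
demo_file <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:10, value = letters[1:10]),
            demo_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

chunk_size <- 4               # would be millions of rows in a real use case
con <- file(demo_file, "r")   # open once; the position is kept between calls

# Each call picks up where the previous one stopped; nothing is re-counted.
chunk1 <- read.table(con, nrows = chunk_size, sep = "\t", quote = "",
                     stringsAsFactors = FALSE)
chunk2 <- read.table(con, nrows = chunk_size, sep = "\t", quote = "",
                     stringsAsFactors = FALSE)
# nrows is an upper bound: only the 2 remaining rows are returned here.
chunk3 <- read.table(con, nrows = chunk_size, sep = "\t", quote = "",
                     stringsAsFactors = FALSE)
close(con)
```

Note that one more read.table() call on the exhausted connection would signal an error rather than return an empty result, so a real loop needs a stopping rule.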

ADD COMMENT • link 7.0 years ago Martin Morgan 25k

Thank you so much for your prompt help :)

I was about to switch to another task, because I thought it would take at least a day before anyone replied with helpful information.

ADD REPLY • link 7.0 years ago Haiying.Kong ▴ 110

*********************************************************************************************************************

My mistake. read.table takes care of the problem.

nrows: integer: the maximum number of rows to read in.

*********************************************************************************************************************

It seems I can only use readLines, even though it has the disadvantages of lacking a comment.char argument and returning each row as a single collapsed string, because read.table gives an error message when no lines are left.

*********************************************************************************************************************

What would be the easiest way to count the number of remaining rows?

fl = file(maf.file, "r")
repeat {
  aster = read.table(fl, header=FALSE, stringsAsFactors=FALSE, nrows=trunk.size, quote="", sep="\t")
}
close(fl)

nrows has to be min(n.remaining.rows, trunk.size)

ADD REPLY • link 7.0 years ago Haiying.Kong ▴ 110

Do something like

aster = read.table(...)
if (nrow(aster) == 0)
    break

This is developed further here.

ADD REPLY • link 7.0 years ago Martin Morgan 25k

If there are no more lines left, I get an error message.

I could use try(read.table(...)), but there is a risk that I would mistake an error caused by the file format for the error raised when no lines are left.

The file is a MAF file, so it is not very well structured.

ADD REPLY • link 7.0 years ago Haiying.Kong ▴ 110

The link in my previous comment uses tryCatch() and tests for specific text in the error message; this isn't robust (e.g., because the error messages are translated into the users' locale) but is ok for user scripts.
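
A rough sketch of that approach, not the code from the link itself: read_chunk() is a hypothetical helper, and the matched text "no lines available in input" is the message I would expect from read.table() in an English locale when col.names is not supplied. Any other error is re-raised, so a malformed line is not mistaken for the end of the file.

```r
Sys.setenv(LANGUAGE = "en")  # request English messages so the match below applies

# Hypothetical helper: NULL at end of input, otherwise a data frame.
read_chunk <- function(con, n, ...) {
  tryCatch(
    read.table(con, nrows = n, ...),
    error = function(e) {
      # Locale-dependent text match; fine for a user script, not for a package.
      if (identical(conditionMessage(e), "no lines available in input"))
        NULL
      else
        stop(e)  # any other error (e.g. a malformed line) is passed on
    }
  )
}

# Hypothetical demo file with 5 tab-separated rows.
demo_file <- tempfile()
writeLines(paste(1:5, letters[1:5], sep = "\t"), demo_file)

con <- file(demo_file, "r")
n_rows <- 0
repeat {
  chunk <- read_chunk(con, 2, sep = "\t", quote = "", stringsAsFactors = FALSE)
  if (is.null(chunk)) break        # nothing left to read
  n_rows <- n_rows + nrow(chunk)   # process the chunk here
}
close(con)
```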

ADD REPLY • link 7.0 years ago Martin Morgan 25k

Even with readLines, it was not as slow as I expected. This works fine :)

Thank you very much for your help.

apple = c()                      # accumulated result, one row per data line
fl = file(maf.file, "r")

repeat {
  aster = readLines(fl, trunk.size)
  if (length(aster) == 0) break  # end of file

  # drop comment lines and header lines
  idx.del.line = c(grep("^#", aster), grep("Hugo_Symbol", aster))
  if (length(idx.del.line) > 0) aster = aster[-idx.del.line]
  if (length(aster) > 0) {
    aster = strsplit(aster, "\t")
    acorn = c()
    # extract the selected columns (idx.col), one column at a time
    for (j in idx.col) {
      acorn = cbind(acorn, sapply(aster, function(x) unlist(x)[j]))
    }
    apple = rbind(apple, acorn)
  }
}
close(fl)

ADD REPLY • link updated 7.0 years ago by Martin Morgan 25k • written 7.0 years ago by Haiying.Kong ▴ 110

I guess the two inner loops (for j and sapply) could be replaced with acorn = sapply(aster, `[`, idx.col) or similar. The 'copy-and-append' pattern apple = c() ... apple = rbind(apple, acorn) is inefficient when there are a large number of iterations. It is better to 'pre-allocate and fill': apple = vector("list", 1000) ... apple[[i]] = acorn, perhaps growing the result every 1000 iterations, with do.call(rbind, apple) at the end.
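
A sketch of how 'pre-allocate and fill' might look for the loop above. The demo MAF-like file, idx.col, and trunk.size values are made up for illustration, and the name chunks for the pre-allocated list is mine. Note that extracting with `[` gives one column per line, so the result is transposed to get one row per line; vapply() is used here in place of sapply() so the result is always a character matrix.

```r
# Hypothetical stand-ins for the variables used in the thread.
maf.file <- tempfile()
writeLines(c("#version 2.4",
             paste("Hugo_Symbol", "Chromosome", "Start", "End", sep = "\t"),
             paste(paste0("GENE", 1:7), "chr1", 101:107, 201:207, sep = "\t")),
           maf.file)
idx.col <- c(1, 3)    # columns to keep
trunk.size <- 3       # lines per chunk

chunks <- vector("list", 1000)   # pre-allocated list of per-chunk results
i <- 0
fl <- file(maf.file, "r")
repeat {
  aster <- readLines(fl, trunk.size)
  if (length(aster) == 0) break
  # drop comment lines and the header line
  aster <- aster[!grepl("^#", aster) & !grepl("Hugo_Symbol", aster)]
  if (length(aster) == 0) next
  fields <- strsplit(aster, "\t")
  # one row per line, one column per selected field
  acorn <- t(vapply(fields, `[`, character(length(idx.col)), idx.col))
  i <- i + 1
  if (i > length(chunks)) length(chunks) <- 2 * length(chunks)  # grow when full
  chunks[[i]] <- acorn
}
close(fl)

# bind once at the end instead of on every iteration
apple <- do.call(rbind, chunks[seq_len(i)])
```

Doubling the list when it fills up is one possible growth rule; growing by a fixed 1000 slots, as suggested above, would work as well.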

ADD REPLY • link 7.0 years ago Martin Morgan 25k

That looks much better.

Thank you so much :)

ADD REPLY • link 7.0 years ago Haiying.Kong ▴ 110