I need to read a large text file, about 77 GB, with roughly 50 million lines.
I can use the read.table function and set the skip and nrows parameters to read the file in chunks. But then the skipped lines are counted again from the start of the file every time I read a chunk.
Is there a function that works like a file pointer, so that each new chunk starts directly at the right line?
Thank you so much for your most prompt help :)
I was about to switch to other work, because I thought it could take at least a day before anyone replied with helpful information.
*********************************************************************************************************************
My mistake. read.table takes care of the problem.
nrows:
integer: the maximum number of rows to read in.
*********************************************************************************************************************
It seems I can only use readLines, even though it has the disadvantages of not supporting comment.char and of returning each row as a single collapsed string, because read.table gives an error message when no lines are left.
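One way to soften both disadvantages of readLines is to read the raw lines in chunks and hand each chunk to read.table through its text argument. The following is only a minimal sketch: the helper name read_in_chunks, the handle callback, and the assumption that comment lines start with "#" are illustrative and not taken from the thread.

```r
# Hypothetical helper: read a tab-separated file chunk.size lines at a time.
# `handle` is a user-supplied function called once per parsed chunk.
read_in_chunks <- function(path, chunk.size, handle) {
    con <- file(path, "r")
    on.exit(close(con))
    n.rows <- 0L
    repeat {
        lines <- readLines(con, n = chunk.size)
        if (length(lines) == 0L) break        # readLines returns character(0) at end of file
        # Assumption: comment lines start with "#"; drop them and blank lines by hand
        keep <- nzchar(lines) & !startsWith(lines, "#")
        if (!any(keep)) next                  # nothing to parse in this chunk
        chunk <- read.table(text = lines[keep], header = FALSE,
                            stringsAsFactors = FALSE, quote = "",
                            sep = "\t", comment.char = "")
        handle(chunk)
        n.rows <- n.rows + nrow(chunk)
    }
    n.rows                                    # total number of data rows parsed
}
```

Because the connection stays open, each readLines call continues where the previous one stopped, and the end of the file shows up as an empty result rather than as an error.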
*********************************************************************************************************************
What would be the easiest way of counting the number of remaining rows?
fl = file(maf.file, "r")
repeat {
    aster = read.table(fl, header=FALSE, stringsAsFactors=FALSE, nrows=trunk.size, quote="", sep="\t")
}
close(fl)
nrows has to be min(n.remaining.rows, trunk.size)
Do something like
developed more here.
If there are no more lines left, I get an error message.
I could use try(read.table....), but there is a risk that I would mistake errors caused by the file format for the error that signals no more lines.
The file is a maf file, so it is not very well structured.
The link in my previous comment uses tryCatch() and tests for specific text in the error message; this isn't robust (e.g., because the error messages are translated into the user's locale) but is ok for user scripts.
Even when I use readLines, it is not as slow as I thought. This works fine :)
Thank you very much for your help.
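For reference, the tryCatch() idea mentioned above might look roughly like the sketch below. The helper name read_chunk is hypothetical, and the matched text "no lines available in input" is the English message I believe read.table gives on an exhausted connection, so treat it as an assumption to verify on your R version. Any other error is re-raised, so format problems are not silently treated as end of file.

```r
# Assumption: force English messages so the text match below works;
# this may not take effect on every platform.
Sys.setenv(LANGUAGE = "en")

# Hypothetical helper: returns NULL at end of input, re-raises any other error.
read_chunk <- function(con, chunk.size) {
    tryCatch(
        read.table(con, header = FALSE, stringsAsFactors = FALSE,
                   nrows = chunk.size, quote = "", sep = "\t"),
        error = function(e) {
            if (grepl("no lines available in input", conditionMessage(e), fixed = TRUE))
                NULL            # assumed end-of-input message
            else
                stop(e)         # a genuine problem, e.g. a malformed line
        })
}

# Small demonstration on a temporary five-line file
tmp <- tempfile()
writeLines(paste(letters[1:5], 1:5, sep = "\t"), tmp)
fl <- file(tmp, "r")
n.rows <- 0L
repeat {
    aster <- read_chunk(fl, 2L)
    if (is.null(aster)) break   # no more lines left
    n.rows <- n.rows + nrow(aster)
}
close(fl)
unlink(tmp)
```

With this pattern nrows can stay at the chunk size: a final chunk that is shorter than nrows is returned as is, and only a read with nothing left raises the error.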
I guess the two inner loops (for j and sapply) could be replaced with
acorn = sapply(aster, `[`, idx.col)
or similar. The 'copy-and-append' pattern
apple = c() ... apple = rbind(apple, acorn)
is inefficient when there are a large number of iterations; better to 'pre-allocate and fill',
apple = vector("list", 1000) ... apple[[i]] = acorn
perhaps with
do.call(rbind, apple)
at the end, and perhaps growing the result every 1000 iterations.
That looks much better.
Thank you so much :)
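Put together, the 'pre-allocate and fill' suggestion above could be sketched as follows. The loop body is a stand-in: the generated lines, the value of idx.col, and the names pieces and n.filled are made up for illustration, and the real code would take aster from the chunk that was just read.

```r
idx.col <- 2L                          # hypothetical column of interest
pieces <- vector("list", 1000)         # pre-allocated list of per-chunk results
n.filled <- 0L

for (k in 1:2500) {
    # Stand-in for one chunk: two tab-separated lines split into fields
    ids <- c(2L * k - 1L, 2L * k)
    aster <- strsplit(sprintf("gene%d\t%d", ids, ids * 10L), "\t", fixed = TRUE)
    acorn <- sapply(aster, `[`, idx.col)   # pick field idx.col from every line

    n.filled <- n.filled + 1L
    if (n.filled > length(pieces))
        length(pieces) <- length(pieces) + 1000   # grow in blocks, padded with NULL
    pieces[[n.filled]] <- acorn
}

# Bind once at the end instead of calling rbind inside the loop
apple <- do.call(rbind, pieces[seq_len(n.filled)])
```

Here apple ends up as a character matrix with one row per iteration. Binding only the filled slots at the end avoids copying the growing result on every pass.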