Question

Importing a .txt file with multiple headers into R

a.afshinfard ▴ 10

@aafshinfard-7617

Last seen 8.9 years ago

Iran, Islamic Republic Of

Hi everyone,

I have a massive report from MUMmer covering multiple sequences. Here are the first lines of the file, to show the common format:

> 1  Len = 354
  203757         1         1        35
  122132         1         1        87
  203756         1         1       354
  1              1         1       354
  42364         12         1        89
  203757        37        37        91
> 1 Reverse  Len = 354
> 2  Len = 127
  203754         1         1       127
  2              1         1       127
  122133         1        19        80
  203753         1        19       109

a bigger example : http://m.uploadedit.com/ba3c/1429271308686.txt

All I want to do is import this report into R. The problem is that the report has multiple headers, as you can see, so I can't use read.table(), which only supports files with a single header.

I should mention that the headers are informative (the first number in each header identifies the query sequence), and I don't want to read the whole file as a string and write a parser to extract the data.

Some of the tables are empty (like "1 Reverse" here), but other "Reverse" tables may contain records.

Is there a common solution?

Thanks

input files read.table maxmatch Mummer • 6.7k views

updated 9.0 years ago by Malcolm Cook ★ 1.6k • written 9.0 years ago by a.afshinfard ▴ 10

score 2 · Accepted Answer · 2015-04-17

I'm not sure whether this is disqualified by your desire not to read the whole file as a string and write a parser. I read the data in

lns = readLines("http://m.uploadedit.com/ba3c/1429271308686.txt")

Then found all the 'header' lines

idx = grepl("^>", lns)

I removed the header lines and input the remainder into a data.frame

df = read.table(text=lns[!idx])

Then added a column to the data frame telling me the header line that the row came from. To do this I had to figure out how many times the header line needed to be replicated

wd = diff(c(which(idx), length(idx) + 1)) - 1
df$label = rep(lns[idx], wd)
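
To make the replication step concrete, here is a minimal sketch that runs the same steps on the sample lines from the question instead of the URL. The only changes are that the lines are typed in by hand and the header pattern is anchored to the start of the line; everything else follows the answer.

```r
# Sample lines copied from the question (headers start with '>')
lns <- c(
  "> 1  Len = 354",
  "  203757         1         1        35",
  "  122132         1         1        87",
  "  203756         1         1       354",
  "  1              1         1       354",
  "  42364         12         1        89",
  "  203757        37        37        91",
  "> 1 Reverse  Len = 354",
  "> 2  Len = 127",
  "  203754         1         1       127",
  "  2              1         1       127",
  "  122133         1        19        80",
  "  203753         1        19       109"
)

# TRUE for header lines, FALSE for data lines
idx <- grepl("^>", lns)

# Parse only the data lines
df <- read.table(text = lns[!idx])

# Headers sit at lines 1, 8 and 9; appending length + 1 (14) as a sentinel
# and differencing gives the gap to the next header, minus 1 for the header itself
wd <- diff(c(which(idx), length(idx) + 1)) - 1
print(wd)  # 6 0 4: the empty "1 Reverse" table gets a count of 0

# rep() accepts a count of 0, so empty tables simply contribute no rows
df$label <- rep(lns[idx], wd)
print(df)
```

Because the empty table is replicated zero times, it needs no special handling.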

I'm not sure what a massive file looks like, but the above is probably good enough for anything that'll be convenient to manipulate in some down-stream way. Hope that helps!

score 2 · Accepted Answer · 2015-04-18

This is a perfect time to dust off your Perl one-liners to pre-process the input so it can be read in a tidy fashion with read.csv.

The R pipe function is helpful here.

Assuming you have curl installed to process your example data:

l<-read.csv(pipe("curl -s http://m.uploadedit.com/ba3c/1429271308686.txt | perl -lane 'BEGIN{$,=qq{,}}; unless(m/^> (?<id>\\d+) (?<strand>.)/) {%v=%+; $v{strand} =~ y/R /-+/; print($v{id},$v{strand}, @F)}'") ,col.names=c('QueryID','Strand','i','j','k','l'))

l

This works quite nicely: it recodes Reverse into a standard(ish) +/- strand. Note that read.csv defaults to header=TRUE, so the first record (203757 1 1 35) is consumed as a header line and is missing from the output; add header=FALSE to the call to keep it. The call as written produces output beginning:

   QueryID Strand      i   j   k   l
1        1      + 122132   1   1  87
2        1      + 203756   1   1 354
3        1      +      1   1   1 354
4        1      +  42364  12   1  89
5        1      + 203757  37  37  91
6        1      + 122132  90  90  38
7        1      +  42364 102  91  37
8        1      + 203757 129 129 168
9        1      +  42364 140 129 212
10       1      + 122132 129 129 212
11       1      + 203757 298 298  43
12       2      + 203754   1   1 127
13       2      +      2   1   1 127
14       2      + 122133   1  19  80
15       2      +      3   1  19 109
16       2      + 203758   1  19 109
17       2      + 203753   1  19 109
18       2      +  42363   1  19  30
19       2      +  42363  32  50  78
20       2      + 203755   1  52  52
21       2      +      4   1  52  52
22       2      + 122133  82 100  28
23       3      + 122133   1   1  80
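
If perl or curl is not available, the same column layout can be sketched in base R. This is not the code from either answer: parse_mummer is a hypothetical helper name, and it assumes that the first line of the input is a header, that at least one data line exists, and that a header containing the word "Reverse" marks the minus strand.

```r
# Hypothetical helper (not from the answers above): returns the same columns
# as the perl pipeline (QueryID, Strand, i, j, k, l) using base R only.
# Assumes the first line is a '>' header and at least one data line exists.
parse_mummer <- function(lns) {
  is_header <- grepl("^>", lns)
  headers   <- lns[is_header]

  # cumsum() numbers the headers 1, 2, 3, ...; each data line inherits the
  # number of the most recent header above it
  grp <- cumsum(is_header)[!is_header]

  hits <- read.table(text = lns[!is_header],
                     col.names = c("i", "j", "k", "l"))

  # First number in the header is the query ID; "Reverse" marks the - strand
  id     <- as.integer(sub("^> *([0-9]+).*$", "\\1", headers))
  strand <- ifelse(grepl("Reverse", headers, fixed = TRUE), "-", "+")

  data.frame(QueryID = id[grp], Strand = strand[grp], hits,
             stringsAsFactors = FALSE)
}

# Small example built from the question's format; the "3 Reverse" table with
# a record is invented here purely to exercise the minus-strand branch
sample_lns <- c(
  "> 1  Len = 354",
  "  203757         1         1        35",
  "  122132         1         1        87",
  "> 1 Reverse  Len = 354",
  "> 2  Len = 127",
  "  203754         1         1       127",
  "> 3 Reverse  Len = 50",
  "  99             1         1        50"
)
res <- parse_mummer(sample_lns)
print(res)
```

For the real file, something like parse_mummer(readLines(path)) should work, where path is wherever the report is stored. Indexing by cumsum() means empty tables drop out naturally, since no data line points at their header.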