Question

merge fastq.gz in R? (or in Windows?)

0

Entering edit mode

mk.applebee • 0

@mkapplebee-8280

Last seen 9.4 years ago

United States

I have RNAseq libraries of 28 samples, and I got 2-4 fastq.gz files from each sample. I am just trying to figure out how to merge them. I have already done what I feels is a due-diligence search of the bioconductor tutorials and google, and so far it appears that there is no way in r + windows to easily merge multiple fastq.gz files from the same sample into one fastq file containing all of the reads for that sample. I don't think there should be any quality differences between the reads in these files because these are not technical replicates, the sequencing center just seems to format their data output to start a new fastq file whenever the previous one reaches 603.5 MB. But I need to merge the ones pertaining to a single sample because all of the tools for downstream work all seem to assume that 1 fastq.gz file = 1 sample. Is this true?

I suppose I could map each file individually to my reference genome, and then add the mapped counts from the fasta.gz files from the same sample together. Is this what I am to infer I should do? Or should I spend a day installing a linux virtual box simply so I can use its cat command?

rnaseq bioconductor fastq • 11k views

ADD COMMENT • link updated 5.2 years ago by Rafael • 0 • written 10.6 years ago by mk.applebee • 0

score 3 · Accepted Answer · 2015-07-17

3

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 18 days ago

United States

I think you can use ShortRead::writeFastq() with argument mode="a" to append to an existing file. Is that what you mean by 'merge'? Maybe the idea would be (completely untested!)

fout = "combined.fastq"
fls = dir(pattern="fastq.gz")
for (fl in fls) {
    fq = readFastq(fl)
    writeFastq(fq, fout, mode="a", compress=TRUE)
}

You could use FastqStreamer() and yield() to avoid the need to read the entire fastq file into memory.

ADD COMMENT • link 10.6 years ago Martin Morgan 25k

0

Entering edit mode

Thank you! This worked great. I looked at the ShortRead package, but I only saw that it was used for sampling. Much appreciated!

ADD REPLY • link 10.6 years ago mk.applebee • 0

0

Entering edit mode

I would like to apply this function (it works for me!). However, I have ~20 folders inside a directory to apply this function. Anybody can indicate me how to do apply this function to all my folders? Thanks!!

ADD REPLY • link 5.2 years ago Rafael • 0

0

Entering edit mode

Do you mean 20 directories, and in each directory you want to concatenate all fastq files? Write a function for one directory

concat_directory <- function(directory_path) {
    fout = file.path(directory_path, "combined.fastq"
    fls = dir(directory_path, pattern="fastq.gz", full.names = TRUE)
    for (fl in fls) {
        fq = readFastq(fl)
        writeFastq(fq, fout, mode="a", compress=TRUE)
    }
    fout
}

Make sure that it works on one directory concat_directory(<PATH TO ONE DIRECTORY>) where you'll have to substitute the desired path in your system for <PATH TO ONE DIRECTORY>.

Then get a vector of all directories. Are they nested like main_dir/dir1, main_dir/dir2? Maybe

paths <- dir("main_dir", full.names = TRUE)

and apply concat_directory to each path

lapply(paths, concat_directory)

ADD REPLY • link 5.2 years ago Martin Morgan 25k

0

Entering edit mode

Thank you so much for your help! The function worked for one directory when I tested it, (closing the parenthesis for the "fout").

When I tried for one directory it worked as expected. However, when I apply the lapply function I have this error message:

Error: Input/Output no input files found dirPath: Sample1_S18_L001_R1_001.fastq.gz pattern: character(0)

Any idea how to solve it?

ADD REPLY • link 5.2 years ago Rafael • 0

0

Entering edit mode

Maybe you could provide a sketch of what your directory structure looks like. I'm thinking it is something like

main_dir
-- dir1
---- file1.fastq
---- file2.fastq
-- dir2
---- file3.fastq
---- file4.fastq
...

ADD REPLY • link 5.2 years ago Martin Morgan 25k

0

Entering edit mode

Yes, my directory structure is exactly as you show

ADD REPLY • link 5.2 years ago Rafael • 0

0

Entering edit mode

I think that I made a small mistake in the function, and should have added a 'full.names = TRUE' to the dir() function

concat_directory <- function(directory_path) {
    fout = file.path(directory_path, "combined.fastq"
    fls = dir(directory_path, pattern="fastq.gz", full.names = TRUE)
    for (fl in fls) {
        fq = readFastq(fl)
        writeFastq(fq, fout, mode="a", compress=TRUE)
    }
    fout
}