merge fastq.gz in R? (or in Windows?)
1
0
Entering edit mode
@mkapplebee-8280
Last seen 7.5 years ago
United States

I have RNAseq libraries of 28 samples, and I got 2-4 fastq.gz files from each sample. I am just trying to figure out how to merge them.  I have already done what I feels is a due-diligence search of the bioconductor tutorials and google, and so far it appears that there is no way in r + windows to easily merge multiple fastq.gz files from the same sample into one fastq file containing all of the reads for that sample.  I don't think there should be any quality differences between the reads in these files because these are not technical replicates, the sequencing center just seems to format their data output to start a new fastq file whenever the previous one reaches 603.5 MB.  But I need to merge the ones pertaining to a single sample because all of the tools for downstream work all seem to assume that 1 fastq.gz file = 1 sample.  Is this true?

I suppose I could map each file individually to my reference genome, and then add the mapped counts from the fasta.gz files from the same sample together.  Is this what I am to infer I should do?  Or should I spend a day installing a linux virtual box simply so I can use its cat command?

rnaseq bioconductor fastq • 8.8k views
ADD COMMENT
3
Entering edit mode
@martin-morgan-1513
Last seen 6 weeks ago
United States

I think you can use ShortRead::writeFastq() with argument mode="a" to append to an existing file. Is that what you mean by 'merge'? Maybe the idea would be (completely untested!)

fout = "combined.fastq"
fls = dir(pattern="fastq.gz")
for (fl in fls) {
    fq = readFastq(fl)
    writeFastq(fq, fout, mode="a", compress=TRUE)
}

You could use FastqStreamer() and yield() to avoid the need to read the entire fastq file into memory.

ADD COMMENT
0
Entering edit mode

Thank you!  This worked great.  I looked at the ShortRead package, but I only saw that it was used for sampling.  Much appreciated!

ADD REPLY
0
Entering edit mode

I would like to apply this function (it works for me!). However, I have ~20 folders inside a directory to apply this function. Anybody can indicate me how to do apply this function to all my folders? Thanks!!

ADD REPLY
0
Entering edit mode

Do you mean 20 directories, and in each directory you want to concatenate all fastq files? Write a function for one directory

concat_directory <- function(directory_path) {
    fout = file.path(directory_path, "combined.fastq"
    fls = dir(directory_path, pattern="fastq.gz", full.names = TRUE)
    for (fl in fls) {
        fq = readFastq(fl)
        writeFastq(fq, fout, mode="a", compress=TRUE)
    }
    fout
}

Make sure that it works on one directory concat_directory(<PATH TO ONE DIRECTORY>) where you'll have to substitute the desired path in your system for <PATH TO ONE DIRECTORY>.

Then get a vector of all directories. Are they nested like main_dir/dir1, main_dir/dir2? Maybe

paths <- dir("main_dir", full.names = TRUE)

and apply concat_directory to each path

lapply(paths, concat_directory)
ADD REPLY
0
Entering edit mode

Thank you so much for your help! The function worked for one directory when I tested it, (closing the parenthesis for the "fout").

When I tried for one directory it worked as expected. However, when I apply the lapply function I have this error message:

Error: Input/Output no input files found dirPath: Sample1_S18_L001_R1_001.fastq.gz pattern: character(0)

Any idea how to solve it?

ADD REPLY
0
Entering edit mode

Maybe you could provide a sketch of what your directory structure looks like. I'm thinking it is something like

main_dir
-- dir1
---- file1.fastq
---- file2.fastq
-- dir2
---- file3.fastq
---- file4.fastq
...
ADD REPLY
0
Entering edit mode

Yes, my directory structure is exactly as you show

ADD REPLY
0
Entering edit mode

I think that I made a small mistake in the function, and should have added a 'full.names = TRUE' to the dir() function

concat_directory <- function(directory_path) {
    fout = file.path(directory_path, "combined.fastq"
    fls = dir(directory_path, pattern="fastq.gz", full.names = TRUE)
    for (fl in fls) {
        fq = readFastq(fl)
        writeFastq(fq, fout, mode="a", compress=TRUE)
    }
    fout
}
ADD REPLY
0
Entering edit mode

Great!! Now it worked! Thank you very much!

ADD REPLY

Login before adding your answer.

Traffic: 634 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6