need help manipulating fastq file
1
1
Entering edit mode
KB ▴ 50
@k-8495
Last seen 14 months ago
United States

Hello,

I am looking to read every sequence of a fastq file, insert a specific set of nucleotides at the beginning of each sequence. Then save that file into a new fastq file.

I am using the shortRead package for that, but am not sure how to overwrite the new sequence. This is what I have so far.

library("ShortRead")
input_1 <- readFastq("Sample_1.fastq") #input forward read

output_1 <- input_1 #making a copy 
insertNuc <- "TTTTT"  #to be inserted into sequence
iCount = 1
while(iCount <= length(input_1)) {
    seq1 <- sread(input_1)[iCount]  #read sequence one by one
    seq1New <- paste(insertNuc,seq1,sep="")  #insert nucleotides
    sread(output_1)[iCount] <- seq1New  # write back new sequence to fastq file  ????

    iCount <- iCount + 1
}
# write fastq
writeFastq(output_1, destination, "output_1") 

I need help with

1)  write back new sequence to fastq file in same order (basically I need to replace old sequence with new sequence)

2)  I am using loops to read through entire fastq file. If anyone knows a faster way to do this, with apply() or anything other packages , I would appreciate any tips on this.

Thanks, K

 

fastq • 2.7k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 6 weeks ago
United States

Don't read the records one-by-one, but in 'chunks' of say 1000000. Do this using ?FastqStreamer; the example on this help page provides some illustration of how to do this using repeat {}; see also?GenomicFiles::reduceByYield.

For example, this sets up a stream to read 100 records at a time (as mentioned, work on much larger chunks), and then extracts the first 100 records

example(FastqStreamer)         # defines 'fl'
strm = FastqStreamer(fl, 100)  # much larger chunk, e.g,. 1000000
chunk = yield(strm)

Manipulate the components of the resulting object, maybe using ?replaceAt, and then re-create the object

sr = sread(chunk); sr = replaceAt(sr, 1, "TTTT")
qual = quality(quality(chunk)); qual = replaceAt(qual, 1, "    ")
fq = ShortReadQ(sr, do.call(class(quality(chunk)), list(qual)), id(chunk))

And write the result

writeFastq(fq, "updated.fastq", mode="a")

A complete example is

library(ShortRead)
library(GenomicFiles)
example(FastqStreamer)  # for the path 'fl' to a fastq file

YIELD = yield
MAP = function(x, fout) {
    sr = sread(x); sr = replaceAt(sr, 1, "TTTT")
    qual = quality(quality(x)); qual = replaceAt(qual, 1, "    ")
    x <- ShortReadQ(sr, do.call(class(quality(x)), list(qual)), id(x))
    writeFastq(x, fout, "a")
    length(x)
}
fout = tempfile()
reduceByYield(FastqStreamer(fl), YIELD, MAP, fout=fout)

With 

> readFastq(fl)
class: ShortReadQ
length: 256 reads; width: 36 cycles
> readFastq(fout)
class: ShortReadQ
length: 256 reads; width: 40 cycles
ADD COMMENT
0
Entering edit mode

Wow...incredible !

Thank you every much , I will try this out. 

Is there any specific reason you replaced the quality of the sequence in addition to the sequence ? I'm guessing its just an example. ? 

Thanks !!

ADD REPLY
0
Entering edit mode

Fastq files require that the width of the sequence is the same as the width of the quality string, so the quality strings have to be transformed with the same geometry as the sequences.

ADD REPLY
0
Entering edit mode

Didn't know that, thank you for the explanation ,

ADD REPLY
0
Entering edit mode

Hello,

So I tried out the example and it works perfectly. But when I try out this exact same code on my input fastq file, this is the error I get when I try : readFastq(fout)

Error: Input/Output
  no input files found
  dirPath: C:\Users\kb472\AppData\Local\Temp\RtmpIZXEIX\file2d10696330ea
  pattern: character(0)

Because its something to do with the line "fout = tempfile()", I tried this:

fout = "Abc.fastq"

The code now runs fine, but there is no output into "Abc.fastq" file. Any idea why this is happening ? 

Thank you,

ADD REPLY
0
Entering edit mode

Do you mean that the example runs successfully _and_ there is an output file, but your actual data file does not produce an output file? The code to reduceByYield returns the number of records processed; does that number correspond to the number of records in the input file?

ADD REPLY
0
Entering edit mode
When the example runs, it prints the correct number of records. There is a temp file created, but it has gibberish in it when I open it. But I am able to retrieve the fout object and save it into a fastq file. On my data, when the function runs, it doesn't print number of records. It instead says "list()". When I try to retrieve the " fout " object, or the temp file, I am unable to. There is no output file. Thanks, K
ADD REPLY
0
Entering edit mode

Likely 'gibberish' is because the file is compressed. See ?writeFastq and add 'compress=FALSE' to writeFastq(x, fout, "a", compress=FALSE).

'list()` means that no records were processed. Once a stream is 'used', then subsequent calls to yield() produce no output. I revised my response to include a more fool-proof way of always using a fresh stream with reduceByYield(FastqStreamer(fl), YIELD, MAP, fout=fout). If it still indicates 'list()', then I guess yield(FastqStreamer(fl)) produces a ShortReadQ object with 0 records? If so, I don't know what the problem is; maybe you could try with a subset of your data, tmp = tempfile(); writeLines(readLines(fl, 1000), tmp); fout=tempfile(); reduceByYield(FastqStreamer(tmp), YIELD, MAP).

ADD REPLY
0
Entering edit mode

Thank you, that worked.

Also closing the file connection  also helped.

X <- FastqStreamer(fl)
close(X)
ADD REPLY

Login before adding your answer.

Traffic: 826 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6