Question: How to check that the reads in two fastq files are in the same order
0
2.6 years ago by
Frocha10
Frocha10 wrote:

I have two fastq files which contain paired-end reads. How can I check whether the reads in the two fastq files are in the same order (i.e., the n-th read in file 1 is paired with the n-th read in file 2)?

shortread fastq • 1.1k views
ADD COMMENTlink
modified 2.6 years ago by James W. MacDonald50k • written 2.6 years ago by Frocha10
Answer: How to check that the reads in two fastq files are in the same order
0
2.6 years ago by
United States
James W. MacDonald50k wrote:

You could read each in using readRNAStringSet in Biostrings and then test that the names for each XStringSet are the same.

ADD COMMENTlink written 2.6 years ago by James W. MacDonald50k

I guess they're fastq files so same approach but with ShortRead::readFastq(), if the files are large then iterate using FastqStreamer() and check identity of each chunk.

ADD REPLYlink written 2.6 years ago by Martin Morgan ♦♦ 23k

Thanks to the answers. Now I use the id() function to get two BstringSet objects (each contain the id of reads in a file) and use identical() function to check if they are identical. The results is TRUE, even the two ids of the two paired reads differ by 1 character (1 for the first read and 2 for the second read). What is the explanation for this?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Frocha10

trailing character and readFastq? use something like subseq(id(rfq), 1, width(id(rfq)) - 1L) and then test for identity. In general? Maybe Biostrings::stringDist().

ADD REPLYlink written 2.6 years ago by Martin Morgan ♦♦ 23k

Thanks! Now I use the id() function to get two BstringSet objects (each contain the id of reads in a file) and use identical() function to check if they are identical. The results is TRUE, even the two ids of the two paired reads differ by 1 character (1 for the first read and 2 for the second read, this character is in the middle of the id). What is the explanation for this?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Frocha10

sessionInfo(), especially version of Biostrings? If it's old, you could try as.character() on each of the ids.

ADD REPLYlink written 2.6 years ago by Martin Morgan ♦♦ 23k

the version of Biostrings is 2.42.0.

ADD REPLYlink written 2.6 years ago by Frocha10

Do you have a simple reproducible example, e.g., a single ID that illustrates the problem?

ADD REPLYlink written 2.6 years ago by Martin Morgan ♦♦ 23k

Here it is:

> head(subseq(id(fq.1), 1, width(id(fq.1))),1)
A BStringSet instance of length 1
width seq
[1]    45 SRR372747.1.1 B2GA003:1:2:1046:1028 length=36
> head(subseq(id(fq.2), 1, width(id(fq.2))),1)
A BStringSet instance of length 1
width seq
[1]    45 SRR372747.1.2 B2GA003:1:2:1046:1028 length=36
> identical(subseq(id(fq.1), 1, width(id(fq.1))),subseq(id(fq.2), 1, width(id(fq.2))))
[1] TRUE

ADD REPLYlink written 2.6 years ago by Frocha10

Unwinding, I guess all(id(fq1) == id(fq2) & sread(fq1) == sread(fq2)) ?

I think the problem is that identical() returns TRUE when used on an 'external pointer', and that is where the XStringSet stores the data. I thought that some safeguard for this had been introduced into R or Biostrings, so I'm either mis-remembering or it's a regression.

ADD REPLYlink written 2.6 years ago by Martin Morgan ♦♦ 23k
Please log in to add an answer.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 132 users visited in the last hour