Question

How to check that the reads in two fastq files are in the same order

1

Entering edit mode

Frocha ▴ 20

@frocha-12039

Last seen 5.9 years ago

I have two fastq files which contain paired-end reads. How can I check whether the reads in the two fastq files are in the same order (i.e., the n-th read in file 1 is paired with the n-th read in file 2)?

shortread fastq • 4.4k views

ADD COMMENT • link updated 7.3 years ago by James W. MacDonald 65k • written 7.3 years ago by Frocha ▴ 20

score 1 · Answer 1 · 2016-12-16

1

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 4 hours ago

United States

You could read each in using readRNAStringSet in Biostrings and then test that the names for each XStringSet are the same.

ADD COMMENT • link 7.3 years ago James W. MacDonald 65k

1

Entering edit mode

Unwinding, I guess all(id(fq1) == id(fq2) & sread(fq1) == sread(fq2)) ?

I think the problem is that identical() returns TRUE when used on an 'external pointer', and that is where the XStringSet stores the data. I thought that some safeguard for this had been introduced into R or Biostrings, so I'm either mis-remembering or it's a regression.

ADD REPLY • link 7.3 years ago Martin Morgan 25k

0

Entering edit mode

I guess they're fastq files so same approach but with ShortRead::readFastq(), if the files are large then iterate using FastqStreamer() and check identity of each chunk.

ADD REPLY • link 7.3 years ago Martin Morgan 25k

0

Entering edit mode

Thanks to the answers. Now I use the id() function to get two BstringSet objects (each contain the id of reads in a file) and use identical() function to check if they are identical. The results is TRUE, even the two ids of the two paired reads differ by 1 character (1 for the first read and 2 for the second read). What is the explanation for this?

ADD REPLY • link 7.3 years ago Frocha ▴ 20

0

Entering edit mode

trailing character and readFastq? use something like subseq(id(rfq), 1, width(id(rfq)) - 1L) and then test for identity. In general? Maybe Biostrings::stringDist().

ADD REPLY • link 7.3 years ago Martin Morgan 25k

0

Entering edit mode

Thanks! Now I use the id() function to get two BstringSet objects (each contain the id of reads in a file) and use identical() function to check if they are identical. The results is TRUE, even the two ids of the two paired reads differ by 1 character (1 for the first read and 2 for the second read, this character is in the middle of the id). What is the explanation for this?

ADD REPLY • link 7.3 years ago Frocha ▴ 20

0

Entering edit mode

sessionInfo(), especially version of Biostrings? If it's old, you could try as.character() on each of the ids.

ADD REPLY • link 7.3 years ago Martin Morgan 25k

0

Entering edit mode

the version of Biostrings is 2.42.0.

ADD REPLY • link 7.3 years ago Frocha ▴ 20

0

Entering edit mode

Do you have a simple reproducible example, e.g., a single ID that illustrates the problem?

ADD REPLY • link 7.3 years ago Martin Morgan 25k

0

Entering edit mode

Here it is:

> head(subseq(id(fq.1), 1, width(id(fq.1))),1)
A BStringSet instance of length 1
    width seq
[1]    45 SRR372747.1.1 B2GA003:1:2:1046:1028 length=36
> head(subseq(id(fq.2), 1, width(id(fq.2))),1)
A BStringSet instance of length 1
    width seq
[1]    45 SRR372747.1.2 B2GA003:1:2:1046:1028 length=36
> identical(subseq(id(fq.1), 1, width(id(fq.1))),subseq(id(fq.2), 1, width(id(fq.2))))
[1] TRUE

ADD REPLY • link 7.3 years ago Frocha ▴ 20