Q score filter to each base in QualityScaledDNAStringSet
1
0
Entering edit mode
XIA.PAN ▴ 20
@xiapan-12407
Last seen 2.0 years ago

Dear all,

I have a QualityScaledDNAStingSet and want to filter out reads that have bases below a quality threshold (say 20 for instance).

I can get the Q score one by one using as.integer(DNA_seqs@quality[1])

Is there a function to get the Q score of each object in the QualityScaledDNAStringSet and filter them with a threshold?

Thank you!

qualitycontrol • 545 views
2
Entering edit mode
Mike Smith ★ 5.2k
@mike-smith
Last seen 54 minutes ago
EMBL Heidelberg / de.NBI

You can do something like this.  First we'll create an example QualityScaledDNAStringSet using the example in the help page

library(Biostrings)
x1 <- DNAStringSet(c("TTGA", "CTCN"))
q1 <- PhredQuality(c("*+,-", "6789"))
qx1 <- QualityScaledDNAStringSet(x1, q1)

Then we'll create a list containing the integer versions of the quality scores.  This list has one entry for each sequence in our StringSet

quals_list <- as(quality(qx1), "IntegerList")
> quals_list
IntegerList of length 2
[[1]] 9 10 11 12
[[2]] 21 22 23 24

Now we apply a function to each of these vectors of quality scores.  In this case we are going to check if all of the entries in each vector are greater than 20, and return TRUE or FALSE.  You could put whatever function you want in here based on your criteria of 'good quality'.

good_quality <- sapply(quals_list, FUN = function(x) {
return(all(x >= 20))
})

Finally we subset by this set of TRUE/FALSE values to keep only the good ones.

qx1_good <- qx1[good_quality]
> qx1_good
A QualityScaledDNAStringSet instance containing:

A DNAStringSet instance of length 1
width seq
[1]     4 CTCN

A PhredQuality instance of length 1
width seq
[1]     4 6789

1
Entering edit mode

It's much more efficient to work on the *List with single function calls

> all(as(quality(qx1), "IntegerList") >= 20)
[1] FALSE  TRUE