Question

Q score filter to each base in QualityScaledDNAStringSet

XIA.PAN ▴ 20

@xiapan-12407

Dear all,

I have a QualityScaledDNAStringSet and want to filter out reads that contain any base below a quality threshold (say 20, for instance).

I can get the Q scores one sequence at a time using as.integer(DNA_seqs@quality[1]).

Is there a function to get the Q score of each object in the QualityScaledDNAStringSet and filter them with a threshold?

Thank you!

updated 8.0 years ago by Mike Smith ★ 6.6k • written 8.0 years ago by XIA.PAN ▴ 20

score 2 · Answer 1 · 2017-03-10

You can do something like this. First we'll create an example QualityScaledDNAStringSet, using the example from the help page:

library(Biostrings)
x1 <- DNAStringSet(c("TTGA", "CTCN"))
q1 <- PhredQuality(c("*+,-", "6789"))
qx1 <- QualityScaledDNAStringSet(x1, q1)

Then we'll create a list containing the integer versions of the quality scores. This list has one entry for each sequence in our StringSet:

quals_list <- as(quality(qx1), "IntegerList")

> quals_list
IntegerList of length 2
[[1]] 9 10 11 12
[[2]] 21 22 23 24
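If it helps to see where these integers come from: the values above are consistent with the usual Phred+33 encoding, where each quality character's ASCII code minus 33 gives the score. A minimal base R sketch of that arithmetic follows; `phred_to_int` is a hypothetical helper name used only for illustration, not a Biostrings function.

```r
# Hypothetical helper (not part of Biostrings): decode a Phred+33 quality
# string by subtracting the offset 33 from each character's ASCII code.
phred_to_int <- function(s) utf8ToInt(s) - 33L

phred_to_int("*+,-")  # 9 10 11 12
phred_to_int("6789")  # 21 22 23 24
```

In practice the `as(quality(qx1), "IntegerList")` call does this conversion for you; the sketch is only meant to make the numbers less mysterious.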

Now we apply a function to each of these vectors of quality scores. In this case we check whether all of the entries in each vector are at least 20 (>= 20), and return TRUE or FALSE. You could put whatever function you want in here, based on your own criteria for 'good quality'.

# TRUE for a read only if every base has a quality score of at least 20
good_quality <- sapply(quals_list, FUN = function(x) {
    all(x >= 20)
})
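To illustrate swapping in a different criterion, here is a sketch using the same `sapply()` pattern with two alternative definitions of 'good quality'. The four toy reads and the names `qx2`, `quals2`, `mean_ok` and `few_low` are made up for this example, and the cutoffs are arbitrary.

```r
library(Biostrings)

# Toy data (invented for illustration): "I" decodes to Q40 and "#" to Q2
x2  <- DNAStringSet(rep("ACGT", 4))
q2  <- PhredQuality(c("IIII", "II#I", "I##I", "####"))
qx2 <- QualityScaledDNAStringSet(x2, q2)
quals2 <- as(quality(qx2), "IntegerList")

# Criterion A: mean quality across the read is at least 20
mean_ok <- sapply(quals2, function(x) mean(x) >= 20)

# Criterion B: tolerate at most one base below 20
few_low <- sapply(quals2, function(x) sum(x < 20) <= 1)

mean_ok  # TRUE TRUE TRUE FALSE
few_low  # TRUE TRUE FALSE FALSE
```

The strict `all(x >= 20)` test would keep only the first of these reads, so the choice of criterion can change the result considerably.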

Finally we subset by this set of TRUE/FALSE values to keep only the good ones.

qx1_good <- qx1[good_quality]

> qx1_good
  A QualityScaledDNAStringSet instance containing:

  A DNAStringSet instance of length 1
    width seq
[1]     4 CTCN

  A PhredQuality instance of length 1
    width seq
[1]     4 6789
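The steps above could also be wrapped into a small reusable function. This is only a sketch: `filter_min_quality` is a hypothetical name, not something Biostrings provides, and it assumes that `min()` applied to an IntegerList returns one minimum per list element (which is how the IRanges list classes behave, as far as I know).

```r
library(Biostrings)

# Hypothetical helper (not part of Biostrings): keep only the reads whose
# lowest base quality is at least `threshold`.
filter_min_quality <- function(reads, threshold = 20L) {
    quals <- as(quality(reads), "IntegerList")
    # min() on an IntegerList is assumed to give one value per read
    reads[min(quals) >= threshold]
}

# Same example data as above
qx1 <- QualityScaledDNAStringSet(
    DNAStringSet(c("TTGA", "CTCN")),
    PhredQuality(c("*+,-", "6789"))
)
kept <- filter_min_quality(qx1, threshold = 20L)
```

With the example data this should keep the same single read (CTCN) as the step-by-step version, and avoiding the per-read `sapply()` call may be noticeably faster on large sets.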