Phred encoding
1
0
Entering edit mode
@taylor-sean-d-5800
Last seen 9.6 years ago
Hi, I'm trying to quality filter my NGS reads and want to filter out reads that have bases below a quality threshold (say 23 for instance, using Illumina MiSeq with an offset of 33). Can anyone tell me why the results of the two functions as.raw() and as.numeric() give different results? qual<-PhredQuality(c("BBBBBFFB4!")) as.raw(unlist(qual)) as.numeric(unlist(qual)) Thanks, Sean
• 1.6k views
ADD COMMENT
0
Entering edit mode
@vincent-j-carey-jr-4
Last seen 5 weeks ago
United States
from ?as.raw A raw vector is printed with each byte separately represented as a pair of hex digits. If you want to see a character representation On Wed, Aug 14, 2013 at 6:25 PM, Taylor, Sean D <sdtaylor@fhcrc.org> wrote: > Hi, > > I'm trying to quality filter my NGS reads and want to filter out reads > that have bases below a quality threshold (say 23 for instance, using > Illumina MiSeq with an offset of 33). Can anyone tell me why the results of > the two functions as.raw() and as.numeric() give different results? > > qual<-PhredQuality(c("BBBBBFFB4!")) > as.raw(unlist(qual)) > as.numeric(unlist(qual)) > > Thanks, > Sean > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Thanks for the response Vincent. I'm afraid I still don't understand what as.raw() is doing differently. Did part of your reply get cut off? After consulting ?as.raw() I still would have expected the answer that as.numeric() is generating (based on my limited understanding). From: Vincent Carey [mailto:stvjc@channing.harvard.edu] Sent: Wednesday, August 14, 2013 6:32 PM To: Taylor, Sean D Cc: bioconductor@r-project.org Subject: Re: [BioC] Phred encoding from ?as.raw A raw vector is printed with each byte separately represented as a pair of hex digits. If you want to see a character representation On Wed, Aug 14, 2013 at 6:25 PM, Taylor, Sean D <sdtaylor@fhcrc.org<mailto:sdtaylor@fhcrc.org>> wrote: Hi, I'm trying to quality filter my NGS reads and want to filter out reads that have bases below a quality threshold (say 23 for instance, using Illumina MiSeq with an offset of 33). Can anyone tell me why the results of the two functions as.raw() and as.numeric() give different results? qual<-PhredQuality(c("BBBBBFFB4!")) as.raw(unlist(qual)) as.numeric(unlist(qual)) Thanks, Sean _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
On 08/15/2013 09:49 AM, Taylor, Sean D wrote: > Thanks for the response Vincent. I'm afraid I still don't understand what as.raw() is doing differently. Did part of your reply get cut off? After consulting ?as.raw() I still would have expected the answer that as.numeric() is generating (based on my limited understanding). > I'd be interested in knowing what your objective is -- drop reads with some low quality calls? drop reads with overall low quality? trim reads of low quality heads / tails? Anway, 'unlist' strips the 'Quality' class > unlist(qual) 10-letter "BString" instance seq: BBBBBFFB4! so any operations are based on 'BString' without reference to encoding. as.integer / as.numeric then return the ascii symbol (http://www.asciitable.com/) of the corresponding letter > as.integer(unlist(qual)) [1] 66 66 66 66 66 70 70 66 52 33 > as.numeric(unlist(qual)) [1] 66 66 66 66 66 70 70 66 52 33 as.raw on a BString returns the raw (hexadecimal) representation of the ascii encoding > selectMethod(as.raw, "BString") Method Definition: function (x) as.raw(as.integer(x)) <environment: namespace:xvector=""> Signatures: x target "BString" defined "XRaw" > as.raw(1:100) [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 [26] 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 [51] 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41 42 43 44 45 46 47 48 49 4a 4b [76] 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f 60 61 62 63 64 > as.raw(66) [1] 42 From ?PhredQuality and rectangular data perhaps you'd like rowSums(as(qual, "matrix") < 23) == 0 or for irregular data rowSums(as(FastqQuality(qual), "matrix") < 23, na.rm=TRUE) == 0 also ?trimTails in ShortRead might be relevant. Martin > From: Vincent Carey [mailto:stvjc at channing.harvard.edu] > Sent: Wednesday, August 14, 2013 6:32 PM > To: Taylor, Sean D > Cc: bioconductor at r-project.org > Subject: Re: [BioC] Phred encoding > > from ?as.raw > > A raw vector is printed with each byte separately represented as a > pair of hex digits. If you want to see a character representation > > > On Wed, Aug 14, 2013 at 6:25 PM, Taylor, Sean D <sdtaylor at="" fhcrc.org<mailto:sdtaylor="" at="" fhcrc.org="">> wrote: > Hi, > > I'm trying to quality filter my NGS reads and want to filter out reads that have bases below a quality threshold (say 23 for instance, using Illumina MiSeq with an offset of 33). Can anyone tell me why the results of the two functions as.raw() and as.numeric() give different results? > > qual<-PhredQuality(c("BBBBBFFB4!")) > as.raw(unlist(qual)) > as.numeric(unlist(qual)) > > Thanks, > Sean > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLY
0
Entering edit mode
Martin and others, I'm following a methodology in a paper for some NGS sequence analysis (He et al. Nature, 2010, vol 464, p610). In this paper, they apply three quality filters to the reads (36 bp reads from Genome Analyser II instrument): 1. all 36 bases in a read are required to have a Phred score of at least 23 2. No 'N' base was allowed anywhere in the 36 bases 3. not more than 3 mismatches permitted in the 36 bases. After quality filtering, a mismatch was categorized as a true mutation if 1. the identical mutation was identified in 10 distinct reads (i.e. read whose first base was a t10 different positions) 2. the identical mutation was identified in at least 3 forward reads and 3 reverse reads. For my data, I'm using 151bp reads from Illumina MySeq, some of these numbers are adjusted, but this is the overall analysis that I am attempting. My question pertained to the first quality filter. The solution that I came up with implements as.numeric(), but I wasn't sure why as.raw() didn't work. ############### #filter reads by quality ############## gal<-readGAlignments(bamfile, use.names=TRUE, param=param) qual <- mcols(gal)$qual qualNum<-lapply(qual, function(x) as.numeric(x)-33) #list of quality strings idx<-sapply(seq(length(qual)), function(i) length(which(qualNum[[i]]<qualitythreshold))==0) #returns="" true="" for="" any="" quality="" string="" that="" does="" not="" have="" a="" quality="" under="" the="" threshold="" galq<-gal[idx]="" for="" what="" its="" worth="" here="" are="" my="" solutions="" for="" the="" other="" quality="" filtering="" criteria.="" if="" anyone="" has="" a="" more="" efficient="" implementation,="" that="" would="" be="" really="" cool.="" i="" have="" solutions="" for="" the="" mutation="" calling="" that="" i'm="" still="" hammering="" out="" the="" details="" on.="" ############="" #filter="" reads="" by="" ns="" ############="" qseq<-setnames(mcols(galq)$seq,="" names(galq))="" numberofn<-as.vector(letterfrequency(qseq,"n"))="" galqn<-galq[which(numberofn="=0)]" ###########="" #filter="" reads="" by="" mismatches,="" #no="" more="" than="" 2="" mismatches="" allowed="" in="" first="" 50="" bp="" #no="" more="" than="" 3="" mismatches="" allowed="" in="" entire="" read="" #############="" qseq<-mcols(galqn)$seq="" qseq_on_ref="" <-="" sequencelayer(qseq,="" cigar(galqnl),="" from="query" ,="" to="reference" )="" pos<-start(galqn)="" ends<-end(galqn)="" mismatch.index<-sapply(seq(length(qseq_on_ref)),="" function(x){="" alignment<-append(subseq(refseq,="" start="pos[[x]]," end="ends[[x]])," qseq_on_ref[x])="" cm<-consensusmatrix(alignment,="" as.prob="T)" cs<-dnastring(consensusstring(cm,="" ambiguitymap="N" ,="" threshold="0.5))" if(!letterfrequency(cs[1:50],="" "n")<="2)" return="" (f)="" return="" (letterfrequency(cs,="" "n")<="3)" })="" galqnm<-galqn[mismatch.index]="" as="" always,="" thanks="" to="" all="" for="" your="" patient="" explaining="" of="" things="" to="" me.="" this="" mailing="" list="" is="" extraordinarily="" helpful="" and="" i="" hope="" that="" i="" have="" not="" abused="" it="" with="" my="" questioning.="" thanks,="" sean="" david="" taylor="" postdoctoral="" fellow="" fred="" hutchinson="" cancer="" research="" center="" 206-667-5544="" ________________________________________="" from:="" martin="" morgan="" [mtmorgan@fhcrc.org]="" sent:="" thursday,="" august="" 15,="" 2013="" 10:05="" am="" to:="" taylor,="" sean="" d="" cc:="" vincent="" carey;="" bioconductor="" at="" r-project.org="" subject:="" re:="" [bioc]="" phred="" encoding="" on="" 08="" 15="" 2013="" 09:49="" am,="" taylor,="" sean="" d="" wrote:=""> Thanks for the response Vincent. I'm afraid I still don't understand what as.raw() is doing differently. Did part of your reply get cut off? After consulting ?as.raw() I still would have expected the answer that as.numeric() is generating (based on my limited understanding). > I'd be interested in knowing what your objective is -- drop reads with some low quality calls? drop reads with overall low quality? trim reads of low quality heads / tails? Anway, 'unlist' strips the 'Quality' class > unlist(qual) 10-letter "BString" instance seq: BBBBBFFB4! so any operations are based on 'BString' without reference to encoding. as.integer / as.numeric then return the ascii symbol (http://www.asciitable.com/) of the corresponding letter > as.integer(unlist(qual)) [1] 66 66 66 66 66 70 70 66 52 33 > as.numeric(unlist(qual)) [1] 66 66 66 66 66 70 70 66 52 33 as.raw on a BString returns the raw (hexadecimal) representation of the ascii encoding > selectMethod(as.raw, "BString") Method Definition: function (x) as.raw(as.integer(x)) <environment: namespace:xvector=""> Signatures: x target "BString" defined "XRaw" > as.raw(1:100) [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 [26] 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 [51] 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41 42 43 44 45 46 47 48 49 4a 4b [76] 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f 60 61 62 63 64 > as.raw(66) [1] 42 From ?PhredQuality and rectangular data perhaps you'd like rowSums(as(qual, "matrix") < 23) == 0 or for irregular data rowSums(as(FastqQuality(qual), "matrix") < 23, na.rm=TRUE) == 0 also ?trimTails in ShortRead might be relevant. Martin > From: Vincent Carey [mailto:stvjc at channing.harvard.edu] > Sent: Wednesday, August 14, 2013 6:32 PM > To: Taylor, Sean D > Cc: bioconductor at r-project.org > Subject: Re: [BioC] Phred encoding > > from ?as.raw > > A raw vector is printed with each byte separately represented as a > pair of hex digits. If you want to see a character representation > > > On Wed, Aug 14, 2013 at 6:25 PM, Taylor, Sean D <sdtaylor at="" fhcrc.org<mailto:sdtaylor="" at="" fhcrc.org="">> wrote: > Hi, > > I'm trying to quality filter my NGS reads and want to filter out reads that have bases below a quality threshold (say 23 for instance, using Illumina MiSeq with an offset of 33). Can anyone tell me why the results of the two functions as.raw() and as.numeric() give different results? > > qual<-PhredQuality(c("BBBBBFFB4!")) > as.raw(unlist(qual)) > as.numeric(unlist(qual)) > > Thanks, > Sean > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLY

Login before adding your answer.

Traffic: 462 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6