shortread base quality

0

Entering edit mode

David ▴ 860

@david-3335

Last seen 6.0 years ago

Hi, I'm trying to get the quality scores for each nucleotide for each read from the fastq file. Here is how it starts. I know to get average scores for reads but not for each individual nucleotide of each read. file <- "file.fastq" fqfile <- paste(basename(file),"",sep="") path <- dirname(file) sp <- SolexaPath(path,dataPath=path,analysisPath=path) fq <- readFastq(sp, fqfile) #Get quality scores score <- alphabetScore(fq) #Gives the sum of the base quality scores for each read aveScore <- score / width(fq) #How can i get the score for each base for each read ???? thanks,

• 1.4k views

ADD COMMENT • link updated 11.7 years ago by Martin Morgan 25k • written 11.7 years ago by David ▴ 860

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 4 hours ago

United States

On 08/24/2012 07:09 AM, David martin wrote: > Hi, > I'm trying to get the quality scores for each nucleotide for each read > from the fastq file. > Here is how it starts. I know to get average scores for reads but not > for each individual nucleotide of each read. > > > > file <- "file.fastq" > fqfile <- paste(basename(file),"",sep="") > path <- dirname(file) > sp <- SolexaPath(path,dataPath=path,analysisPath=path) > fq <- readFastq(sp, fqfile) > > #Get quality scores > score <- alphabetScore(fq) > > #Gives the sum of the base quality scores for each read > aveScore <- score / width(fq) try alphabetFrequency, e.g., ragged = narrow(quality(rfq), 1, width=c(20, 30)) ## recycle width alf = alphabetFrequency(ragged) > dim(alf) [1] 256 94 > alf[1:5, 1:5] # not very exciting at this end... ! " # $ [1,] 0 0 0 0 0 [2,] 0 0 0 0 0 [3,] 0 0 0 0 0 [4,] 0 0 0 0 0 [5,] 0 0 0 0 0 Martin > > #How can i get the score for each base for each read ???? > > thanks, > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 11.7 years ago Martin Morgan 25k

0

Entering edit mode

Not really Martin, It's the score for each nucleotide so the matrix should be something like this 1 2 3 4 5 6 7 .. 30 [1,] 36 36 36 36 36 .. 36 #score for each nucleotide at each position of the read. [2,] 36 36 36 36 36 .. 36 [3,] 36 36 28 36 36 .. 36 [4,] 36 36 36 36 28 .. 36 [5,] 36 28 36 36 36 .. 36 Any idea ? On 08/24/2012 06:30 PM, Martin Morgan wrote: > On 08/24/2012 07:09 AM, David martin wrote: >> Hi, >> I'm trying to get the quality scores for each nucleotide for each read >> from the fastq file. >> Here is how it starts. I know to get average scores for reads but not >> for each individual nucleotide of each read. >> >> >> >> file <- "file.fastq" >> fqfile <- paste(basename(file),"",sep="") >> path <- dirname(file) >> sp <- SolexaPath(path,dataPath=path,analysisPath=path) >> fq <- readFastq(sp, fqfile) >> >> #Get quality scores >> score <- alphabetScore(fq) >> >> #Gives the sum of the base quality scores for each read >> aveScore <- score / width(fq) > > try alphabetFrequency, e.g., > > ragged = narrow(quality(rfq), 1, width=c(20, 30)) ## recycle width > alf = alphabetFrequency(ragged) > > > dim(alf) > [1] 256 94 > > alf[1:5, 1:5] # not very exciting at this end... > ! " # $ > [1,] 0 0 0 0 0 > [2,] 0 0 0 0 0 > [3,] 0 0 0 0 0 > [4,] 0 0 0 0 0 > [5,] 0 0 0 0 0 > > > Martin > >> >> #How can i get the score for each base for each read ???? >> >> thanks, >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD REPLY • link 11.7 years ago David ▴ 860

0

Entering edit mode

On 08/27/2012 12:43 AM, David martin wrote: > Not really Martin, > It's the score for each nucleotide so the matrix should be something > like this > > 1 2 3 4 5 6 7 .. 30 > [1,] 36 36 36 36 36 .. 36 #score for each nucleotide at each position > of the read. > [2,] 36 36 36 36 36 .. 36 > [3,] 36 36 28 36 36 .. 36 > [4,] 36 36 36 36 28 .. 36 > [5,] 36 28 36 36 36 .. 36 > > Any idea ? if the reads are all the same width (implied by the matrix above) then as(quality(rfq), "matrix"); sorry about the misleading info. Martin > > On 08/24/2012 06:30 PM, Martin Morgan wrote: >> On 08/24/2012 07:09 AM, David martin wrote: >>> Hi, >>> I'm trying to get the quality scores for each nucleotide for each read >>> from the fastq file. >>> Here is how it starts. I know to get average scores for reads but not >>> for each individual nucleotide of each read. >>> >>> >>> >>> file <- "file.fastq" >>> fqfile <- paste(basename(file),"",sep="") >>> path <- dirname(file) >>> sp <- SolexaPath(path,dataPath=path,analysisPath=path) >>> fq <- readFastq(sp, fqfile) >>> >>> #Get quality scores >>> score <- alphabetScore(fq) >>> >>> #Gives the sum of the base quality scores for each read >>> aveScore <- score / width(fq) >> >> try alphabetFrequency, e.g., >> >> ragged = narrow(quality(rfq), 1, width=c(20, 30)) ## recycle width >> alf = alphabetFrequency(ragged) >> >> > dim(alf) >> [1] 256 94 >> > alf[1:5, 1:5] # not very exciting at this end... >> ! " # $ >> [1,] 0 0 0 0 0 >> [2,] 0 0 0 0 0 >> [3,] 0 0 0 0 0 >> [4,] 0 0 0 0 0 >> [5,] 0 0 0 0 0 >> >> >> Martin >> >>> >>> #How can i get the score for each base for each read ???? >>> >>> thanks, >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 11.7 years ago Martin Morgan 25k

0

Entering edit mode

Thanks Martin, we are almost there. Reads are not the same size (range from 14 to 30 nt) . Any other idea ? Maybe build an empty matrix and fill in the scores with NA values when reads are short ?? On 08/27/2012 03:07 PM, Martin Morgan wrote: > On 08/27/2012 12:43 AM, David martin wrote: >> Not really Martin, >> It's the score for each nucleotide so the matrix should be something >> like this >> >> 1 2 3 4 5 6 7 .. 30 >> [1,] 36 36 36 36 36 .. 36 #score for each nucleotide at each position >> of the read. >> [2,] 36 36 36 36 36 .. 36 >> [3,] 36 36 28 36 36 .. 36 >> [4,] 36 36 36 36 28 .. 36 >> [5,] 36 28 36 36 36 .. 36 >> >> Any idea ? > > if the reads are all the same width (implied by the matrix above) then > as(quality(rfq), "matrix"); sorry about the misleading info. Martin > >> >> On 08/24/2012 06:30 PM, Martin Morgan wrote: >>> On 08/24/2012 07:09 AM, David martin wrote: >>>> Hi, >>>> I'm trying to get the quality scores for each nucleotide for each read >>>> from the fastq file. >>>> Here is how it starts. I know to get average scores for reads but not >>>> for each individual nucleotide of each read. >>>> >>>> >>>> >>>> file <- "file.fastq" >>>> fqfile <- paste(basename(file),"",sep="") >>>> path <- dirname(file) >>>> sp <- SolexaPath(path,dataPath=path,analysisPath=path) >>>> fq <- readFastq(sp, fqfile) >>>> >>>> #Get quality scores >>>> score <- alphabetScore(fq) >>>> >>>> #Gives the sum of the base quality scores for each read >>>> aveScore <- score / width(fq) >>> >>> try alphabetFrequency, e.g., >>> >>> ragged = narrow(quality(rfq), 1, width=c(20, 30)) ## recycle width >>> alf = alphabetFrequency(ragged) >>> >>> > dim(alf) >>> [1] 256 94 >>> > alf[1:5, 1:5] # not very exciting at this end... >>> ! " # $ >>> [1,] 0 0 0 0 0 >>> [2,] 0 0 0 0 0 >>> [3,] 0 0 0 0 0 >>> [4,] 0 0 0 0 0 >>> [5,] 0 0 0 0 0 >>> >>> >>> Martin >>> >>>> >>>> #How can i get the score for each base for each read ???? >>>> >>>> thanks, >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD REPLY • link 11.6 years ago David ▴ 860

0

Entering edit mode

On 08/27/2012 06:17 AM, David martin wrote: > Thanks Martin, we are almost there. > Reads are not the same size (range from 14 to 30 nt) . Any other idea ? > Maybe build an empty matrix and fill in the scores with NA values when > reads are short ?? I don't think there is anything built-in q = quality(rfq) w = width(q) m = matrix(NA_integer_, length(q), max(width(q))) for (i in seq(min(w), max(w))) m[w==i, 1:i] = as(q[w==i], "matrix") > > > On 08/27/2012 03:07 PM, Martin Morgan wrote: >> On 08/27/2012 12:43 AM, David martin wrote: >>> Not really Martin, >>> It's the score for each nucleotide so the matrix should be something >>> like this >>> >>> 1 2 3 4 5 6 7 .. 30 >>> [1,] 36 36 36 36 36 .. 36 #score for each nucleotide at each position >>> of the read. >>> [2,] 36 36 36 36 36 .. 36 >>> [3,] 36 36 28 36 36 .. 36 >>> [4,] 36 36 36 36 28 .. 36 >>> [5,] 36 28 36 36 36 .. 36 >>> >>> Any idea ? >> >> if the reads are all the same width (implied by the matrix above) then >> as(quality(rfq), "matrix"); sorry about the misleading info. Martin >> >>> >>> On 08/24/2012 06:30 PM, Martin Morgan wrote: >>>> On 08/24/2012 07:09 AM, David martin wrote: >>>>> Hi, >>>>> I'm trying to get the quality scores for each nucleotide for each read >>>>> from the fastq file. >>>>> Here is how it starts. I know to get average scores for reads but not >>>>> for each individual nucleotide of each read. >>>>> >>>>> >>>>> >>>>> file <- "file.fastq" >>>>> fqfile <- paste(basename(file),"",sep="") >>>>> path <- dirname(file) >>>>> sp <- SolexaPath(path,dataPath=path,analysisPath=path) >>>>> fq <- readFastq(sp, fqfile) >>>>> >>>>> #Get quality scores >>>>> score <- alphabetScore(fq) >>>>> >>>>> #Gives the sum of the base quality scores for each read >>>>> aveScore <- score / width(fq) >>>> >>>> try alphabetFrequency, e.g., >>>> >>>> ragged = narrow(quality(rfq), 1, width=c(20, 30)) ## recycle width >>>> alf = alphabetFrequency(ragged) >>>> >>>> > dim(alf) >>>> [1] 256 94 >>>> > alf[1:5, 1:5] # not very exciting at this end... >>>> ! " # $ >>>> [1,] 0 0 0 0 0 >>>> [2,] 0 0 0 0 0 >>>> [3,] 0 0 0 0 0 >>>> [4,] 0 0 0 0 0 >>>> [5,] 0 0 0 0 0 >>>> >>>> >>>> Martin >>>> >>>>> >>>>> #How can i get the score for each base for each read ???? >>>>> >>>>> thanks, >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 11.7 years ago Martin Morgan 25k

0

Entering edit mode

Excellent Martin, Works well. One quick question, my scores range from -29 to 8. This is normally an illumina 1.9 format (as detected by fastqc program) . Is there a quick way to ckeck within shortread which encoding has been used by shortRead and which version of illumina was recognized (like what fastqc does). On 08/27/2012 03:44 PM, Martin Morgan wrote: > q = quality(rfq) > w = width(q) > m = matrix(NA_integer_, length(q), max(width(q))) > for (i in seq(min(w), max(w))) > m[w==i, 1:i] = as(q[w==i], "matrix")

ADD REPLY • link 11.6 years ago David ▴ 860

0

Entering edit mode

On 08/27/2012 07:53 AM, David Vilanova wrote: > Excellent Martin, > Works well. > > One quick question, my scores range from -29 to 8. This is normally an > illumina 1.9 format (as detected by fastqc program) . > Is there a quick way to ckeck within shortread which encoding has been > used by shortRead and which version of illumina was recognized (like > what fastqc does). the class of quality(rfq) reflects the encoding ShortRead decided on, FastqQuality or SFastqQuality. Clearly wrong for your file. The heuristic is like alf <- alphabetFrequency(head(quality, 1000), collapse = TRUE) if (any(alf) && min(which(alf != 0)) < 59) { FastqQuality } else SFastqQuality Martin > > On 08/27/2012 03:44 PM, Martin Morgan wrote: >> q = quality(rfq) >> w = width(q) >> m = matrix(NA_integer_, length(q), max(width(q))) >> for (i in seq(min(w), max(w))) >> m[w==i, 1:i] = as(q[w==i], "matrix") > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 11.7 years ago Martin Morgan 25k

Login before adding your answer.