shortread base quality
1
0
Entering edit mode
David ▴ 860
@david-3335
Last seen 6.0 years ago
Hi, I'm trying to get the quality scores for each nucleotide for each read from the fastq file. Here is how it starts. I know to get average scores for reads but not for each individual nucleotide of each read. file <- "file.fastq" fqfile <- paste(basename(file),"",sep="") path <- dirname(file) sp <- SolexaPath(path,dataPath=path,analysisPath=path) fq <- readFastq(sp, fqfile) #Get quality scores score <- alphabetScore(fq) #Gives the sum of the base quality scores for each read aveScore <- score / width(fq) #How can i get the score for each base for each read ???? thanks,
• 1.4k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 hours ago
United States
On 08/24/2012 07:09 AM, David martin wrote: > Hi, > I'm trying to get the quality scores for each nucleotide for each read > from the fastq file. > Here is how it starts. I know to get average scores for reads but not > for each individual nucleotide of each read. > > > > file <- "file.fastq" > fqfile <- paste(basename(file),"",sep="") > path <- dirname(file) > sp <- SolexaPath(path,dataPath=path,analysisPath=path) > fq <- readFastq(sp, fqfile) > > #Get quality scores > score <- alphabetScore(fq) > > #Gives the sum of the base quality scores for each read > aveScore <- score / width(fq) try alphabetFrequency, e.g., ragged = narrow(quality(rfq), 1, width=c(20, 30)) ## recycle width alf = alphabetFrequency(ragged) > dim(alf) [1] 256 94 > alf[1:5, 1:5] # not very exciting at this end... ! " # $ [1,] 0 0 0 0 0 [2,] 0 0 0 0 0 [3,] 0 0 0 0 0 [4,] 0 0 0 0 0 [5,] 0 0 0 0 0 Martin > > #How can i get the score for each base for each read ???? > > thanks, > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
Not really Martin, It's the score for each nucleotide so the matrix should be something like this 1 2 3 4 5 6 7 .. 30 [1,] 36 36 36 36 36 .. 36 #score for each nucleotide at each position of the read. [2,] 36 36 36 36 36 .. 36 [3,] 36 36 28 36 36 .. 36 [4,] 36 36 36 36 28 .. 36 [5,] 36 28 36 36 36 .. 36 Any idea ? On 08/24/2012 06:30 PM, Martin Morgan wrote: > On 08/24/2012 07:09 AM, David martin wrote: >> Hi, >> I'm trying to get the quality scores for each nucleotide for each read >> from the fastq file. >> Here is how it starts. I know to get average scores for reads but not >> for each individual nucleotide of each read. >> >> >> >> file <- "file.fastq" >> fqfile <- paste(basename(file),"",sep="") >> path <- dirname(file) >> sp <- SolexaPath(path,dataPath=path,analysisPath=path) >> fq <- readFastq(sp, fqfile) >> >> #Get quality scores >> score <- alphabetScore(fq) >> >> #Gives the sum of the base quality scores for each read >> aveScore <- score / width(fq) > > try alphabetFrequency, e.g., > > ragged = narrow(quality(rfq), 1, width=c(20, 30)) ## recycle width > alf = alphabetFrequency(ragged) > > > dim(alf) > [1] 256 94 > > alf[1:5, 1:5] # not very exciting at this end... > ! " # $ > [1,] 0 0 0 0 0 > [2,] 0 0 0 0 0 > [3,] 0 0 0 0 0 > [4,] 0 0 0 0 0 > [5,] 0 0 0 0 0 > > > Martin > >> >> #How can i get the score for each base for each read ???? >> >> thanks, >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >
ADD REPLY
0
Entering edit mode
On 08/27/2012 12:43 AM, David martin wrote: > Not really Martin, > It's the score for each nucleotide so the matrix should be something > like this > > 1 2 3 4 5 6 7 .. 30 > [1,] 36 36 36 36 36 .. 36 #score for each nucleotide at each position > of the read. > [2,] 36 36 36 36 36 .. 36 > [3,] 36 36 28 36 36 .. 36 > [4,] 36 36 36 36 28 .. 36 > [5,] 36 28 36 36 36 .. 36 > > Any idea ? if the reads are all the same width (implied by the matrix above) then as(quality(rfq), "matrix"); sorry about the misleading info. Martin > > On 08/24/2012 06:30 PM, Martin Morgan wrote: >> On 08/24/2012 07:09 AM, David martin wrote: >>> Hi, >>> I'm trying to get the quality scores for each nucleotide for each read >>> from the fastq file. >>> Here is how it starts. I know to get average scores for reads but not >>> for each individual nucleotide of each read. >>> >>> >>> >>> file <- "file.fastq" >>> fqfile <- paste(basename(file),"",sep="") >>> path <- dirname(file) >>> sp <- SolexaPath(path,dataPath=path,analysisPath=path) >>> fq <- readFastq(sp, fqfile) >>> >>> #Get quality scores >>> score <- alphabetScore(fq) >>> >>> #Gives the sum of the base quality scores for each read >>> aveScore <- score / width(fq) >> >> try alphabetFrequency, e.g., >> >> ragged = narrow(quality(rfq), 1, width=c(20, 30)) ## recycle width >> alf = alphabetFrequency(ragged) >> >> > dim(alf) >> [1] 256 94 >> > alf[1:5, 1:5] # not very exciting at this end... >> ! " # $ >> [1,] 0 0 0 0 0 >> [2,] 0 0 0 0 0 >> [3,] 0 0 0 0 0 >> [4,] 0 0 0 0 0 >> [5,] 0 0 0 0 0 >> >> >> Martin >> >>> >>> #How can i get the score for each base for each read ???? >>> >>> thanks, >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLY
0
Entering edit mode
Thanks Martin, we are almost there. Reads are not the same size (range from 14 to 30 nt) . Any other idea ? Maybe build an empty matrix and fill in the scores with NA values when reads are short ?? On 08/27/2012 03:07 PM, Martin Morgan wrote: > On 08/27/2012 12:43 AM, David martin wrote: >> Not really Martin, >> It's the score for each nucleotide so the matrix should be something >> like this >> >> 1 2 3 4 5 6 7 .. 30 >> [1,] 36 36 36 36 36 .. 36 #score for each nucleotide at each position >> of the read. >> [2,] 36 36 36 36 36 .. 36 >> [3,] 36 36 28 36 36 .. 36 >> [4,] 36 36 36 36 28 .. 36 >> [5,] 36 28 36 36 36 .. 36 >> >> Any idea ? > > if the reads are all the same width (implied by the matrix above) then > as(quality(rfq), "matrix"); sorry about the misleading info. Martin > >> >> On 08/24/2012 06:30 PM, Martin Morgan wrote: >>> On 08/24/2012 07:09 AM, David martin wrote: >>>> Hi, >>>> I'm trying to get the quality scores for each nucleotide for each read >>>> from the fastq file. >>>> Here is how it starts. I know to get average scores for reads but not >>>> for each individual nucleotide of each read. >>>> >>>> >>>> >>>> file <- "file.fastq" >>>> fqfile <- paste(basename(file),"",sep="") >>>> path <- dirname(file) >>>> sp <- SolexaPath(path,dataPath=path,analysisPath=path) >>>> fq <- readFastq(sp, fqfile) >>>> >>>> #Get quality scores >>>> score <- alphabetScore(fq) >>>> >>>> #Gives the sum of the base quality scores for each read >>>> aveScore <- score / width(fq) >>> >>> try alphabetFrequency, e.g., >>> >>> ragged = narrow(quality(rfq), 1, width=c(20, 30)) ## recycle width >>> alf = alphabetFrequency(ragged) >>> >>> > dim(alf) >>> [1] 256 94 >>> > alf[1:5, 1:5] # not very exciting at this end... >>> ! " # $ >>> [1,] 0 0 0 0 0 >>> [2,] 0 0 0 0 0 >>> [3,] 0 0 0 0 0 >>> [4,] 0 0 0 0 0 >>> [5,] 0 0 0 0 0 >>> >>> >>> Martin >>> >>>> >>>> #How can i get the score for each base for each read ???? >>>> >>>> thanks, >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >
ADD REPLY
0
Entering edit mode
On 08/27/2012 06:17 AM, David martin wrote: > Thanks Martin, we are almost there. > Reads are not the same size (range from 14 to 30 nt) . Any other idea ? > Maybe build an empty matrix and fill in the scores with NA values when > reads are short ?? I don't think there is anything built-in q = quality(rfq) w = width(q) m = matrix(NA_integer_, length(q), max(width(q))) for (i in seq(min(w), max(w))) m[w==i, 1:i] = as(q[w==i], "matrix") > > > On 08/27/2012 03:07 PM, Martin Morgan wrote: >> On 08/27/2012 12:43 AM, David martin wrote: >>> Not really Martin, >>> It's the score for each nucleotide so the matrix should be something >>> like this >>> >>> 1 2 3 4 5 6 7 .. 30 >>> [1,] 36 36 36 36 36 .. 36 #score for each nucleotide at each position >>> of the read. >>> [2,] 36 36 36 36 36 .. 36 >>> [3,] 36 36 28 36 36 .. 36 >>> [4,] 36 36 36 36 28 .. 36 >>> [5,] 36 28 36 36 36 .. 36 >>> >>> Any idea ? >> >> if the reads are all the same width (implied by the matrix above) then >> as(quality(rfq), "matrix"); sorry about the misleading info. Martin >> >>> >>> On 08/24/2012 06:30 PM, Martin Morgan wrote: >>>> On 08/24/2012 07:09 AM, David martin wrote: >>>>> Hi, >>>>> I'm trying to get the quality scores for each nucleotide for each read >>>>> from the fastq file. >>>>> Here is how it starts. I know to get average scores for reads but not >>>>> for each individual nucleotide of each read. >>>>> >>>>> >>>>> >>>>> file <- "file.fastq" >>>>> fqfile <- paste(basename(file),"",sep="") >>>>> path <- dirname(file) >>>>> sp <- SolexaPath(path,dataPath=path,analysisPath=path) >>>>> fq <- readFastq(sp, fqfile) >>>>> >>>>> #Get quality scores >>>>> score <- alphabetScore(fq) >>>>> >>>>> #Gives the sum of the base quality scores for each read >>>>> aveScore <- score / width(fq) >>>> >>>> try alphabetFrequency, e.g., >>>> >>>> ragged = narrow(quality(rfq), 1, width=c(20, 30)) ## recycle width >>>> alf = alphabetFrequency(ragged) >>>> >>>> > dim(alf) >>>> [1] 256 94 >>>> > alf[1:5, 1:5] # not very exciting at this end... >>>> ! " # $ >>>> [1,] 0 0 0 0 0 >>>> [2,] 0 0 0 0 0 >>>> [3,] 0 0 0 0 0 >>>> [4,] 0 0 0 0 0 >>>> [5,] 0 0 0 0 0 >>>> >>>> >>>> Martin >>>> >>>>> >>>>> #How can i get the score for each base for each read ???? >>>>> >>>>> thanks, >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLY
0
Entering edit mode
Excellent Martin, Works well. One quick question, my scores range from -29 to 8. This is normally an illumina 1.9 format (as detected by fastqc program) . Is there a quick way to ckeck within shortread which encoding has been used by shortRead and which version of illumina was recognized (like what fastqc does). On 08/27/2012 03:44 PM, Martin Morgan wrote: > q = quality(rfq) > w = width(q) > m = matrix(NA_integer_, length(q), max(width(q))) > for (i in seq(min(w), max(w))) > m[w==i, 1:i] = as(q[w==i], "matrix")
ADD REPLY
0
Entering edit mode
On 08/27/2012 07:53 AM, David Vilanova wrote: > Excellent Martin, > Works well. > > One quick question, my scores range from -29 to 8. This is normally an > illumina 1.9 format (as detected by fastqc program) . > Is there a quick way to ckeck within shortread which encoding has been > used by shortRead and which version of illumina was recognized (like > what fastqc does). the class of quality(rfq) reflects the encoding ShortRead decided on, FastqQuality or SFastqQuality. Clearly wrong for your file. The heuristic is like alf <- alphabetFrequency(head(quality, 1000), collapse = TRUE) if (any(alf) && min(which(alf != 0)) < 59) { FastqQuality } else SFastqQuality Martin > > On 08/27/2012 03:44 PM, Martin Morgan wrote: >> q = quality(rfq) >> w = width(q) >> m = matrix(NA_integer_, length(q), max(width(q))) >> for (i in seq(min(w), max(w))) >> m[w==i, 1:i] = as(q[w==i], "matrix") > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLY

Login before adding your answer.

Traffic: 813 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6