possible bug in ShortRead qa/ppnCount
1
0
Entering edit mode
Janet Young ▴ 740
@janet-young-2360
Last seen 4.5 years ago
Fred Hutchinson Cancer Research Center,…
Hi Martin (seems like you will be the one to pick this up), I found an obscure bug in ShortRead:::.ppnCount - it'll be an easy fix, I think. I hope the code below should explain all, but if not let me know. thanks, Janet library(ShortRead) sp <- SolexaPath(system.file("extdata", package = "ShortRead")) file1 <- file.path(analysisPath(sp), pattern="s_1_sequence.txt") ### take the most common sequence as a fake adaptor sequence seqs1 <- readFastq(file1) fakeAdaptor1 <- names(which(table(as.character(sread(seqs1)))==max(tab le(as.character(sread(seqs1)))))[1]) #### run qa qa1 <- qa(readFastq(file1), basename(file1), Lpattern=fakeAdaptor1) #### run qa forcing it to think sample name is different qa1a <- qa(readFastq(file1), paste(basename(file1),"_a",sep=""), Lpattern=fakeAdaptor1) ### combine qaboth <- do.call(rbind, list(qa1,qa1a) ) ### looking directly at adapterContamination gives correct answer (same contamination level in both) qaboth[["adapterContamination"]] # lane contamination # 1 s_1_sequence.txt 0.01171875 # 2 s_1_sequence.txt_a 0.01171875 ### but looking at adapterContamination using ppnCount (as called by report(qa)) gives wrong answer - it's dividing the counts in column 2 by the names (as factors) in column 1. ppnCount should check for column 1 being numeric as well as just not null ShortRead:::.ppnCount(qaboth[["adapterContamination"]]) # lane contamination # 1 s_1_sequence.txt 0.012 # 2 s_1_sequence.txt_a 0.006 sessionInfo() R version 2.14.0 (2011-10-31) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.12.0 latticeExtra_0.6-19 RColorBrewer_1.0-5 Rsamtools_1.6.1 [5] lattice_0.20-0 Biostrings_2.22.0 GenomicRanges_1.6.2 IRanges_1.12.1 loaded via a namespace (and not attached): [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.0 [5] hwriter_1.3 RCurl_1.7-0 rtracklayer_1.14.2 tools_2.14.0 [9] XML_3.4-3 zlibbioc_1.0.0
• 887 views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 days ago
United States
On 11/16/2011 05:32 PM, Janet Young wrote: > Hi Martin (seems like you will be the one to pick this up), > > I found an obscure bug in ShortRead:::.ppnCount - it'll be an easy > fix, I think. I hope the code below should explain all, but if not > let me know. Thanks Janet should be fixed when 1.12.1 or 1.13.3 makes it to the outside world. Martin > > thanks, > > Janet > > > > library(ShortRead) > > sp<- SolexaPath(system.file("extdata", package = "ShortRead")) > file1<- file.path(analysisPath(sp), pattern="s_1_sequence.txt") > > ### take the most common sequence as a fake adaptor sequence seqs1<- > readFastq(file1) fakeAdaptor1<- > names(which(table(as.character(sread(seqs1)))==max(table(as.characte r(sread(seqs1)))))[1]) > > #### run qa qa1<- qa(readFastq(file1), basename(file1), > Lpattern=fakeAdaptor1) > > #### run qa forcing it to think sample name is different qa1a<- > qa(readFastq(file1), paste(basename(file1),"_a",sep=""), > Lpattern=fakeAdaptor1) > > ### combine qaboth<- do.call(rbind, list(qa1,qa1a) ) > > ### looking directly at adapterContamination gives correct answer > (same contamination level in both) qaboth[["adapterContamination"]] > > # lane contamination # 1 s_1_sequence.txt > 0.01171875 # 2 s_1_sequence.txt_a 0.01171875 > > ### but looking at adapterContamination using ppnCount (as called by > report(qa)) gives wrong answer - it's dividing the counts in column 2 > by the names (as factors) in column 1. ppnCount should check for > column 1 being numeric as well as just not null > > ShortRead:::.ppnCount(qaboth[["adapterContamination"]]) > > # lane contamination # 1 s_1_sequence.txt > 0.012 # 2 s_1_sequence.txt_a 0.006 > > > > sessionInfo() > > R version 2.14.0 (2011-10-31) Platform: i386-apple-darwin9.8.0/i386 > (32-bit) > > locale: [1] > en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: [1] stats graphics grDevices utils > datasets methods base > > other attached packages: [1] ShortRead_1.12.0 latticeExtra_0.6-19 > RColorBrewer_1.0-5 Rsamtools_1.6.1 [5] lattice_0.20-0 > Biostrings_2.22.0 GenomicRanges_1.6.2 IRanges_1.12.1 > > loaded via a namespace (and not attached): [1] Biobase_2.14.0 > bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.0 [5] hwriter_1.3 > RCurl_1.7-0 rtracklayer_1.14.2 tools_2.14.0 [9] XML_3.4-3 > zlibbioc_1.0.0 > > _______________________________________________ Bioconductor mailing > list Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the > archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
ADD COMMENT

Login before adding your answer.

Traffic: 759 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6