possible bug in ShortRead qa/ppnCount

0

Entering edit mode

Janet Young ▴ 740

@janet-young-2360

Last seen 4.5 years ago

Fred Hutchinson Cancer Research Center,…

Hi Martin (seems like you will be the one to pick this up), I found an obscure bug in ShortRead:::.ppnCount - it'll be an easy fix, I think. I hope the code below should explain all, but if not let me know. thanks, Janet library(ShortRead) sp <- SolexaPath(system.file("extdata", package = "ShortRead")) file1 <- file.path(analysisPath(sp), pattern="s_1_sequence.txt") ### take the most common sequence as a fake adaptor sequence seqs1 <- readFastq(file1) fakeAdaptor1 <- names(which(table(as.character(sread(seqs1)))==max(tab le(as.character(sread(seqs1)))))[1]) #### run qa qa1 <- qa(readFastq(file1), basename(file1), Lpattern=fakeAdaptor1) #### run qa forcing it to think sample name is different qa1a <- qa(readFastq(file1), paste(basename(file1),"_a",sep=""), Lpattern=fakeAdaptor1) ### combine qaboth <- do.call(rbind, list(qa1,qa1a) ) ### looking directly at adapterContamination gives correct answer (same contamination level in both) qaboth[["adapterContamination"]] # lane contamination # 1 s_1_sequence.txt 0.01171875 # 2 s_1_sequence.txt_a 0.01171875 ### but looking at adapterContamination using ppnCount (as called by report(qa)) gives wrong answer - it's dividing the counts in column 2 by the names (as factors) in column 1. ppnCount should check for column 1 being numeric as well as just not null ShortRead:::.ppnCount(qaboth[["adapterContamination"]]) # lane contamination # 1 s_1_sequence.txt 0.012 # 2 s_1_sequence.txt_a 0.006 sessionInfo() R version 2.14.0 (2011-10-31) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.12.0 latticeExtra_0.6-19 RColorBrewer_1.0-5 Rsamtools_1.6.1 [5] lattice_0.20-0 Biostrings_2.22.0 GenomicRanges_1.6.2 IRanges_1.12.1 loaded via a namespace (and not attached): [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.0 [5] hwriter_1.3 RCurl_1.7-0 rtracklayer_1.14.2 tools_2.14.0 [9] XML_3.4-3 zlibbioc_1.0.0

• 887 views

ADD COMMENT • link updated 12.4 years ago by Martin Morgan 25k • written 12.4 years ago by Janet Young ▴ 740

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 4 days ago

United States

On 11/16/2011 05:32 PM, Janet Young wrote: > Hi Martin (seems like you will be the one to pick this up), > > I found an obscure bug in ShortRead:::.ppnCount - it'll be an easy > fix, I think. I hope the code below should explain all, but if not > let me know. Thanks Janet should be fixed when 1.12.1 or 1.13.3 makes it to the outside world. Martin > > thanks, > > Janet > > > > library(ShortRead) > > sp<- SolexaPath(system.file("extdata", package = "ShortRead")) > file1<- file.path(analysisPath(sp), pattern="s_1_sequence.txt") > > ### take the most common sequence as a fake adaptor sequence seqs1<- > readFastq(file1) fakeAdaptor1<- > names(which(table(as.character(sread(seqs1)))==max(table(as.characte r(sread(seqs1)))))[1]) > > #### run qa qa1<- qa(readFastq(file1), basename(file1), > Lpattern=fakeAdaptor1) > > #### run qa forcing it to think sample name is different qa1a<- > qa(readFastq(file1), paste(basename(file1),"_a",sep=""), > Lpattern=fakeAdaptor1) > > ### combine qaboth<- do.call(rbind, list(qa1,qa1a) ) > > ### looking directly at adapterContamination gives correct answer > (same contamination level in both) qaboth[["adapterContamination"]] > > # lane contamination # 1 s_1_sequence.txt > 0.01171875 # 2 s_1_sequence.txt_a 0.01171875 > > ### but looking at adapterContamination using ppnCount (as called by > report(qa)) gives wrong answer - it's dividing the counts in column 2 > by the names (as factors) in column 1. ppnCount should check for > column 1 being numeric as well as just not null > > ShortRead:::.ppnCount(qaboth[["adapterContamination"]]) > > # lane contamination # 1 s_1_sequence.txt > 0.012 # 2 s_1_sequence.txt_a 0.006 > > > > sessionInfo() > > R version 2.14.0 (2011-10-31) Platform: i386-apple-darwin9.8.0/i386 > (32-bit) > > locale: [1] > en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: [1] stats graphics grDevices utils > datasets methods base > > other attached packages: [1] ShortRead_1.12.0 latticeExtra_0.6-19 > RColorBrewer_1.0-5 Rsamtools_1.6.1 [5] lattice_0.20-0 > Biostrings_2.22.0 GenomicRanges_1.6.2 IRanges_1.12.1 > > loaded via a namespace (and not attached): [1] Biobase_2.14.0 > bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.0 [5] hwriter_1.3 > RCurl_1.7-0 rtracklayer_1.14.2 tools_2.14.0 [9] XML_3.4-3 > zlibbioc_1.0.0 > > _______________________________________________ Bioconductor mailing > list Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the > archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD COMMENT • link 12.4 years ago Martin Morgan 25k

Login before adding your answer.