FastqSampler not sampling correctly in ShortRead

0

Entering edit mode

Marcus Davy ▴ 390

@marcus-davy-5153

Last seen 7.0 years ago

Hi, there appears to be a problem in FastqSampler() which seems to sample the first read at random, but then the next n-1 reads are always the same, so the next sample obtained with yield() is not independent. Is this a bug, or my code implementation wrong? ## From example(FastqSampler) sp <- SolexaPath(system.file('extdata', package='ShortRead')) fl <- file.path(analysisPath(sp), "s_1_sequence.txt") f <- FastqFile(fl) rfq <- readFastq(f) set.seed(1) f <- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) s1 <- yield(f) # sample of size n=50 set.seed(2) f <- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) s2 <- yield(f) # independent sample of size 50 ## Intersection length between samples always 50-1 length(intersect(id(s1), id(s2))) ## First read is different, the next 2:n reads are the same head(sread(s1)) head(sread(s2)) close(f) > head(sread(s1)) A DNAStringSet instance of length 6 width seq [1] 36 GTCTGGAAACGTACGGATTGTTCAGTAACTTTACTC # Different [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC #Same [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT # ... [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA # ... [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT # ... [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCG Same > head(sread(s2)) A DNAStringSet instance of length 6 width seq [1] 36 GTTTACGAATTAAATCGAAGTGGACTGCTGGGGGGA # Different [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC #Same [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT # ... [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA # ... [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT # ... [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCG #Same Marcus > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.14.2 latticeExtra_0.6-19 RColorBrewer_1.0-5 [4] Rsamtools_1.8.4 lattice_0.20-6 Biostrings_2.24.1 [7] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 stats4_2.15.0 [6] tools_2.15.0 zlibbioc_1.2.0 [[alternative HTML version deleted]]

• 1.2k views

ADD COMMENT • link updated 13.0 years ago by Martin Morgan 25k • written 13.0 years ago by Marcus Davy ▴ 390

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 11 weeks ago

United States

Thanks Marcus, this serious bug was introduced in ShortRead 1.12.3 / 1.13.9 on Dec 4/5 2011. It is fixed in ShortRead 1.4.3 / 1.5.4. Records 2:n were actually records 2:n in the file. FWIW in case there is confusion the intended way to draw independent samples is just f = open(FastqSample(fl, 50) s1 = yield(f); s2 = yield(f) close(f) Martin On 04/25/2012 12:50 AM, Marcus Davy wrote: > Hi, > there appears to be a problem in FastqSampler() which seems to sample the > first read at random, but then the next n-1 reads > are always the same, so the next sample obtained with yield() is not > independent. Is this a bug, or my code implementation wrong? > > ## From example(FastqSampler) > sp<- SolexaPath(system.file('extdata', package='ShortRead')) > fl<- file.path(analysisPath(sp), "s_1_sequence.txt") > > f<- FastqFile(fl) > rfq<- readFastq(f) > > set.seed(1) > f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) > s1<- yield(f) # sample of size n=50 > set.seed(2) > f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) > s2<- yield(f) # independent sample of size 50 > > > ## Intersection length between samples always 50-1 > length(intersect(id(s1), id(s2))) > > ## First read is different, the next 2:n reads are the same > head(sread(s1)) > head(sread(s2)) > close(f) > >> head(sread(s1)) > A DNAStringSet instance of length 6 > width seq > [1] 36 GTCTGGAAACGTACGGATTGTTCAGTAACTTTACTC # Different > [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC #Same > [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT # ... > [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA # ... > [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT # ... > [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCG Same >> head(sread(s2)) > A DNAStringSet instance of length 6 > width seq > [1] 36 GTTTACGAATTAAATCGAAGTGGACTGCTGGGGGGA # Different > [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC #Same > [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT # ... > [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA # ... > [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT # ... > [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCG #Same > > Marcus > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.14.2 latticeExtra_0.6-19 RColorBrewer_1.0-5 > [4] Rsamtools_1.8.4 lattice_0.20-6 Biostrings_2.24.1 > [7] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 > stats4_2.15.0 > [6] tools_2.15.0 zlibbioc_1.2.0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD COMMENT • link 13.0 years ago Martin Morgan 25k

0

Entering edit mode

Hi Martin, thanks for the prompt bug fix. I was wondering if there is a mechanism to allow a seed for reproducible random sample generation, and if so how that might be implemented, set.seed() or otherwise. cheers, Marcus On Wed, Apr 25, 2012 at 11:43 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > Thanks Marcus, this serious bug was introduced in ShortRead 1.12.3 / > 1.13.9 on Dec 4/5 2011. It is fixed in ShortRead 1.4.3 / 1.5.4. Records 2:n > were actually records 2:n in the file. > > FWIW in case there is confusion the intended way to draw independent > samples is just > > f = open(FastqSample(fl, 50) > s1 = yield(f); s2 = yield(f) > close(f) > > Martin > > > On 04/25/2012 12:50 AM, Marcus Davy wrote: > >> Hi, >> there appears to be a problem in FastqSampler() which seems to sample the >> first read at random, but then the next n-1 reads >> are always the same, so the next sample obtained with yield() is not >> independent. Is this a bug, or my code implementation wrong? >> >> ## From example(FastqSampler) >> sp<- SolexaPath(system.file('**extdata', package='ShortRead')) >> fl<- file.path(analysisPath(sp), "s_1_sequence.txt") >> >> f<- FastqFile(fl) >> rfq<- readFastq(f) >> >> set.seed(1) >> f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) >> s1<- yield(f) # sample of size n=50 >> set.seed(2) >> f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) >> s2<- yield(f) # independent sample of size 50 >> >> >> ## Intersection length between samples always 50-1 >> length(intersect(id(s1), id(s2))) >> >> ## First read is different, the next 2:n reads are the same >> head(sread(s1)) >> head(sread(s2)) >> close(f) >> >> head(sread(s1)) >>> >> A DNAStringSet instance of length 6 >> width seq >> [1] 36 GTCTGGAAACGTACGGATTGTTCAGTAACT**TTACTC # Different >> [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCA**TCGGAC #Same >> [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAAT**TTGGGT # ... >> [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAG**GTAAAA # ... >> [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGA**AGGCTT # ... >> [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTC**GGGGCG Same >> >>> head(sread(s2)) >>> >> A DNAStringSet instance of length 6 >> width seq >> [1] 36 GTTTACGAATTAAATCGAAGTGGACTGCTG**GGGGGA # Different >> [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCA**TCGGAC #Same >> [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAAT**TTGGGT # ... >> [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAG**GTAAAA # ... >> [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGA**AGGCTT # ... >> [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTC**GGGGCG #Same >> >> Marcus >> >> sessionInfo() >>> >> R version 2.15.0 (2012-03-30) >> Platform: x86_64-apple-darwin9.8.0/x86_**64 (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] ShortRead_1.14.2 latticeExtra_0.6-19 RColorBrewer_1.0-5 >> [4] Rsamtools_1.8.4 lattice_0.20-6 Biostrings_2.24.1 >> [7] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 >> stats4_2.15.0 >> [6] tools_2.15.0 zlibbioc_1.2.0 >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > [[alternative HTML version deleted]]

ADD REPLY • link 13.0 years ago Marcus Davy ▴ 390

0

Entering edit mode

On 04/27/2012 12:07 PM, Marcus Davy wrote: > Hi Martin, > thanks for the prompt bug fix. > > I was wondering if there is a mechanism to allow a seed for reproducible > random sample generation, and if so how that might be implemented, > set.seed() or otherwise. Hi Marcus -- FastqSampler uses R's random number stream so after example(FastqSampler) > f = open(FastqSampler(fl, 50)) > set.seed(123); s1 = yield(f); set.seed(123); s2 = yield(f) > identical(s1, s2) [1] TRUE > packageVersion("ShortRead") [1] '1.14.3' Martin > > cheers, > > Marcus > > > On Wed, Apr 25, 2012 at 11:43 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> <mailto:mtmorgan at="" fhcrc.org="">> wrote: > > Thanks Marcus, this serious bug was introduced in ShortRead 1.12.3 / > 1.13.9 on Dec 4/5 2011. It is fixed in ShortRead 1.4.3 / 1.5.4. > Records 2:n were actually records 2:n in the file. > > FWIW in case there is confusion the intended way to draw independent > samples is just > > f = open(FastqSample(fl, 50) > s1 = yield(f); s2 = yield(f) > close(f) > > Martin > > > On 04/25/2012 12:50 AM, Marcus Davy wrote: > > Hi, > there appears to be a problem in FastqSampler() which seems to > sample the > first read at random, but then the next n-1 reads > are always the same, so the next sample obtained with yield() is not > independent. Is this a bug, or my code implementation wrong? > > ## From example(FastqSampler) > sp<- SolexaPath(system.file('__extdata', package='ShortRead')) > fl<- file.path(analysisPath(sp), "s_1_sequence.txt") > > f<- FastqFile(fl) > rfq<- readFastq(f) > > set.seed(1) > f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) > s1<- yield(f) # sample of size n=50 > set.seed(2) > f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) > s2<- yield(f) # independent sample of size 50 > > > ## Intersection length between samples always 50-1 > length(intersect(id(s1), id(s2))) > > ## First read is different, the next 2:n reads are the same > head(sread(s1)) > head(sread(s2)) > close(f) > > head(sread(s1)) > > A DNAStringSet instance of length 6 > width seq > [1] 36 GTCTGGAAACGTACGGATTGTTCAGTAACT__TTACTC # Different > [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCA__TCGGAC #Same > [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAAT__TTGGGT # ... > [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAG__GTAAAA # ... > [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGA__AGGCTT # ... > [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTC__GGGGCG Same > > head(sread(s2)) > > A DNAStringSet instance of length 6 > width seq > [1] 36 GTTTACGAATTAAATCGAAGTGGACTGCTG__GGGGGA # Different > [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCA__TCGGAC #Same > [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAAT__TTGGGT # ... > [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAG__GTAAAA # ... > [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGA__AGGCTT # ... > [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTC__GGGGCG #Same > > Marcus > > sessionInfo() > > R version 2.15.0 (2012-03-30) > Platform: x86_64-apple-darwin9.8.0/x86___64 (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.14.2 latticeExtra_0.6-19 RColorBrewer_1.0-5 > [4] Rsamtools_1.8.4 lattice_0.20-6 Biostrings_2.24.1 > [7] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 > stats4_2.15.0 > [6] tools_2.15.0 zlibbioc_1.2.0 > > [[alternative HTML version deleted]] > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 <tel:206%20667-2793> > > -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD REPLY • link 13.0 years ago Martin Morgan 25k

Login before adding your answer.