FastqSampler not sampling correctly in ShortRead
1
0
Entering edit mode
Marcus Davy ▴ 390
@marcus-davy-5153
Last seen 6.7 years ago
Hi, there appears to be a problem in FastqSampler() which seems to sample the first read at random, but then the next n-1 reads are always the same, so the next sample obtained with yield() is not independent. Is this a bug, or my code implementation wrong? ## From example(FastqSampler) sp <- SolexaPath(system.file('extdata', package='ShortRead')) fl <- file.path(analysisPath(sp), "s_1_sequence.txt") f <- FastqFile(fl) rfq <- readFastq(f) set.seed(1) f <- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) s1 <- yield(f) # sample of size n=50 set.seed(2) f <- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) s2 <- yield(f) # independent sample of size 50 ## Intersection length between samples always 50-1 length(intersect(id(s1), id(s2))) ## First read is different, the next 2:n reads are the same head(sread(s1)) head(sread(s2)) close(f) > head(sread(s1)) A DNAStringSet instance of length 6 width seq [1] 36 GTCTGGAAACGTACGGATTGTTCAGTAACTTTACTC # Different [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC #Same [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT # ... [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA # ... [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT # ... [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCG Same > head(sread(s2)) A DNAStringSet instance of length 6 width seq [1] 36 GTTTACGAATTAAATCGAAGTGGACTGCTGGGGGGA # Different [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC #Same [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT # ... [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA # ... [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT # ... [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCG #Same Marcus > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.14.2 latticeExtra_0.6-19 RColorBrewer_1.0-5 [4] Rsamtools_1.8.4 lattice_0.20-6 Biostrings_2.24.1 [7] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 stats4_2.15.0 [6] tools_2.15.0 zlibbioc_1.2.0 [[alternative HTML version deleted]]
• 1.2k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States
Thanks Marcus, this serious bug was introduced in ShortRead 1.12.3 / 1.13.9 on Dec 4/5 2011. It is fixed in ShortRead 1.4.3 / 1.5.4. Records 2:n were actually records 2:n in the file. FWIW in case there is confusion the intended way to draw independent samples is just f = open(FastqSample(fl, 50) s1 = yield(f); s2 = yield(f) close(f) Martin On 04/25/2012 12:50 AM, Marcus Davy wrote: > Hi, > there appears to be a problem in FastqSampler() which seems to sample the > first read at random, but then the next n-1 reads > are always the same, so the next sample obtained with yield() is not > independent. Is this a bug, or my code implementation wrong? > > ## From example(FastqSampler) > sp<- SolexaPath(system.file('extdata', package='ShortRead')) > fl<- file.path(analysisPath(sp), "s_1_sequence.txt") > > f<- FastqFile(fl) > rfq<- readFastq(f) > > set.seed(1) > f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) > s1<- yield(f) # sample of size n=50 > set.seed(2) > f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) > s2<- yield(f) # independent sample of size 50 > > > ## Intersection length between samples always 50-1 > length(intersect(id(s1), id(s2))) > > ## First read is different, the next 2:n reads are the same > head(sread(s1)) > head(sread(s2)) > close(f) > >> head(sread(s1)) > A DNAStringSet instance of length 6 > width seq > [1] 36 GTCTGGAAACGTACGGATTGTTCAGTAACTTTACTC # Different > [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC #Same > [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT # ... > [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA # ... > [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT # ... > [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCG Same >> head(sread(s2)) > A DNAStringSet instance of length 6 > width seq > [1] 36 GTTTACGAATTAAATCGAAGTGGACTGCTGGGGGGA # Different > [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC #Same > [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT # ... > [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA # ... > [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT # ... > [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCG #Same > > Marcus > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.14.2 latticeExtra_0.6-19 RColorBrewer_1.0-5 > [4] Rsamtools_1.8.4 lattice_0.20-6 Biostrings_2.24.1 > [7] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 > stats4_2.15.0 > [6] tools_2.15.0 zlibbioc_1.2.0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
ADD COMMENT
0
Entering edit mode
Hi Martin, thanks for the prompt bug fix. I was wondering if there is a mechanism to allow a seed for reproducible random sample generation, and if so how that might be implemented, set.seed() or otherwise. cheers, Marcus On Wed, Apr 25, 2012 at 11:43 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > Thanks Marcus, this serious bug was introduced in ShortRead 1.12.3 / > 1.13.9 on Dec 4/5 2011. It is fixed in ShortRead 1.4.3 / 1.5.4. Records 2:n > were actually records 2:n in the file. > > FWIW in case there is confusion the intended way to draw independent > samples is just > > f = open(FastqSample(fl, 50) > s1 = yield(f); s2 = yield(f) > close(f) > > Martin > > > On 04/25/2012 12:50 AM, Marcus Davy wrote: > >> Hi, >> there appears to be a problem in FastqSampler() which seems to sample the >> first read at random, but then the next n-1 reads >> are always the same, so the next sample obtained with yield() is not >> independent. Is this a bug, or my code implementation wrong? >> >> ## From example(FastqSampler) >> sp<- SolexaPath(system.file('**extdata', package='ShortRead')) >> fl<- file.path(analysisPath(sp), "s_1_sequence.txt") >> >> f<- FastqFile(fl) >> rfq<- readFastq(f) >> >> set.seed(1) >> f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) >> s1<- yield(f) # sample of size n=50 >> set.seed(2) >> f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) >> s2<- yield(f) # independent sample of size 50 >> >> >> ## Intersection length between samples always 50-1 >> length(intersect(id(s1), id(s2))) >> >> ## First read is different, the next 2:n reads are the same >> head(sread(s1)) >> head(sread(s2)) >> close(f) >> >> head(sread(s1)) >>> >> A DNAStringSet instance of length 6 >> width seq >> [1] 36 GTCTGGAAACGTACGGATTGTTCAGTAACT**TTACTC # Different >> [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCA**TCGGAC #Same >> [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAAT**TTGGGT # ... >> [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAG**GTAAAA # ... >> [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGA**AGGCTT # ... >> [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTC**GGGGCG Same >> >>> head(sread(s2)) >>> >> A DNAStringSet instance of length 6 >> width seq >> [1] 36 GTTTACGAATTAAATCGAAGTGGACTGCTG**GGGGGA # Different >> [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCA**TCGGAC #Same >> [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAAT**TTGGGT # ... >> [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAG**GTAAAA # ... >> [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGA**AGGCTT # ... >> [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTC**GGGGCG #Same >> >> Marcus >> >> sessionInfo() >>> >> R version 2.15.0 (2012-03-30) >> Platform: x86_64-apple-darwin9.8.0/x86_**64 (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] ShortRead_1.14.2 latticeExtra_0.6-19 RColorBrewer_1.0-5 >> [4] Rsamtools_1.8.4 lattice_0.20-6 Biostrings_2.24.1 >> [7] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 >> stats4_2.15.0 >> [6] tools_2.15.0 zlibbioc_1.2.0 >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
On 04/27/2012 12:07 PM, Marcus Davy wrote: > Hi Martin, > thanks for the prompt bug fix. > > I was wondering if there is a mechanism to allow a seed for reproducible > random sample generation, and if so how that might be implemented, > set.seed() or otherwise. Hi Marcus -- FastqSampler uses R's random number stream so after example(FastqSampler) > f = open(FastqSampler(fl, 50)) > set.seed(123); s1 = yield(f); set.seed(123); s2 = yield(f) > identical(s1, s2) [1] TRUE > packageVersion("ShortRead") [1] '1.14.3' Martin > > cheers, > > Marcus > > > On Wed, Apr 25, 2012 at 11:43 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> <mailto:mtmorgan at="" fhcrc.org="">> wrote: > > Thanks Marcus, this serious bug was introduced in ShortRead 1.12.3 / > 1.13.9 on Dec 4/5 2011. It is fixed in ShortRead 1.4.3 / 1.5.4. > Records 2:n were actually records 2:n in the file. > > FWIW in case there is confusion the intended way to draw independent > samples is just > > f = open(FastqSample(fl, 50) > s1 = yield(f); s2 = yield(f) > close(f) > > Martin > > > On 04/25/2012 12:50 AM, Marcus Davy wrote: > > Hi, > there appears to be a problem in FastqSampler() which seems to > sample the > first read at random, but then the next n-1 reads > are always the same, so the next sample obtained with yield() is not > independent. Is this a bug, or my code implementation wrong? > > ## From example(FastqSampler) > sp<- SolexaPath(system.file('__extdata', package='ShortRead')) > fl<- file.path(analysisPath(sp), "s_1_sequence.txt") > > f<- FastqFile(fl) > rfq<- readFastq(f) > > set.seed(1) > f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) > s1<- yield(f) # sample of size n=50 > set.seed(2) > f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE) > s2<- yield(f) # independent sample of size 50 > > > ## Intersection length between samples always 50-1 > length(intersect(id(s1), id(s2))) > > ## First read is different, the next 2:n reads are the same > head(sread(s1)) > head(sread(s2)) > close(f) > > head(sread(s1)) > > A DNAStringSet instance of length 6 > width seq > [1] 36 GTCTGGAAACGTACGGATTGTTCAGTAACT__TTACTC # Different > [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCA__TCGGAC #Same > [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAAT__TTGGGT # ... > [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAG__GTAAAA # ... > [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGA__AGGCTT # ... > [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTC__GGGGCG Same > > head(sread(s2)) > > A DNAStringSet instance of length 6 > width seq > [1] 36 GTTTACGAATTAAATCGAAGTGGACTGCTG__GGGGGA # Different > [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCA__TCGGAC #Same > [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAAT__TTGGGT # ... > [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAG__GTAAAA # ... > [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGA__AGGCTT # ... > [6] 36 GTTCTCTAAAAACCATTTTTCGTCCCCTTC__GGGGCG #Same > > Marcus > > sessionInfo() > > R version 2.15.0 (2012-03-30) > Platform: x86_64-apple-darwin9.8.0/x86___64 (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.14.2 latticeExtra_0.6-19 RColorBrewer_1.0-5 > [4] Rsamtools_1.8.4 lattice_0.20-6 Biostrings_2.24.1 > [7] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 > stats4_2.15.0 > [6] tools_2.15.0 zlibbioc_1.2.0 > > [[alternative HTML version deleted]] > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 <tel:206%20667-2793> > > -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
ADD REPLY

Login before adding your answer.

Traffic: 647 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6