FastqSampler sort by ID
1
0
Entering edit mode
Marcus Davy ▴ 390
@marcus-davy-5153
Last seen 6.1 years ago
Hi Martin, when the FastqSampler uniformly samples from Fastq files, it appears that the header information in the id() accessor of a ShortReadQ object is not sorted in the sample with respect to tile information, as you would expect in the original source files. I was wondering whether an argument (e.g. sortByID=TRUE/FALSE) would be worth implementing in FastqSampler to specify whether to sort the sampled reads by their IDs rather than sorting manually after taking a sample? cheers, Marcus ## Using ShortRead example sp <- SolexaPath(system.file('extdata', package='ShortRead')) fl <- file.path(analysisPath(sp), "s_1_sequence.txt") ## Load Fastq data to change id information rfq <- readFastq(fl) ## No tile information in example id(rfq) ## Make up some new headers in the form; ## @HWI-EAS88:187:C0101VACXX:2:2206:9399:51464 1:N:0: tile <- rep(c(1001:(1001+7)), each=length(rfq)/8) x <- gsub(".+_(\\d+)_\\d+$", "\\1", id(rfq)) y <- gsub(".+_\\d+_(\\d+)$", "\\1", id(rfq)) ids <- paste("@HWI-EAS88:187:C0101VACXX:2:", tile, ":", x, ":", y, " 1:N:0:", sep="") ## Hack overlaying ids with additional tile information rfq@id <- BStringSet(ids) ## Write to a tmp file writeFastq(rfq, "/tmp/example.fastq") f <- open(FastqSampler("/tmp/example.fastq", 50)) ## Sample set.seed(42) fqSample <- yield(f) close(f) ## Tile information is randomly ordered id(fqSample) sapply(strsplit(as.character(id(fqSample)), ":"), function(x)x[5]) ## Manually sorting the ids o <- srorder(id(fqSample)) fqSorted <- fqSample[o] id(fqSorted) > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.16.1 latticeExtra_0.6-24 RColorBrewer_1.0-5 [4] Rsamtools_1.10.1 lattice_0.20-10 Biostrings_2.26.2 [7] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] Biobase_2.18.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 [5] parallel_2.15.0 stats4_2.15.0 tools_2.15.0 zlibbioc_1.4.0 [[alternative HTML version deleted]]
ShortRead ShortRead • 1.2k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 17 days ago
United States
Hi Marcus -- On 11/13/2012 3:11 PM, Marcus Davy wrote: > Hi Martin, > when the FastqSampler uniformly samples from Fastq files, it appears that the > header information in the id() accessor of a ShortReadQ object is not sorted in > the sample with respect to tile information, as you would expect in the original > source files. > > I was wondering whether an argument (e.g. sortByID=TRUE/FALSE) would be worth > implementing in FastqSampler to specify whether to sort the sampled reads by > their IDs rather than sorting manually after taking a sample? > > cheers, > > Marcus > > > ## Using ShortRead example > sp <- SolexaPath(system.file('extdata', package='ShortRead')) > fl <- file.path(analysisPath(sp), "s_1_sequence.txt") > > ## Load Fastq data to change id information > rfq <- readFastq(fl) > > ## No tile information in example > id(rfq) actually, from ?readFastq, argument withIds=TRUE will return ids. In ShortRead 1.17.4 (in the devel branch) I've added an 'ordered' argument to FastqSampler, with default FALSE but when set to TRUE returns the reads in the order encountered in the file. Thanks for the suggestion. Martin > > ## Make up some new headers in the form; > ## @HWI-EAS88:187:C0101VACXX:2:2206:9399:51464 1:N:0: > tile <- rep(c(1001:(1001+7)), each=length(rfq)/8) > x <- gsub(".+_(\\d+)_\\d+$", "\\1", id(rfq)) > y <- gsub(".+_\\d+_(\\d+)$", "\\1", id(rfq)) > ids <- paste("@HWI-EAS88:187:C0101VACXX:2:", tile, ":", x, ":", y, " 1:N:0:", > sep="") > > ## Hack overlaying ids with additional tile information > rfq at id <- BStringSet(ids) > > ## Write to a tmp file > writeFastq(rfq, "/tmp/example.fastq") > f <- open(FastqSampler("/tmp/example.fastq", 50)) > > ## Sample > set.seed(42) > fqSample <- yield(f) > close(f) > > ## Tile information is randomly ordered > id(fqSample) > sapply(strsplit(as.character(id(fqSample)), ":"), function(x)x[5]) > > ## Manually sorting the ids > o <- srorder(id(fqSample)) > fqSorted <- fqSample[o] > id(fqSorted) > > > sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.16.1 latticeExtra_0.6-24 RColorBrewer_1.0-5 > [4] Rsamtools_1.10.1 lattice_0.20-10 Biostrings_2.26.2 > [7] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.18.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 > [5] parallel_2.15.0 stats4_2.15.0 tools_2.15.0 zlibbioc_1.4.0 -- Dr. Martin Morgan, PhD Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
ADD COMMENT
0
Entering edit mode
thanks for implementing, it was a small request, certainly useful if working with illumina data that contains quality problems in some tiles. Marcus On Thu, Nov 15, 2012 at 5:55 AM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > Hi Marcus -- > > > On 11/13/2012 3:11 PM, Marcus Davy wrote: > >> Hi Martin, >> when the FastqSampler uniformly samples from Fastq files, it appears that >> the >> header information in the id() accessor of a ShortReadQ object is not >> sorted in >> the sample with respect to tile information, as you would expect in the >> original >> source files. >> >> I was wondering whether an argument (e.g. sortByID=TRUE/FALSE) would be >> worth >> implementing in FastqSampler to specify whether to sort the sampled reads >> by >> their IDs rather than sorting manually after taking a sample? >> >> cheers, >> >> Marcus >> >> >> ## Using ShortRead example >> sp <- SolexaPath(system.file('**extdata', package='ShortRead')) >> fl <- file.path(analysisPath(sp), "s_1_sequence.txt") >> >> ## Load Fastq data to change id information >> rfq <- readFastq(fl) >> >> ## No tile information in example >> id(rfq) >> > > actually, from ?readFastq, argument withIds=TRUE will return ids. > > In ShortRead 1.17.4 (in the devel branch) I've added an 'ordered' argument > to FastqSampler, with default FALSE but when set to TRUE returns the reads > in the order encountered in the file. > > Thanks for the suggestion. > > Martin > > > >> ## Make up some new headers in the form; >> ## @HWI-EAS88:187:C0101VACXX:2:**2206:9399:51464 1:N:0: >> tile <- rep(c(1001:(1001+7)), each=length(rfq)/8) >> x <- gsub(".+_(\\d+)_\\d+$", "\\1", id(rfq)) >> y <- gsub(".+_\\d+_(\\d+)$", "\\1", id(rfq)) >> ids <- paste("@HWI-EAS88:187:**C0101VACXX:2:", tile, ":", x, ":", y, " >> 1:N:0:", >> sep="") >> >> ## Hack overlaying ids with additional tile information >> rfq@id <- BStringSet(ids) >> >> ## Write to a tmp file >> writeFastq(rfq, "/tmp/example.fastq") >> f <- open(FastqSampler("/tmp/**example.fastq", 50)) >> >> ## Sample >> set.seed(42) >> fqSample <- yield(f) >> close(f) >> >> ## Tile information is randomly ordered >> id(fqSample) >> sapply(strsplit(as.character(**id(fqSample)), ":"), function(x)x[5]) >> >> ## Manually sorting the ids >> o <- srorder(id(fqSample)) >> fqSorted <- fqSample[o] >> id(fqSorted) >> >> > sessionInfo() >> R version 2.15.0 (2012-03-30) >> Platform: x86_64-apple-darwin9.8.0/x86_**64 (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] ShortRead_1.16.1 latticeExtra_0.6-24 RColorBrewer_1.0-5 >> [4] Rsamtools_1.10.1 lattice_0.20-10 Biostrings_2.26.2 >> [7] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.18.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 >> [5] parallel_2.15.0 stats4_2.15.0 tools_2.15.0 zlibbioc_1.4.0 >> > > > -- > Dr. Martin Morgan, PhD > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 744 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6