readFastq, writeFastq, compressed files and unix pipes.
@ivan-gregoretti-3975
Last seen 9.6 years ago
Canada
Hello ShortRead developers,

I recently tried to create an R script that could run as an ordinary programme at the command line. It was something like

    $ R --vanilla --slave -f ./myscript.R --input=input.fastq.gz --output=output.fastq.gz

I tried and failed, but I learned something in the process: readFastq() and writeFastq() are not symmetrical. Specifically:

1) readFastq() can tell the difference between a plain text FASTQ file and a gzipped FASTQ file (.gz). writeFastq(), by contrast, always outputs plain text regardless of the suffix passed.

2) writeFastq() understands unix pipes because it conveniently accepts the file argument "/dev/stdout". readFastq(), by contrast, does not accept the file argument "/dev/stdin", so no pipes are possible.

It would be great if readFastq() and writeFastq() were equally smart. Please consider adding these functionalities.

Thank you,

Ivan

Ivan Gregoretti, PhD
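For the command-line side of such a script, here is a minimal sketch of reading `--input=` and `--output=` style options in base R. It assumes the script is launched with Rscript, so that everything after the script name arrives in `commandArgs(trailingOnly = TRUE)`. The helper name `parse_opts` is hypothetical and is not part of ShortRead.

```r
# Hypothetical helper (not part of ShortRead): turn "--key=value" arguments
# into a named list. Arguments that do not match "--key=value" are ignored.
parse_opts <- function(args = commandArgs(trailingOnly = TRUE)) {
    kv <- grep("^--[^=]+=", args, value = TRUE)
    vals <- sub("^--[^=]+=", "", kv)
    names(vals) <- sub("^--([^=]+)=.*$", "\\1", kv)
    as.list(vals)
}

# Example with an explicit argument vector instead of the real command line
opts <- parse_opts(c("--input=input.fastq.gz", "--output=output.fastq.gz"))
opts$input
opts$output
```

Called without arguments inside a script, `parse_opts()` would pick up whatever followed the script name on the command line.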
@martin-morgan-1513
Last seen 3 days ago
United States
Hi Ivan --

On 12/13/2012 01:22 AM, Ivan Gregoretti wrote:

> 1) readFastq() can tell the difference between a plain text FASTQ file
> and a gzipped FASTQ file (.gz). Unlike that, writeFastq() always
> outputs in plain text regardless of the suffix passed.

FastqStreamer is intended to work on R connections. Going by the note in ?stdin, a little googling, and tolerating a warning about an already open connection, one can read a compressed file from stdin using

    fin <- gzcon(file("stdin", "rb"))

(this seems to be the R paradigm, nothing special to fastq files) followed by

    strm <- FastqStreamer(fin)
    object <- yield(strm)

See ?FastqStreamer for how many records are read at a time.

> 2) writeFastq() understands unix pipes because it conveniently accepts
> the file argument "/dev/stdout". Unlike that, readFastq() does not
> accept the file argument "/dev/stdin", so, no pipes are possible.

For me, writeFastq(object, "/dev/stdout") fails with

    > writeFastq(object, "/dev/stdout")
    Error: UserArgumentMismatch
      file '/dev/stdout' exists, but mode is not 'a'

An easy way to get the full support of R's connections is

    setMethod(writeFastq, c("ShortReadQ", "connection"),
        function(object, file, mode="w", full = FALSE, ...)
    {
        ## four lines per record: "@id", sequence, "+" (or "+id"), quality
        ids <- as.character(id(object))
        outp <- character(length(object) * 4L)
        ## id() returns identifiers without the leading "@", so add it back
        outp[c(TRUE, FALSE, FALSE, FALSE)] <- paste0("@", ids)
        outp[c(FALSE, TRUE, FALSE, FALSE)] <- as.character(sread(object))
        ## full = TRUE repeats the identifier on the "+" line
        outp[c(FALSE, FALSE, TRUE, FALSE)] <- if (full) paste0("+", ids) else "+"
        outp[c(FALSE, FALSE, FALSE, TRUE)] <- as.character(quality(quality(object)))
        writeLines(outp, file)
    })

which has obvious limitations (for instance, the mode argument is ignored and the object must contain at least one record). Writing a compressed stream to /dev/stdout could be arranged with

    fout <- gzcon(file("/dev/stdout", "wb"))

(which seems like a hack; maybe there's a better R way?) and then

    writeFastq(object, fout)
    close(fout)

and processing a file would be

    fin <- gzcon(file("stdin", "rb"))
    fout <- gzcon(file("/dev/stdout", "wb"))

    strm <- FastqStreamer(fin)
    while(length(object <- yield(strm))) {
        writeFastq(object, fout)
    }
    close(fin); close(fout)

Piping to stdout seems like it will be problematic, e.g., if some R command writes to stdout then it will be inserted in the output stream. Maybe better to use a named pipe (fifo).

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
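The gzcon(file(...)) idiom used in this reply can be tried out without ShortRead. The following is a minimal base-R sketch that writes one FASTQ-style record through a compressing connection and reads it back through a decompressing one. The temporary file and the toy record are assumptions made purely for illustration.

```r
# Write four FASTQ-style lines through a gzip-compressing connection,
# then read them back through a decompressing one. Base R only.
path <- tempfile(fileext = ".fastq.gz")
record <- c("@read1", "ACGT", "+", "IIII")

fout <- gzcon(file(path, "wb"))   # compress on write
writeLines(record, fout)
close(fout)

fin <- gzcon(file(path, "rb"))    # decompress on read
roundtrip <- readLines(fin)
close(fin)

# A gzip stream starts with the magic bytes 1f 8b
magic <- readBin(path, what = "raw", n = 2L)
unlink(path)
```

Because gzcon() wraps any binary connection, the same pattern should carry over when file(path, ...) is swapped for file("stdin", "rb") or another connection generator.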
@martin-morgan-1513
Last seen 3 days ago
United States
Sorry, I sent a previous reply a little too quickly, meant to add...

On 12/13/2012 06:40 AM, Martin Morgan wrote:

> piping to stdout seems like it will be problematic, e.g., if some R command
> writes to stdout then it will be inserted in the output stream. Maybe better to
> use a named pipe (fifo)

Picking up here, in R I did

    library(ShortRead)
    fin <- gzcon(file("stdin", "rb"))
    fout = commandArgs(trailingOnly=TRUE)[[1]]

    strm <- FastqStreamer(fin)
    while(length(object <- yield(strm))) {
        writeFastq(object, fout, "a")
    }
    close(fin)

and used this from the shell as

    $ mkfifo pipe
    $ cat fastq.gz | Rscript --slave myfifo.R pipe & cat pipe | gzip -f | ...

and yes, I'll try to make read/writeFastq more symmetrical in how they behave.

Martin
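On the concern that stray R output would be inserted into a stream written to stdout: one common precaution is to send diagnostics with message(), which writes to stderr, rather than cat() or print(). This is a minimal base-R sketch, not something proposed in the thread. capture.output() appears only to show which stream each call uses.

```r
# cat() writes to stdout, so it would be mixed into data piped from stdout.
to_stdout <- capture.output(cat("progress: 10 records\n"))

# message() writes to stderr; capturing type = "message" shows it goes there.
to_stderr <- capture.output(message("progress: 10 records"), type = "message")

# Capturing stdout while calling message() picks up nothing.
nothing <- capture.output(message("progress: 10 records"))
```

Output printed by other packages is not covered by this, which is part of why the named pipe approach sidesteps the problem entirely.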
Thanks Martin. There is always something to learn from your replies.

Making readFastq/writeFastq more symmetrical would still be appreciated. The simpler the usage of a script, the smoother the passage to production.

Ivan

On 15 Dec 2012 01:03, "Martin Morgan" wrote:

> and yes, I'll try to make read/writeFastq more symmetrical in how they
> behave.