readFasta and gzipped FASTA files

Ivan Gregoretti ▴ 310

@ivan-gregoretti-3975

Canada

Hello everybody,

The ShortRead package includes two very useful functions: readFastq() and readFasta().

While readFastq() can open FASTQ files as either plain text or gzipped files, readFasta() can only open plain-text files. For example:

    # FASTQ: success
    > readFastq("t01213R0QU.fq.gz")
    class: ShortReadQ
    length: 43608 reads; width: 178..486 cycles

    # FASTA: failure
    > readFasta("t01213R0QU.P.fa.gz")
    Error in .normargInputFilepath(filepath) :
      file "t01213R0QU.P.fa.gz" has unsupported type: gzfile

Is this the current behavior, or is it time for me to update my Bioconductor installation?

Can someone offer a workaround that does not involve decompressing the FASTA file to disk? I tried the following, without success:

    > readFasta(gzfile("t01213R0QU.P.fa.gz","r"))
    Error in function (classes, fdef, mtable) :
      unable to find an inherited method for function ‘readFasta’ for signature ‘"gzfile"’

Thank you,

Ivan

    > sessionInfo()
    R Under development (unstable) (2012-11-30 r61184)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=C                 LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods
    [8] base

    other attached packages:
    [1] Rsamtools_1.11.16     Biostrings_2.27.11    GenomicRanges_1.11.29
    [4] IRanges_1.17.32       BiocGenerics_0.5.6

    loaded via a namespace (and not attached):
    [1] bitops_1.0-5   stats4_2.16.0  tools_2.16.0   zlibbioc_1.5.0

ShortRead ShortRead • 6.0k views

ADD COMMENT • link updated 11.2 years ago by Martin Morgan 25k • written 11.2 years ago by Ivan Gregoretti ▴ 310

Martin Morgan 25k

@martin-morgan-1513

United States

On 2/14/2013 12:32 PM, Ivan Gregoretti wrote:

> Can someone offer a work around that does not involve decompressing
> the FASTA file to disc? I tried, yet unsuccessfully:

readFasta is just an alias for readDNAStringSet, which does not support compressed files.

I'm not sure exactly what is in your FASTA file, but Rsamtools::FaFile might be what you are looking for (it is typically used for relatively few long sequences). You need to create an index (using indexFa) if one does not already exist; then something along the lines of

    fa = FaFile("some.fa.gz")  ## index is some.fa.gz.
    scanFa(fa, param=scanFaIndex(fa))

would read the whole file, or you could supply a more selective GRanges to go after particular sub-sequences.

A workaround might also involve the approach in http://comments.gmane.org/gmane.comp.lang.r.sequencing/1981

    ## assumes every record is exactly two lines: header, then sequence
    all <- readLines("s_1.fa.gz")
    sread <- DNAStringSet(all[c(FALSE, TRUE)])  ## even lines: sequences
    id <- BStringSet(all[c(TRUE, FALSE)])       ## odd lines: headers
    fas <- ShortRead(sread=sread, id=id)

--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

ADD COMMENT • link 11.2 years ago Martin Morgan 25k
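
The index recycling in the readLines() workaround can be seen on a toy file. The following is a minimal sketch in base R only; the file contents and the variable names `all_lines`, `ids`, and `seqs` are made up for illustration, and it assumes every record is exactly two lines (header, then sequence). It also strips the leading ">" from the headers, which the workaround as posted leaves in place.

```r
# Toy gzipped FASTA with strictly two lines per record (illustrative data).
path <- tempfile(fileext = ".fa.gz")
con <- gzfile(path, "w")
writeLines(c(">read1", "ACGT", ">read2", "GGCC"), con)
close(con)

# Given a path, readLines() opens it with file(), which detects gzip
# compression when reading, so no explicit decompression step is needed.
all_lines <- readLines(path)

# The length-2 logical index is recycled along the vector:
# odd lines are headers, even lines are sequences.
ids  <- sub("^>", "", all_lines[c(TRUE, FALSE)])  # drop the leading ">"
seqs <- all_lines[c(FALSE, TRUE)]

# With Biostrings and ShortRead attached, these could then be wrapped as
#   ShortRead(sread = DNAStringSet(seqs), id = BStringSet(ids))
```

If any record wraps its sequence over more than one line, the odd/even split silently misassigns lines, so it is worth checking that every element of `all_lines[c(TRUE, FALSE)]` starts with ">" before trusting the result.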

Thank you Martin. I realize now that I have faced this problem before and that I can work around it with the ShortRead constructor.

I appreciate it.

Ivan

ADD REPLY • link 11.2 years ago Ivan Gregoretti ▴ 310

Actually, now I also remember what problems I encountered with that strategy: not all my FASTA records are limited to 2 lines. Some sequences are long and span 6 lines. So, I can't take advantage of the elegant solution

    sread <- DNAStringSet(all[c(FALSE, TRUE)])

I'll have to think about a strategy where I read the file one line at a time while I identify the start of a new record every time the character ">" is found at the beginning of the line.

Still, thank you.

Ivan

ADD REPLY • link 11.2 years ago Ivan Gregoretti ▴ 310
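
One way the ">"-based record detection described above might look is sketched below in base R. This is not a ShortRead or Biostrings function: `read_fasta_gz` is a hypothetical helper name, the toy file contents are invented, and the sketch reads the whole file into memory rather than one line at a time. The rule is the same one Ivan describes, applied to all lines at once: each line starting with ">" opens a new record.

```r
# Toy gzipped FASTA where one record spans several lines (illustrative data).
path <- tempfile(fileext = ".fa.gz")
con <- gzfile(path, "w")
writeLines(c(">long", "ACGT", "TTAA", "GG", ">short", "CCCC"), con)
close(con)

# Hypothetical helper: returns a character vector of sequences named by
# their header lines (leading ">" removed).
read_fasta_gz <- function(path) {
  lines <- readLines(path)             # gzip is detected automatically
  lines <- lines[nzchar(lines)]        # ignore blank lines
  is_header <- grepl("^>", lines)
  stopifnot(length(lines) > 0, is_header[1])

  # cumsum() over the header flags gives each line its record number.
  record <- cumsum(is_header)
  n_records <- sum(is_header)

  # Group sequence lines by record; the explicit factor levels keep
  # records that have a header but no sequence lines.
  groups <- split(lines[!is_header],
                  factor(record[!is_header], levels = seq_len(n_records)))
  seqs <- vapply(groups, paste, character(1), collapse = "")
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

fa <- read_fasta_gz(path)

# With Biostrings and ShortRead attached, the result could be wrapped as
#   ShortRead(sread = DNAStringSet(unname(fa)), id = BStringSet(names(fa)))
```

For files too large to hold in memory as text, the same rule could be applied to chunks read from an open gzfile() connection, carrying the last partial record over to the next chunk.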