writeFastq writing dashes instead of dots
2
0
Entering edit mode
@thomas-rensch-5791
Last seen 9.6 years ago
Hi everyone, I am reading and writing fastq files and writeFastq just swaps dots ('.') for dashes ('-'). Is this the desired behaviour of writeFastq and if so why? Otherwise could someone better at R development than I modify this? Example fastq: @HWI-EAS149_3:1:1:0:1956:0:1:0 .A.................................. + BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB @HWI-EAS149_3:1:1:0:173:0:1:0 .T.................................. + BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB @HWI-EAS149_3:1:1:0:47:0:1:0 .T.................................. + BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB Thanks a lot, Thomas -- Thomas Rensch PhD Student - Paul Flicek Group EMBL-EBI
• 1.0k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 5 days ago
United States
Hi Thomas -- On 02/25/2013 09:08 AM, Thomas Rensch wrote: > Hi everyone, > > I am reading and writing fastq files and writeFastq just swaps dots ('.') for dashes ('-'). > > Is this the desired behaviour of writeFastq and if so why? Otherwise could someone better at R development than I modify this? > > Example fastq: > > > @HWI-EAS149_3:1:1:0:1956:0:1:0 > .A.................................. > + > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB > @HWI-EAS149_3:1:1:0:173:0:1:0 > .T.................................. > + > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB > @HWI-EAS149_3:1:1:0:47:0:1:0 > .T.................................. > + > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB It's actually on input > sread(readFastq("tmp.fastq")) A DNAStringSet instance of length 3 width seq [1] 36 -A---------------------------------- [2] 36 -T---------------------------------- [3] 36 -T---------------------------------- because '.' is not an a valid letter for DNAStringSet (or from the international standard, if I understand correctly...) > DNAStringSet(".") Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), : key 46 (char '.') not in lookup table > alphabet(DNAStringSet()) [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" > DNA_ALPHABET [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" and from ?DNA_ALPHABET This alphabet contains all letters from the IUPAC Extended Genetic Alphabet (see '?IUPAC_CODE_MAP') + the gap ('"-"') and the hard masking ('"+"') letters. One possibility would be to add an option writeFastq(..., dashesASdots=FALSE), is that really a good idea? Martin > > > Thanks a lot, > Thomas > > -- > Thomas Rensch > PhD Student - Paul Flicek Group > EMBL-EBI > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
Hi Martin, I am not sure what is the best way of solving this, however it seems that dashes are not part of the FASTQ format (http://maq.sourceforge.net/fastq.shtml) which implies the writeFastq function is "wrong". If such as method was written ( writeFastq(..., dashesASdots=FALSE) ) I think it the default value should be TRUE. Again in my opinion it is a standard assumption to assume that a writeFastq function should always write a valid FASTQ file no? Best, Thomas -- Thomas Rensch PhD Student - Paul Flicek Group EMBL - European Bioinformatics Institute Wellcome Trust Genome Campus Hinxton, Cambridge CB10 1SD United Kingdom On 26 Feb 2013, at 18:38, Martin Morgan <mtmorgan@fhcrc.org> wrote: > Hi Thomas -- > > On 02/25/2013 09:08 AM, Thomas Rensch wrote: >> Hi everyone, >> >> I am reading and writing fastq files and writeFastq just swaps dots ('.') for dashes ('-'). >> >> Is this the desired behaviour of writeFastq and if so why? Otherwise could someone better at R development than I modify this? >> >> Example fastq: >> >> >> @HWI-EAS149_3:1:1:0:1956:0:1:0 >> .A.................................. >> + >> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >> @HWI-EAS149_3:1:1:0:173:0:1:0 >> .T.................................. >> + >> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >> @HWI-EAS149_3:1:1:0:47:0:1:0 >> .T.................................. >> + >> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB > > It's actually on input > > > sread(readFastq("tmp.fastq")) > A DNAStringSet instance of length 3 > width seq > [1] 36 -A---------------------------------- > [2] 36 -T---------------------------------- > [3] 36 -T---------------------------------- > > because '.' is not an a valid letter for DNAStringSet (or from the international standard, if I understand correctly...) > > > DNAStringSet(".") > Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), : > key 46 (char '.') not in lookup table > > alphabet(DNAStringSet()) > [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" > > DNA_ALPHABET > [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" > > and from ?DNA_ALPHABET > > This alphabet contains all letters from the IUPAC Extended Genetic > Alphabet (see '?IUPAC_CODE_MAP') + the gap ('"-"') and the hard > masking ('"+"') letters. > > One possibility would be to add an option writeFastq(..., dashesASdots=FALSE), is that really a good idea? > > Martin > >> >> >> Thanks a lot, >> Thomas >> >> -- >> Thomas Rensch >> PhD Student - Paul Flicek Group >> EMBL-EBI >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
@martin-morgan-1513
Last seen 5 days ago
United States
On 02/27/2013 03:11 AM, Thomas Rensch wrote: > Hi Martin, > > I am not sure what is the best way of solving this, however it seems that dashes > are not part of the FASTQ format (http://maq.sourceforge.net/fastq.shtml) which > implies the writeFastq function is "wrong". If such as method was written > ( writeFastq(..., dashesASdots=FALSE) ) I think it the default value should be > TRUE. > > Again in my opinion it is a standard assumption to assume that a writeFastq > function should always write a valid FASTQ file no? Yes, it should read and write valid fastq, and the Bioc representation has obviously diverged from (this) standard. It seems like the class for representing fastq sequences needs to be generalized, probably to BStringSet from DNAStringSet, probably both in ShortRead and Rsamtools::scanBam. This would mean loss of some functionality (e.g., complement(), translate()) which don't make sense for a more general alphabet, and some unnecessary complexity (e.g., alphabetFrequency() would return frequencies of all 256 letters). It seems a shame to lose this, when it is clear that many fastq and bam files contain IUPAC-consistent sequences. My temptation is to make the input type user-selectable, c("DNAStringSet", "BStringSet") with the former the default. Your '.' would then be handled correctly in a round-trip. What does the broader community use non-IUPAC symbols for, including '.' and '~' but (paradoxically?) not '-'? Martin > > Best, > Thomas > > -- > Thomas Rensch > PhD Student - Paul Flicek Group > EMBL - European Bioinformatics Institute > Wellcome Trust Genome Campus > Hinxton, Cambridge > CB10 1SD > United Kingdom > > On 26 Feb 2013, at 18:38, Martin Morgan <mtmorgan at="" fhcrc.org=""> <mailto:mtmorgan at="" fhcrc.org="">> wrote: > >> Hi Thomas -- >> >> On 02/25/2013 09:08 AM, Thomas Rensch wrote: >>> Hi everyone, >>> >>> I am reading and writing fastq files and writeFastq just swaps dots ('.') for >>> dashes ('-'). >>> >>> Is this the desired behaviour of writeFastq and if so why? Otherwise could >>> someone better at R development than I modify this? >>> >>> Example fastq: >>> >>> >>> @HWI-EAS149_3:1:1:0:1956:0:1:0 >>> .A.................................. >>> + >>> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >>> @HWI-EAS149_3:1:1:0:173:0:1:0 >>> .T.................................. >>> + >>> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >>> @HWI-EAS149_3:1:1:0:47:0:1:0 >>> .T.................................. >>> + >>> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >> >> It's actually on input >> >> > sread(readFastq("tmp.fastq")) >> A DNAStringSet instance of length 3 >> width seq >> [1] 36 -A---------------------------------- >> [2] 36 -T---------------------------------- >> [3] 36 -T---------------------------------- >> >> because '.' is not an a valid letter for DNAStringSet (or from the >> international standard, if I understand correctly...) >> >> > DNAStringSet(".") >> Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), : >> key 46 (char '.') not in lookup table >> > alphabet(DNAStringSet()) >> [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" >> > DNA_ALPHABET >> [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" >> >> and from ?DNA_ALPHABET >> >> This alphabet contains all letters from the IUPAC Extended Genetic >> Alphabet (see '?IUPAC_CODE_MAP') + the gap ('"-"') and the hard >> masking ('"+"') letters. >> >> One possibility would be to add an option writeFastq(..., dashesASdots=FALSE), >> is that really a good idea? >> >> Martin >> >>> >>> >>> Thanks a lot, >>> Thomas >>> >>> -- >>> Thomas Rensch >>> PhD Student - Paul Flicek Group >>> EMBL-EBI >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT

Login before adding your answer.

Traffic: 787 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6