writeFastq writing dashes instead of dots

0

Entering edit mode

Thomas Rensch ▴ 20

@thomas-rensch-5791

Last seen 9.6 years ago

Hi everyone, I am reading and writing fastq files and writeFastq just swaps dots ('.') for dashes ('-'). Is this the desired behaviour of writeFastq and if so why? Otherwise could someone better at R development than I modify this? Example fastq: @HWI-EAS149_3:1:1:0:1956:0:1:0 .A.................................. + BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB @HWI-EAS149_3:1:1:0:173:0:1:0 .T.................................. + BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB @HWI-EAS149_3:1:1:0:47:0:1:0 .T.................................. + BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB Thanks a lot, Thomas -- Thomas Rensch PhD Student - Paul Flicek Group EMBL-EBI

• 1.0k views

ADD COMMENT • link updated 11.2 years ago by Martin Morgan 25k • written 11.2 years ago by Thomas Rensch ▴ 20

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 5 days ago

United States

Hi Thomas -- On 02/25/2013 09:08 AM, Thomas Rensch wrote: > Hi everyone, > > I am reading and writing fastq files and writeFastq just swaps dots ('.') for dashes ('-'). > > Is this the desired behaviour of writeFastq and if so why? Otherwise could someone better at R development than I modify this? > > Example fastq: > > > @HWI-EAS149_3:1:1:0:1956:0:1:0 > .A.................................. > + > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB > @HWI-EAS149_3:1:1:0:173:0:1:0 > .T.................................. > + > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB > @HWI-EAS149_3:1:1:0:47:0:1:0 > .T.................................. > + > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB It's actually on input > sread(readFastq("tmp.fastq")) A DNAStringSet instance of length 3 width seq [1] 36 -A---------------------------------- [2] 36 -T---------------------------------- [3] 36 -T---------------------------------- because '.' is not an a valid letter for DNAStringSet (or from the international standard, if I understand correctly...) > DNAStringSet(".") Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), : key 46 (char '.') not in lookup table > alphabet(DNAStringSet()) [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" > DNA_ALPHABET [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" and from ?DNA_ALPHABET This alphabet contains all letters from the IUPAC Extended Genetic Alphabet (see '?IUPAC_CODE_MAP') + the gap ('"-"') and the hard masking ('"+"') letters. One possibility would be to add an option writeFastq(..., dashesASdots=FALSE), is that really a good idea? Martin > > > Thanks a lot, > Thomas > > -- > Thomas Rensch > PhD Student - Paul Flicek Group > EMBL-EBI > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 11.2 years ago Martin Morgan 25k

0

Entering edit mode

Hi Martin, I am not sure what is the best way of solving this, however it seems that dashes are not part of the FASTQ format (http://maq.sourceforge.net/fastq.shtml) which implies the writeFastq function is "wrong". If such as method was written ( writeFastq(..., dashesASdots=FALSE) ) I think it the default value should be TRUE. Again in my opinion it is a standard assumption to assume that a writeFastq function should always write a valid FASTQ file no? Best, Thomas -- Thomas Rensch PhD Student - Paul Flicek Group EMBL - European Bioinformatics Institute Wellcome Trust Genome Campus Hinxton, Cambridge CB10 1SD United Kingdom On 26 Feb 2013, at 18:38, Martin Morgan <mtmorgan@fhcrc.org> wrote: > Hi Thomas -- > > On 02/25/2013 09:08 AM, Thomas Rensch wrote: >> Hi everyone, >> >> I am reading and writing fastq files and writeFastq just swaps dots ('.') for dashes ('-'). >> >> Is this the desired behaviour of writeFastq and if so why? Otherwise could someone better at R development than I modify this? >> >> Example fastq: >> >> >> @HWI-EAS149_3:1:1:0:1956:0:1:0 >> .A.................................. >> + >> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >> @HWI-EAS149_3:1:1:0:173:0:1:0 >> .T.................................. >> + >> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >> @HWI-EAS149_3:1:1:0:47:0:1:0 >> .T.................................. >> + >> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB > > It's actually on input > > > sread(readFastq("tmp.fastq")) > A DNAStringSet instance of length 3 > width seq > [1] 36 -A---------------------------------- > [2] 36 -T---------------------------------- > [3] 36 -T---------------------------------- > > because '.' is not an a valid letter for DNAStringSet (or from the international standard, if I understand correctly...) > > > DNAStringSet(".") > Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), : > key 46 (char '.') not in lookup table > > alphabet(DNAStringSet()) > [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" > > DNA_ALPHABET > [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" > > and from ?DNA_ALPHABET > > This alphabet contains all letters from the IUPAC Extended Genetic > Alphabet (see '?IUPAC_CODE_MAP') + the gap ('"-"') and the hard > masking ('"+"') letters. > > One possibility would be to add an option writeFastq(..., dashesASdots=FALSE), is that really a good idea? > > Martin > >> >> >> Thanks a lot, >> Thomas >> >> -- >> Thomas Rensch >> PhD Student - Paul Flicek Group >> EMBL-EBI >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 [[alternative HTML version deleted]]

ADD REPLY • link 11.2 years ago Thomas Rensch ▴ 20

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 5 days ago

United States

On 02/27/2013 03:11 AM, Thomas Rensch wrote: > Hi Martin, > > I am not sure what is the best way of solving this, however it seems that dashes > are not part of the FASTQ format (http://maq.sourceforge.net/fastq.shtml) which > implies the writeFastq function is "wrong". If such as method was written > ( writeFastq(..., dashesASdots=FALSE) ) I think it the default value should be > TRUE. > > Again in my opinion it is a standard assumption to assume that a writeFastq > function should always write a valid FASTQ file no? Yes, it should read and write valid fastq, and the Bioc representation has obviously diverged from (this) standard. It seems like the class for representing fastq sequences needs to be generalized, probably to BStringSet from DNAStringSet, probably both in ShortRead and Rsamtools::scanBam. This would mean loss of some functionality (e.g., complement(), translate()) which don't make sense for a more general alphabet, and some unnecessary complexity (e.g., alphabetFrequency() would return frequencies of all 256 letters). It seems a shame to lose this, when it is clear that many fastq and bam files contain IUPAC-consistent sequences. My temptation is to make the input type user-selectable, c("DNAStringSet", "BStringSet") with the former the default. Your '.' would then be handled correctly in a round-trip. What does the broader community use non-IUPAC symbols for, including '.' and '~' but (paradoxically?) not '-'? Martin > > Best, > Thomas > > -- > Thomas Rensch > PhD Student - Paul Flicek Group > EMBL - European Bioinformatics Institute > Wellcome Trust Genome Campus > Hinxton, Cambridge > CB10 1SD > United Kingdom > > On 26 Feb 2013, at 18:38, Martin Morgan <mtmorgan at="" fhcrc.org=""> <mailto:mtmorgan at="" fhcrc.org="">> wrote: > >> Hi Thomas -- >> >> On 02/25/2013 09:08 AM, Thomas Rensch wrote: >>> Hi everyone, >>> >>> I am reading and writing fastq files and writeFastq just swaps dots ('.') for >>> dashes ('-'). >>> >>> Is this the desired behaviour of writeFastq and if so why? Otherwise could >>> someone better at R development than I modify this? >>> >>> Example fastq: >>> >>> >>> @HWI-EAS149_3:1:1:0:1956:0:1:0 >>> .A.................................. >>> + >>> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >>> @HWI-EAS149_3:1:1:0:173:0:1:0 >>> .T.................................. >>> + >>> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >>> @HWI-EAS149_3:1:1:0:47:0:1:0 >>> .T.................................. >>> + >>> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >> >> It's actually on input >> >> > sread(readFastq("tmp.fastq")) >> A DNAStringSet instance of length 3 >> width seq >> [1] 36 -A---------------------------------- >> [2] 36 -T---------------------------------- >> [3] 36 -T---------------------------------- >> >> because '.' is not an a valid letter for DNAStringSet (or from the >> international standard, if I understand correctly...) >> >> > DNAStringSet(".") >> Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), : >> key 46 (char '.') not in lookup table >> > alphabet(DNAStringSet()) >> [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" >> > DNA_ALPHABET >> [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" >> >> and from ?DNA_ALPHABET >> >> This alphabet contains all letters from the IUPAC Extended Genetic >> Alphabet (see '?IUPAC_CODE_MAP') + the gap ('"-"') and the hard >> masking ('"+"') letters. >> >> One possibility would be to add an option writeFastq(..., dashesASdots=FALSE), >> is that really a good idea? >> >> Martin >> >>> >>> >>> Thanks a lot, >>> Thomas >>> >>> -- >>> Thomas Rensch >>> PhD Student - Paul Flicek Group >>> EMBL-EBI >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 11.2 years ago Martin Morgan 25k

Login before adding your answer.