Trouble creating a ShortReadQ object and using writeFastq

0

Entering edit mode

Noah Dowell ▴ 410

@noah-dowell-3791

Last seen 9.7 years ago

Hello All, I am trying to do some preprocessing quality analysis on PacBio reads. To make any plots of quality score or nucleotide frequency one should only look at reads of the same length. One output from PacBio filtering of the raw data is a FASTQ file. The nature of the PacBio reads is that they lack uniformity in their length dimension. I am pretty sure the file is okay because the readSeqFile() function in the QRQC package is able to read the data in and the FASTQSummary object can be used to make plots. Given the non-uniform length of reads though those plots show artifacts. As far as can tell the FASTQSummary object can not be manipulated like a ShortRead or BioStrings object. So I turn to the ShortRead package. I have also successfully used the readFastq() function in ShortRead package to read in Illumina reads of different length and then preprocess or slice-n-dice those reads. I am getting the following error though when I try this on PacBio fastq files: ###################################################################### ############### > dirPath = "/Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_25 5/017249_12cells_filtered/data/" > pat1 = "filteredSTD_12cells.fastq" > seqs1 = readFastq(dirPath = dirPath, pattern = pat1, file = "filteredSTD_12cells.fastq") Error: Input/Output file(s): /Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_255/017249_ 12cells_filtered/data//filteredSTD_12cells.fastq message: line too long /Users/ndowell/Documents/PacBio/130204_NLD1_X L_12cells_255/017249_12cells_filtered/data//filteredSTD_12cells.fastq: 17521 ###################################################################### ############### So I tried to manually build a ShortReadQ file using the following (there are generally 4 distinct lines in a FASTQ file): #workaround to error with too long lines seqTemp <- readLines("filteredSTD_12cells.fastq") seqs <- DNAStringSet(seqTemp[c(FALSE, TRUE, FALSE, FALSE)]) ids <- BStringSet(seqTemp[c(TRUE, FALSE, FALSE, FALSE)]) qual <- BStringSet(seqTemp[c(FALSE, FALSE, FALSE, TRUE)]) SeqClean <- ShortReadQ(sread=seqs, quality = qual, id = ids) #This worked!!: length(SeqClean) #484232 summary(width(SeqClean)) Min. 1st Qu. Median Mean 3rd Qu. Max. 50 1368 2153 2785 3567 26940 #But I get this error when I try to write the file (changing withIds to TRUE or full to TRUE gives same error): > writeFastq(SeqClean, filepath = "cleanReads.fastq", format = "fastq", mode = "w", full = FALSE, withIds = FALSE) Error in function (classes, fdef, mtable) : unable to find an inherited method for function ?writeFastq? for signature ?"ShortReadQ", "missing"? #I think something is not quite right with my construction of the ShortReadQ object and specifically the read IDs #I wonder if it is because it is slotting the ids into the wrong place: > head(ids) A BStringSet instance of length 6 width seq [1] 72 @m130205_030957_42152_c100453962550000001523068903201342_s1_ p0/21/0_5104 [2] 75 @m130205_030957_42152_c100453962550000001523068903201342_s1_ p0/53/2430_3193 [3] 75 @m130205_030957_42152_c100453962550000001523068903201342_s1_ p0/53/3237_6580 [4] 75 @m130205_030957_42152_c100453962550000001523068903201342_s1_ p0/53/6626_9948 [5] 72 @m130205_030957_42152_c100453962550000001523068903201342_s1_ p0/54/0_4929 [6] 74 @m130205_030957_42152_c100453962550000001523068903201342_s1_ p0/81/276_2420 ###################################################################### ############### My questions: 1. Is there any way to leverage the ability to slice-n-dice a FASTQ data set in the ShortRead package and then write to a new FASTQ file? 2 Should I just try to add in names to my sequences manually (to the ShortReadQ object) to alleviate the 'mssing' error above? 3. Is there any desire from the developers to merge the id() and names() functions? Any help/hints would be appreciated. I apologize for the lack of a toy-reproducible example; I was unsuccessful in my attempt to create a toy quality score. Thank you in advance for your assistance and time. Best, Noah ###################################################################### ############### > sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] grid stats graphics grDevices utils datasets methods base other attached packages: [1] qrqc_1.12.0 testthat_0.7 xtable_1.7-0 brew_1.0-6 biovizBase_1.6.2 ggplot2_0.9.3 [7] reshape_0.8.4 plyr_1.8 org.Hs.eg.db_2.8.0 RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.20.3 [13] Biobase_2.18.0 BiocInstaller_1.8.3 ShortRead_1.16.3 latticeExtra_0.6-24 RColorBrewer_1.0-5 Rsamtools_1.10.2 [19] lattice_0.20-13 Biostrings_2.26.3 GenomicRanges_1.10.6 IRanges_1.16.5 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] biomaRt_2.14.0 bitops_1.0-4.2 BSgenome_1.26.1 cluster_1.14.3 colorspace_1.2-1 [6] dichromat_2.0-0 digest_0.6.2 evaluate_0.4.3 GenomicFeatures_1.10.1 gtable_0.1.2 [11] Hmisc_3.10-1 hwriter_1.3 labeling_0.1 MASS_7.3-23 munsell_0.4 [16] parallel_2.15.2 proto_0.3-10 RCurl_1.95-3 reshape2_1.2.2 rtracklayer_1.18.2 [21] scales_0.2.3 stats4_2.15.2 stringr_0.6.2 tools_2.15.2 XML_3.95-0.1 [26] zlibbioc_1.4.0

Preprocessing ShortRead Preprocessing ShortRead • 2.6k views

ADD COMMENT • link updated 11.2 years ago by Martin Morgan 25k • written 11.2 years ago by Noah Dowell ▴ 410

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 16 days ago

United States

Hi Noah -- On 2/20/2013 2:47 PM, Noah Dowell wrote: > Hello All, > > I am trying to do some preprocessing quality analysis on PacBio reads. To make any plots of quality score or nucleotide frequency one should only look at reads of the same length. One output from PacBio filtering of the raw data is a FASTQ file. The nature of the PacBio reads is that they lack uniformity in their length dimension. I am pretty sure the file is okay because the readSeqFile() function in the QRQC package is able to read the data in and the FASTQSummary object can be used to make plots. Given the non-uniform length of reads though those plots show artifacts. As far as can tell the FASTQSummary object can not be manipulated like a ShortRead or BioStrings object. > > So I turn to the ShortRead package. I have also successfully used the readFastq() function in ShortRead package to read in Illumina reads of different length and then preprocess or slice-n-dice those reads. I am getting the following error though when I try this on PacBio fastq files: > #################################################################### ################# >> dirPath = "/Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_2 55/017249_12cells_filtered/data/" >> pat1 = "filteredSTD_12cells.fastq" >> seqs1 = readFastq(dirPath = dirPath, pattern = pat1, file = "filteredSTD_12cells.fastq") > Error: Input/Output > file(s): > /Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_255/0172 49_12cells_filtered/data//filteredSTD_12cells.fastq > message: line too long /Users/ndowell/Documents/PacBio/130204_NLD 1_XL_12cells_255/017249_12cells_filtered/data//filteredSTD_12cells.fas tq:17521 is there something unusual about this line of this file, e.g., blank or otherwise? How long are reads, anyway? Maybe you can trigger this in a subset of the file with just one or two records, and you'd be willing to share that with me? The error could come from one of two places in the C code, and both are preceded by a comment 'this should never happen' > > > #################################################################### ################# > > So I tried to manually build a ShortReadQ file using the following (there are generally 4 distinct lines in a FASTQ file): > #workaround to error with too long lines > seqTemp <- readLines("filteredSTD_12cells.fastq") > seqs <- DNAStringSet(seqTemp[c(FALSE, TRUE, FALSE, FALSE)]) > ids <- BStringSet(seqTemp[c(TRUE, FALSE, FALSE, FALSE)]) > qual <- BStringSet(seqTemp[c(FALSE, FALSE, FALSE, TRUE)]) > > SeqClean <- ShortReadQ(sread=seqs, quality = qual, id = ids) > > #This worked!!: > length(SeqClean) #484232 > summary(width(SeqClean)) > Min. 1st Qu. Median Mean 3rd Qu. Max. > 50 1368 2153 2785 3567 26940 > > #But I get this error when I try to write the file (changing withIds to TRUE or full to TRUE gives same error): >> writeFastq(SeqClean, filepath = "cleanReads.fastq", format = "fastq", mode = "w", full = FALSE, withIds = FALSE) > Error in function (classes, fdef, mtable) : > unable to find an inherited method for function ?writeFastq? for signature ?"ShortReadQ", "missing"? > args(writeFastq) function (object, file, mode = "w", full = FALSE, ...) NULL so I think your 'filepath' should be file="cleanReads.fastq". > > #I think something is not quite right with my construction of the ShortReadQ object and specifically the read IDs > #I wonder if it is because it is slotting the ids into the wrong place: >> head(ids) > A BStringSet instance of length 6 > width seq > [1] 72 @m130205_030957_42152_c100453962550000001523068903201342_s 1_p0/21/0_5104 > [2] 75 @m130205_030957_42152_c100453962550000001523068903201342_s 1_p0/53/2430_3193 > [3] 75 @m130205_030957_42152_c100453962550000001523068903201342_s 1_p0/53/3237_6580 > [4] 75 @m130205_030957_42152_c100453962550000001523068903201342_s 1_p0/53/6626_9948 > [5] 72 @m130205_030957_42152_c100453962550000001523068903201342_s 1_p0/54/0_4929 > [6] 74 @m130205_030957_42152_c100453962550000001523068903201342_s 1_p0/81/276_2420 > #################################################################### ################# > > My questions: > 1. Is there any way to leverage the ability to slice-n-dice a FASTQ data set in the ShortRead package and then write to a new FASTQ file? > > 2 Should I just try to add in names to my sequences manually (to the ShortReadQ object) to alleviate the 'mssing' error above? > > 3. Is there any desire from the developers to merge the id() and names() functions? would have been a good choice, in retrospect ;) Martin > > > Any help/hints would be appreciated. I apologize for the lack of a toy-reproducible example; I was unsuccessful in my attempt to create a toy quality score. Thank you in advance for your assistance and time. > > Best, > Noah > > > > > > > > > #################################################################### ################# > > >> sessionInfo() > R version 2.15.2 (2012-10-26) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] grid stats graphics grDevices utils datasets methods base > > other attached packages: > [1] qrqc_1.12.0 testthat_0.7 xtable_1.7-0 brew_1.0-6 biovizBase_1.6.2 ggplot2_0.9.3 > [7] reshape_0.8.4 plyr_1.8 org.Hs.eg.db_2.8.0 RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.20.3 > [13] Biobase_2.18.0 BiocInstaller_1.8.3 ShortRead_1.16.3 latticeExtra_0.6-24 RColorBrewer_1.0-5 Rsamtools_1.10.2 > [19] lattice_0.20-13 Biostrings_2.26.3 GenomicRanges_1.10.6 IRanges_1.16.5 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.14.0 bitops_1.0-4.2 BSgenome_1.26.1 cluster_1.14.3 colorspace_1.2-1 > [6] dichromat_2.0-0 digest_0.6.2 evaluate_0.4.3 GenomicFeatures_1.10.1 gtable_0.1.2 > [11] Hmisc_3.10-1 hwriter_1.3 labeling_0.1 MASS_7.3-23 munsell_0.4 > [16] parallel_2.15.2 proto_0.3-10 RCurl_1.95-3 reshape2_1.2.2 rtracklayer_1.18.2 > [21] scales_0.2.3 stats4_2.15.2 stringr_0.6.2 tools_2.15.2 XML_3.95-0.1 > [26] zlibbioc_1.4.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Dr. Martin Morgan, PhD Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

ADD COMMENT • link 11.2 years ago Martin Morgan 25k

0

Entering edit mode

Thank you Martin! Once again you caught a simple mistake I was making with regards to using the wrong argument (filepath instead of file). Changing it to 'file' allowed writeFastq to work but then reading the newly written fastq file threw the same "line error" when I tried to read it in with readFastq. It is hard to find anything wrong with that specific line. It is a line containing DNA sequence that is 23728 nt long. I opened the file in vi and went to that line and looked for anything odd but did not "see" anything to edit. The exact same data in fasta format can be read in using readBStringSet( format = "fasta") in the BioStrings package. So length of sequence does not seem to be a problem and maybe this particular file is just corrupted. Sorry for bothering the list with this issue. Moving forward the utilities in Biostrings and ShortRead for manipulating fasta files are very powerful for people with PacBio long reads or assembling genomes and extracting contigs/scaffolds associated with specific genes. This was not the intended user function but I thought it might be useful to provide another user scenario for the developers. Extending the FASTQ functions to work on reads of non-uniform length might be a worthwhile discussion for your team. Alternatively, allowing slots within SequenceSummary objects (FASTQSummary in qrqc package) to be accessed by the same methods as a XString-Set object could leverage the power of both BioStrings and qrqc packages. I recognize these are non-trivial requests. Thank you for all the great work you and your Bioconductor team do. Best, Noah On Wed, Feb 20, 2013 at 5:08 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > Hi Noah -- > > > On 2/20/2013 2:47 PM, Noah Dowell wrote: > >> Hello All, >> >> I am trying to do some preprocessing quality analysis on PacBio reads. >> To make any plots of quality score or nucleotide frequency one should only >> look at reads of the same length. One output from PacBio filtering of the >> raw data is a FASTQ file. The nature of the PacBio reads is that they lack >> uniformity in their length dimension. I am pretty sure the file is okay >> because the readSeqFile() function in the QRQC package is able to read the >> data in and the FASTQSummary object can be used to make plots. Given the >> non-uniform length of reads though those plots show artifacts. As far as >> can tell the FASTQSummary object can not be manipulated like a ShortRead or >> BioStrings object. >> >> So I turn to the ShortRead package. I have also successfully used the >> readFastq() function in ShortRead package to read in Illumina reads of >> different length and then preprocess or slice-n-dice those reads. I am >> getting the following error though when I try this on PacBio fastq files: >> ##############################**##############################** >> ######################### >> >>> dirPath = "/Users/ndowell/Documents/**PacBio/130204_NLD1_XL_12cells_** >>> 255/017249_12cells_filtered/**data/" >>> pat1 = "filteredSTD_12cells.fastq" >>> seqs1 = readFastq(dirPath = dirPath, pattern = pat1, file = >>> "filteredSTD_12cells.fastq") >>> >> Error: Input/Output >> file(s): >> /Users/ndowell/Documents/**PacBio/130204_NLD1_XL_12cells_** >> 255/017249_12cells_filtered/**data//filteredSTD_12cells.**fastq >> message: line too long /Users/ndowell/Documents/** >> PacBio/130204_NLD1_XL_12cells_**255/017249_12cells_filtered/** >> data//filteredSTD_12cells.**fastq:17521 >> > > is there something unusual about this line of this file, e.g., blank or > otherwise? How long are reads, anyway? Maybe you can trigger this in a > subset of the file with just one or two records, and you'd be willing to > share that with me? The error could come from one of two places in the C > code, and both are preceded by a comment 'this should never happen' > > > >> >> ##############################**##############################** >> ######################### >> >> So I tried to manually build a ShortReadQ file using the following (there >> are generally 4 distinct lines in a FASTQ file): >> #workaround to error with too long lines >> seqTemp <- readLines("filteredSTD_**12cells.fastq") >> seqs <- DNAStringSet(seqTemp[c(FALSE, TRUE, FALSE, FALSE)]) >> ids <- BStringSet(seqTemp[c(TRUE, FALSE, FALSE, FALSE)]) >> qual <- BStringSet(seqTemp[c(FALSE, FALSE, FALSE, TRUE)]) >> >> SeqClean <- ShortReadQ(sread=seqs, quality = qual, id = ids) >> >> #This worked!!: >> length(SeqClean) #484232 >> summary(width(SeqClean)) >> Min. 1st Qu. Median Mean 3rd Qu. Max. >> 50 1368 2153 2785 3567 26940 >> >> #But I get this error when I try to write the file (changing withIds to >> TRUE or full to TRUE gives same error): >> >>> writeFastq(SeqClean, filepath = "cleanReads.fastq", format = "fastq", >>> mode = "w", full = FALSE, withIds = FALSE) >>> >> Error in function (classes, fdef, mtable) : >> unable to find an inherited method for function writeFastq for >> signature "ShortReadQ", "missing" >> > > > args(writeFastq) > function (object, file, mode = "w", full = FALSE, ...) > NULL > > so I think your 'filepath' should be file="cleanReads.fastq". > > > >> #I think something is not quite right with my construction of the >> ShortReadQ object and specifically the read IDs >> #I wonder if it is because it is slotting the ids into the wrong place: >> >>> head(ids) >>> >> A BStringSet instance of length 6 >> width seq >> [1] 72 @m130205_030957_42152_**c10045396255000000152306890320** >> 1342_s1_p0/21/0_5104 >> [2] 75 @m130205_030957_42152_**c10045396255000000152306890320** >> 1342_s1_p0/53/2430_3193 >> [3] 75 @m130205_030957_42152_**c10045396255000000152306890320** >> 1342_s1_p0/53/3237_6580 >> [4] 75 @m130205_030957_42152_**c10045396255000000152306890320** >> 1342_s1_p0/53/6626_9948 >> [5] 72 @m130205_030957_42152_**c10045396255000000152306890320** >> 1342_s1_p0/54/0_4929 >> [6] 74 @m130205_030957_42152_**c10045396255000000152306890320** >> 1342_s1_p0/81/276_2420 >> ##############################**##############################** >> ######################### >> >> My questions: >> 1. Is there any way to leverage the ability to slice-n-dice a FASTQ data >> set in the ShortRead package and then write to a new FASTQ file? >> >> 2 Should I just try to add in names to my sequences manually (to the >> ShortReadQ object) to alleviate the 'mssing' error above? >> >> 3. Is there any desire from the developers to merge the id() and names() >> functions? >> > > would have been a good choice, in retrospect ;) > > Martin > > > >> >> Any help/hints would be appreciated. I apologize for the lack of a >> toy-reproducible example; I was unsuccessful in my attempt to create a toy >> quality score. Thank you in advance for your assistance and time. >> >> Best, >> Noah >> >> >> >> >> >> >> >> >> ##############################**##############################** >> ######################### >> >> >> sessionInfo() >>> >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-apple-darwin9.8.0/x86_**64 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.**UTF-8/C/en_US.UTF-8/en_US.UTF-**8 >> >> attached base packages: >> [1] grid stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] qrqc_1.12.0 testthat_0.7 xtable_1.7-0 >> brew_1.0-6 biovizBase_1.6.2 ggplot2_0.9.3 >> [7] reshape_0.8.4 plyr_1.8 org.Hs.eg.db_2.8.0 >> RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.20.3 >> [13] Biobase_2.18.0 BiocInstaller_1.8.3 ShortRead_1.16.3 >> latticeExtra_0.6-24 RColorBrewer_1.0-5 Rsamtools_1.10.2 >> [19] lattice_0.20-13 Biostrings_2.26.3 GenomicRanges_1.10.6 >> IRanges_1.16.5 BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.14.0 bitops_1.0-4.2 BSgenome_1.26.1 >> cluster_1.14.3 colorspace_1.2-1 >> [6] dichromat_2.0-0 digest_0.6.2 evaluate_0.4.3 >> GenomicFeatures_1.10.1 gtable_0.1.2 >> [11] Hmisc_3.10-1 hwriter_1.3 labeling_0.1 >> MASS_7.3-23 munsell_0.4 >> [16] parallel_2.15.2 proto_0.3-10 RCurl_1.95-3 >> reshape2_1.2.2 rtracklayer_1.18.2 >> [21] scales_0.2.3 stats4_2.15.2 stringr_0.6.2 >> tools_2.15.2 XML_3.95-0.1 >> [26] zlibbioc_1.4.0 >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > > -- > Dr. Martin Morgan, PhD > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > [[alternative HTML version deleted]]

ADD REPLY • link 11.2 years ago Noah Dowell ▴ 410

0

Entering edit mode

On 02/21/2013 08:47 AM, NOAH LEE DOWELL wrote: > Thank you Martin! > > Once again you caught a simple mistake I was making with regards to using the > wrong argument (filepath instead of file). Changing it to 'file' allowed > writeFastq to work but then reading the newly written fastq file threw the same > "line error" when I tried to read it in with readFastq. > > It is hard to find anything wrong with that specific line. It is a line > containing DNA sequence that is 23728 nt long. I opened the file in vi and went The line length itself might have caused problems; ShortRead was written when short reads were 36 nt (back in the day...) though it could accommodate short reads of up to about 20k. I made a temporary patch that should handle much longer reads (200k) available via biocLite as 1.16.4 probably noon tomorrow, PST. > to that line and looked for anything odd but did not "see" anything to edit. > The exact same data in fasta format can be read in using readBStringSet( > format = "fasta") in the BioStrings package. So length of sequence does not > seem to be a problem and maybe this particular file is just corrupted. Sorry > for bothering the list with this issue. They're using different input mechanisms, so have different limits. > > Moving forward the utilities in Biostrings and ShortRead for manipulating fasta > files are very powerful for people with PacBio long reads or assembling genomes > and extracting contigs/scaffolds associated with specific genes. This was not > the intended user function but I thought it might be useful to provide another > user scenario for the developers. > > Extending the FASTQ functions to work on reads of non-uniform length might be > a worthwhile discussion for your team. Alternatively, allowing slots within Generally, uneven read lengths in ShortRead are not (meant to be) a problem. Maybe you can point to some specific issues you're having? Martin > SequenceSummary objects (FASTQSummary in qrqc package) to be accessed by the > same methods as a XString-Set object could leverage the power of both BioStrings > and qrqc packages. I recognize these are non-trivial requests. > > Thank you for all the great work you and your Bioconductor team do. > > Best, > Noah > > > On Wed, Feb 20, 2013 at 5:08 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> <mailto:mtmorgan at="" fhcrc.org="">> wrote: > > Hi Noah -- > > > On 2/20/2013 2:47 PM, Noah Dowell wrote: > > Hello All, > > I am trying to do some preprocessing quality analysis on PacBio reads. > To make any plots of quality score or nucleotide frequency one should > only look at reads of the same length. One output from PacBio filtering > of the raw data is a FASTQ file. The nature of the PacBio reads is that > they lack uniformity in their length dimension. I am pretty sure the > file is okay because the readSeqFile() function in the QRQC package is > able to read the data in and the FASTQSummary object can be used to make > plots. Given the non-uniform length of reads though those plots show > artifacts. As far as can tell the FASTQSummary object can not be > manipulated like a ShortRead or BioStrings object. > > So I turn to the ShortRead package. I have also successfully used the > readFastq() function in ShortRead package to read in Illumina reads of > different length and then preprocess or slice-n-dice those reads. I am > getting the following error though when I try this on PacBio fastq files: > ##############################__############################ ##__######################### > > dirPath = > "/Users/ndowell/Documents/__PacBio/130204_NLD1_XL_12cell s___255/017249_12cells_filtered/__data/" > pat1 = "filteredSTD_12cells.fastq" > seqs1 = readFastq(dirPath = dirPath, pattern = pat1, file = > "filteredSTD_12cells.fastq") > > Error: Input/Output > file(s): > > /Users/ndowell/Documents/__PacBio/130204_NLD1_XL_12cells__ _255/017249_12cells_filtered/__data//filteredSTD_12cells.__fastq > message: line too long > /Users/ndowell/Documents/__PacBio/130204_NLD1_XL_12cells___2 55/017249_12cells_filtered/__data//filteredSTD_12cells.__fastq:17521 > > > is there something unusual about this line of this file, e.g., blank or > otherwise? How long are reads, anyway? Maybe you can trigger this in a > subset of the file with just one or two records, and you'd be willing to > share that with me? The error could come from one of two places in the C > code, and both are preceded by a comment 'this should never happen' > > > > > ##############################__############################ ##__######################### > > So I tried to manually build a ShortReadQ file using the following > (there are generally 4 distinct lines in a FASTQ file): > #workaround to error with too long lines > seqTemp <- readLines("filteredSTD___12cells.fastq") > seqs <- DNAStringSet(seqTemp[c(FALSE, TRUE, FALSE, FALSE)]) > ids <- BStringSet(seqTemp[c(TRUE, FALSE, FALSE, FALSE)]) > qual <- BStringSet(seqTemp[c(FALSE, FALSE, FALSE, TRUE)]) > > SeqClean <- ShortReadQ(sread=seqs, quality = qual, id = ids) > > #This worked!!: > length(SeqClean) #484232 > summary(width(SeqClean)) > Min. 1st Qu. Median Mean 3rd Qu. Max. > 50 1368 2153 2785 3567 26940 > > #But I get this error when I try to write the file (changing withIds to > TRUE or full to TRUE gives same error): > > writeFastq(SeqClean, filepath = "cleanReads.fastq", format = > "fastq", mode = "w", full = FALSE, withIds = FALSE) > > Error in function (classes, fdef, mtable) : > unable to find an inherited method for function ?writeFastq? for > signature ?"ShortReadQ", "missing"? > > > > args(writeFastq) > function (object, file, mode = "w", full = FALSE, ...) > NULL > > so I think your 'filepath' should be file="cleanReads.fastq". > > > > #I think something is not quite right with my construction of the > ShortReadQ object and specifically the read IDs > #I wonder if it is because it is slotting the ids into the wrong place: > > head(ids) > > A BStringSet instance of length 6 > width seq > [1] 72 > @m130205_030957_42152___c10045396255000000152306890320__1342 _s1_p0/21/0_5104 > [2] 75 > @m130205_030957_42152___c10045396255000000152306890320__1342 _s1_p0/53/2430_3193 > [3] 75 > @m130205_030957_42152___c10045396255000000152306890320__1342 _s1_p0/53/3237_6580 > [4] 75 > @m130205_030957_42152___c10045396255000000152306890320__1342 _s1_p0/53/6626_9948 > [5] 72 > @m130205_030957_42152___c10045396255000000152306890320__1342 _s1_p0/54/0_4929 > [6] 74 > @m130205_030957_42152___c10045396255000000152306890320__1342 _s1_p0/81/276_2420 > ##############################__############################ ##__######################### > > My questions: > 1. Is there any way to leverage the ability to slice-n-dice a FASTQ data > set in the ShortRead package and then write to a new FASTQ file? > > 2 Should I just try to add in names to my sequences manually (to the > ShortReadQ object) to alleviate the 'mssing' error above? > > 3. Is there any desire from the developers to merge the id() and names() > functions? > > > would have been a good choice, in retrospect ;) > > Martin > > > > > Any help/hints would be appreciated. I apologize for the lack of a > toy-reproducible example; I was unsuccessful in my attempt to create a > toy quality score. Thank you in advance for your assistance and time. > > Best, > Noah > > > > > > > > > ##############################__############################ ##__######################### > > > sessionInfo() > > R version 2.15.2 (2012-10-26) > Platform: x86_64-apple-darwin9.8.0/x86___64 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.__UTF-8/C/en_US.UTF-8/en_US.UTF-__8 > > attached base packages: > [1] grid stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] qrqc_1.12.0 testthat_0.7 xtable_1.7-0 > brew_1.0-6 biovizBase_1.6.2 ggplot2_0.9.3 > [7] reshape_0.8.4 plyr_1.8 org.Hs.eg.db_2.8.0 > RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.20.3 > [13] Biobase_2.18.0 BiocInstaller_1.8.3 ShortRead_1.16.3 > latticeExtra_0.6-24 RColorBrewer_1.0-5 Rsamtools_1.10.2 > [19] lattice_0.20-13 Biostrings_2.26.3 GenomicRanges_1.10.6 > IRanges_1.16.5 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.14.0 bitops_1.0-4.2 BSgenome_1.26.1 > cluster_1.14.3 colorspace_1.2-1 > [6] dichromat_2.0-0 digest_0.6.2 evaluate_0.4.3 > GenomicFeatures_1.10.1 gtable_0.1.2 > [11] Hmisc_3.10-1 hwriter_1.3 labeling_0.1 > MASS_7.3-23 munsell_0.4 > [16] parallel_2.15.2 proto_0.3-10 RCurl_1.95-3 > reshape2_1.2.2 rtracklayer_1.18.2 > [21] scales_0.2.3 stats4_2.15.2 stringr_0.6.2 > tools_2.15.2 XML_3.95-0.1 > [26] zlibbioc_1.4.0 > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > -- > Dr. Martin Morgan, PhD > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 11.2 years ago Martin Morgan 25k

Login before adding your answer.