Question

ShortRead 3' trimming negative length vectors are not allowed

0

Entering edit mode

Sam McInturf ▴ 300

@sam-mcinturf-5291

Last seen 8.6 years ago

United States

Hello everyone, I am trying to do some quality trimming on some RNA seq reads (Illumina, 100 bp, single end). I have been following the pipeline from the UCR manuals<http: manuals.bioinformatics.ucr.edu="" home="" ht-seq#toc-="" trimming-low-quality-3-tails-from-r=""> but I have run into an error I can't seem to resolve. library(ShortRead) reads <- readFastq("myFile.fastq") qualityCutoff <- 10 # remove read tails with quality lower than this seqs <- sread(reads) # get sequence list qual <- PhredQuality(quality(quality(reads))) # get quality score list as PhredQuality > length(qual) #my length is positive [1] 39145018 > myqual_mat <- matrix(charToRaw(as.character(unlist(qual))), nrow=length(qual), byrow=TRUE) Error in .Call(.NAME, ..., PACKAGE = PACKAGE) : negative length vectors are not allowed as(qual, matrix) gives the same error, and as.matrix(qual) does as well, not sure how those two commands are different, but tried anyway. If I run the exact same commands, instead of the full 39145018 reads, I use the first 100, it works like a charm. I am working on a linux cluster, if that is important Does anyone have an idea? Cheers! -- Sam McInturf A look at my data: > seqs A DNAStringSet instance of length 39145018 width seq [1] 100 ATCNCAAAGGTATATTTTGAATCAAAAACA...TTCTTTTTCTGGACCTTGAGGGAAAGGGGG [2] 100 CGANCTTGACCGTTGAGAAATTATCAGCTG...AGCTTCAGTCAATTCCACACAGTCTGAAAA [3] 100 GATNGTAGGACGGAAAAAGAATNNNNNGTT...GAAANNNGTTGGTCACAACTGTTANNCACC [4] 100 GCAACGTTATCGGCGATCGCACTGAAGACG...CCAGAGTTCTCCTCCCACTGATCGTCCTGT [5] 100 GACGATTAAAGGAGCTTATGAGGACTCTTT...TTGTTTGATAGGATTATTCAGAGAGGACAT ... ... ... [39145014] 100 ACAAGGGATAAAATCAAAGATGGGTGATAA...TGAAGCTTGGTGGCTTGTAGGCAATGAAAC [39145015] 100 CATGCTTCTCCTTCTCTTTTTCTTTGTGTC...TTGTTCTCCTCCTCAACATCTCTCTCCCTC [39145016] 100 CAGCGACTATCAGTGTAAACTCCAATCTTG...CTGGCCGTACCAATCCAACCGCATGCTTCA [39145017] 100 CGTGCGTGGGATGTTTCCGGTGAGGCGTGG...CCGATCCGGTACGAGATCTCACCGCAGACT [39145018] 100 TACTTGAGGAAATGGTGTTTAGTTGCTGTG...AGCCTCAAGTTCCACGTTACTTCATCTTTG > qual A PhredQuality instance of length 39145018 width seq [1] 100 @@@#4=D?DD<dhgiiiidhhiiiiiighh...=?############################ [2]="" 100="" @@@#4="=A??DHH?BBBBCEGHCHGIIIII...@=AEEAC@??CBCECE@">;?1=5>CCC@C3 [3] 100 ##############################...############################## [4] 100 @@CADDFDFHFGGIIIIIIG=GEIIEGHIH...ACCCA?>>:>A>:<c?<a<a:c@ ?<<?###="" [5]="" 100="" c@bfffffhghhhiiijjjjjjjjiijjji.="" ..giighehhhhffffffceedeeeddddbdd="" ...="" ...="" ...="" [39145014]="" 100="" @@@="?ADD">CFFFBFIIBHIIHF=A??DD<ccdcdddcbbbd3?:@a@c [39145017]="" 100="" ?=":+==2@<CDDDE@BEIDA" ##########...##############################="" [39145018]="" 100="" (;<38="">)<7>@>8=:)2:-=@;)9)=9?.2...############################## > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] ShortRead_1.18.0 latticeExtra_0.6-24 RColorBrewer_1.0-5 [4] Rsamtools_1.12.3 lattice_0.20-15 Biostrings_2.28.0 [7] GenomicRanges_1.12.3 IRanges_1.18.1 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] Biobase_2.20.0 bitops_1.0-5 grid_3.0.0 hwriter_1.3 stats4_3.0.0 [6] zlibbioc_1.6.0 [[alternative HTML version deleted]]

charm charm • 1.7k views

ADD COMMENT • link updated 10.9 years ago by Martin Morgan 25k • written 10.9 years ago by Sam McInturf ▴ 300

score 0 · Answer 1 · 2013-05-17

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 12 hours ago

United States

On 05/17/2013 09:19 AM, Sam McInturf wrote: > Hello everyone, > I am trying to do some quality trimming on some RNA seq reads (Illumina, > 100 bp, single end). I have been following the pipeline from the UCR > manuals<http: manuals.bioinformatics.ucr.edu="" home="" ht-seq#toc-="" trimming-low-quality-3-tails-from-r=""> > but > I have run into an error I can't seem to resolve. > > library(ShortRead) > reads <- readFastq("myFile.fastq") > > qualityCutoff <- 10 # remove read tails with quality lower than this > > seqs <- sread(reads) # get sequence list > > qual <- PhredQuality(quality(quality(reads))) # get quality score list as > PhredQuality > >> length(qual) #my length is positive > [1] 39145018 > >> myqual_mat <- matrix(charToRaw(as.character(unlist(qual))), > nrow=length(qual), byrow=TRUE) I'm guessing that what's happening is that as.character(unlist(qual)) is making a very large vector sum(width(qual)), larger than can be represented in (your) version of R (the output of sessionInfo() would be helpful...). But I don't think you need to go this route; have you looked at ShortRead::trimEnds & friends? For instance, after running exmample(trimEnds) I have an object 'rfq' with some qualities > quality(rfq) class: SFastqQuality quality: A BStringSet instance of length 256 width seq [1] 36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS [2] 36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK [3] 36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA [4] 36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA [5] 36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS ... ... ... [252] 36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAOC [253] 36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM [254] 36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOOHH [255] 36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJSJ [256] 36 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHNH And I can simultaneously trim reads and sequences with trimEnds(rfq, "V") which you can see in action as > quality(trimEnds(rfq, "V")) class: SFastqQuality quality: A BStringSet instance of length 256 width seq [1] 27 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]] [2] 31 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZ [3] 32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUX [4] 31 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZ [5] 28 ]]]]]]]]]]]]]]]]]T]]]]]]]]]] ... ... ... [252] 28 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR] [253] 27 ]]]]]]]]]]]]]]]]]]]]]]]Y]]] [254] 31 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZ [255] 32 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVX [256] 24 ]]]]]]]]]]]Y]]T]T]]]]TRY The following might be helpful to transform a large file -- setMethod(trimEnds, "character", function(object, a, left = TRUE, right = TRUE, relation = c("<=", "=="), ..., destination, yieldSize=1000000, ranges = FALSE) { if (missing('destination')) stop("'destination' required") strm <- FastqStreamer(object, yieldSize) tot <- nNuc <- nTrimNuc <- 0L while (length(fq <- yield(strm))) { tot <- tot + length(fq) nNuc <- nNuc + sum(width(fq)) fq <- trimEnds(fq, a, left, right, relation, ..., ranges=ranges) nTrimNuc <- nTrimNuc + sum(width(fq)) writeFastq(fq, destination, "a") } list(destination=destination, c(TotalReads=tot, TotalNucleotides=nNuc, TrimmedNucleotides=nNuc - nTrimNuc)) }) provide the first argument to trimEnds as a simple character vector, and provide a 'destination' file name. For example ## just a convenient path to a fastq file fl <- file.path(analysisPath(sp), "s_1_sequence.txt") trimEnds(fl, "V", destination=tempfile()) A variation of this will be added to the 'devel' version of ShortRead, so if there are suggestions for improvements please let me know. Martin > Error in .Call(.NAME, ..., PACKAGE = PACKAGE) : > negative length vectors are not allowed > > as(qual, matrix) gives the same error, and as.matrix(qual) does as well, > not sure how those two commands are different, but tried anyway. > > If I run the exact same commands, instead of the full 39145018 reads, I use > the first 100, it works like a charm. I am working on a linux cluster, if > that is important > > Does anyone have an idea? > > Cheers! > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 10.9 years ago Martin Morgan 25k

0

Entering edit mode

Martin, I was unaware of the trimEnds() function, it worked wonderfully without having to use your modified trimEnds(). I imagine it is because the cluster has a lot of memory available. I ran it a second time using the modification, but it was much slower. Thanks again! On Fri, May 17, 2013 at 12:20 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > On 05/17/2013 09:19 AM, Sam McInturf wrote: > >> Hello everyone, >> I am trying to do some quality trimming on some RNA seq reads >> (Illumina, >> 100 bp, single end). I have been following the pipeline from the UCR >> manuals<http: manuals.**bioinformatics.ucr.edu="" home="" **="">> ht-seq#TOC-Trimming-Low-**Quality-3-Tails- from-R<http: manuals.bioinformatics.ucr.edu="" home="" ht-seq#toc-trimming-="" low-quality-3-tails-from-r=""> >> > >> >> but >> I have run into an error I can't seem to resolve. >> >> library(ShortRead) >> reads <- readFastq("myFile.fastq") >> >> qualityCutoff <- 10 # remove read tails with quality lower than this >> >> seqs <- sread(reads) # get sequence list >> >> qual <- PhredQuality(quality(quality(**reads))) # get quality score list >> as >> PhredQuality >> >> length(qual) #my length is positive >>> >> [1] 39145018 >> >> myqual_mat <- matrix(charToRaw(as.character(**unlist(qual))), >>> >> nrow=length(qual), byrow=TRUE) >> > > I'm guessing that what's happening is that as.character(unlist(qual)) is > making a very large vector sum(width(qual)), larger than can be represented > in (your) version of R (the output of sessionInfo() would be helpful...). > > But I don't think you need to go this route; have you looked at > ShortRead::trimEnds & friends? For instance, after running > > exmample(trimEnds) > > I have an object 'rfq' with some qualities > > > quality(rfq) > class: SFastqQuality > quality: > A BStringSet instance of length 256 > width seq > [1] 36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]**VCHVMPLAS > [2] 36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][**PZPICCK > [3] 36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]**TMJEUXEFLA > [4] 36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]**RJRZTQLOA > [5] 36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]**MJUJVLSS > ... ... ... > [252] 36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]**MJQNSAOC > [253] 36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]**VTVVRVMSM > [254] 36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]**YYVVTSZUOOHH > [255] 36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[**OVXEJSJ > [256] 36 ]]]]]]]]]]]Y]]T]T]]]]**TRYVMEVVRSRHHNH > > And I can simultaneously trim reads and sequences with > > trimEnds(rfq, "V") > > which you can see in action as > > > quality(trimEnds(rfq, "V")) > class: SFastqQuality > quality: > A BStringSet instance of length 256 > width seq > [1] 27 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]] > [2] 31 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][**PZ > [3] 32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]**TMJEUX > [4] 31 ]]]]]]]]]]]]]]]]]]]]]]T]]]]**RJRZ > [5] 28 ]]]]]]]]]]]]]]]]]T]]]]]]]]]] > ... ... ... > [252] 28 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR] > [253] 27 ]]]]]]]]]]]]]]]]]]]]]]]Y]]] > [254] 31 ]]]]Y]Y]]]]]]]OYY]]]Y]]]**YYVVTSZ > [255] 32 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[**OVX > [256] 24 ]]]]]]]]]]]Y]]T]T]]]]TRY > > The following might be helpful to transform a large file -- > > setMethod(trimEnds, "character", > function(object, a, left = TRUE, right = TRUE, > relation = c("<=", "=="), ..., > destination, yieldSize=1000000, ranges = FALSE) > { > if (missing('destination')) > stop("'destination' required") > strm <- FastqStreamer(object, yieldSize) > tot <- nNuc <- nTrimNuc <- 0L > while (length(fq <- yield(strm))) { > tot <- tot + length(fq) > nNuc <- nNuc + sum(width(fq)) > fq <- trimEnds(fq, a, left, right, relation, ..., ranges=ranges) > nTrimNuc <- nTrimNuc + sum(width(fq)) > writeFastq(fq, destination, "a") > } > list(destination=destination, > c(TotalReads=tot, TotalNucleotides=nNuc, > TrimmedNucleotides=nNuc - nTrimNuc)) > }) > > provide the first argument to trimEnds as a simple character vector, and > provide a 'destination' file name. For example > > ## just a convenient path to a fastq file > fl <- file.path(analysisPath(sp), "s_1_sequence.txt") > trimEnds(fl, "V", destination=tempfile()) > > A variation of this will be added to the 'devel' version of ShortRead, so > if there are suggestions for improvements please let me know. > > Martin > > > > > Error in .Call(.NAME, ..., PACKAGE = PACKAGE) : >> negative length vectors are not allowed >> >> as(qual, matrix) gives the same error, and as.matrix(qual) does as well, >> not sure how those two commands are different, but tried anyway. >> >> If I run the exact same commands, instead of the full 39145018 reads, I >> use >> the first 100, it works like a charm. I am working on a linux cluster, if >> that is important >> >> Does anyone have an idea? >> >> Cheers! >> >> > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > -- Sam McInturf [[alternative HTML version deleted]]

ADD REPLY • link 10.9 years ago Sam McInturf ▴ 300

0

Entering edit mode

A few follow up questions. I wan to trim my reads to keep everything up until the first base with a quality below 20. Is trimEnds() the right function for this, or is trimTail(k = 1, a = "see below") the way to go? In your reply, you used "V" as the cut off, what is that Q score? My reads use an offset of 33 for the qual scores, so ! (ascii = 33) is the lowest score? Is that consistent with what ShortRead is doing? I would like a cutoff of 20, so 33+20 = 53, which is "5", is this what I should be telling trimEnds? Best On Fri, May 17, 2013 at 1:38 PM, Sam McInturf <smcinturf@gmail.com> wrote: > Martin, > I was unaware of the trimEnds() function, it worked wonderfully without > having to use your modified trimEnds(). I imagine it is because the > cluster has a lot of memory available. I ran it a second time using the > modification, but it was much slower. > > Thanks again! > > > On Fri, May 17, 2013 at 12:20 PM, Martin Morgan <mtmorgan@fhcrc.org>wrote: > >> On 05/17/2013 09:19 AM, Sam McInturf wrote: >> >>> Hello everyone, >>> I am trying to do some quality trimming on some RNA seq reads >>> (Illumina, >>> 100 bp, single end). I have been following the pipeline from the UCR >>> manuals<http: manuals.**bioinformatics.ucr.edu="" home="" **="">>> ht-seq#TOC-Trimming-Low-**Quality-3-Tails- from-R<http: manuals.bioinformatics.ucr.edu="" home="" ht-seq#toc-trimming-="" low-quality-3-tails-from-r=""> >>> > >>> >>> but >>> I have run into an error I can't seem to resolve. >>> >>> library(ShortRead) >>> reads <- readFastq("myFile.fastq") >>> >>> qualityCutoff <- 10 # remove read tails with quality lower than this >>> >>> seqs <- sread(reads) # get sequence list >>> >>> qual <- PhredQuality(quality(quality(**reads))) # get quality score >>> list as >>> PhredQuality >>> >>> length(qual) #my length is positive >>>> >>> [1] 39145018 >>> >>> myqual_mat <- matrix(charToRaw(as.character(**unlist(qual))), >>>> >>> nrow=length(qual), byrow=TRUE) >>> >> >> I'm guessing that what's happening is that as.character(unlist(qual)) is >> making a very large vector sum(width(qual)), larger than can be represented >> in (your) version of R (the output of sessionInfo() would be helpful...). >> >> But I don't think you need to go this route; have you looked at >> ShortRead::trimEnds & friends? For instance, after running >> >> exmample(trimEnds) >> >> I have an object 'rfq' with some qualities >> >> > quality(rfq) >> class: SFastqQuality >> quality: >> A BStringSet instance of length 256 >> width seq >> [1] 36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]**VCHVMPLAS >> [2] 36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][**PZPICCK >> [3] 36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]**TMJEUXEFLA >> [4] 36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]**RJRZTQLOA >> [5] 36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]**MJUJVLSS >> ... ... ... >> [252] 36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]**MJQNSAOC >> [253] 36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]**VTVVRVMSM >> [254] 36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]**YYVVTSZUOOHH >> [255] 36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[**OVXEJSJ >> [256] 36 ]]]]]]]]]]]Y]]T]T]]]]**TRYVMEVVRSRHHNH >> >> And I can simultaneously trim reads and sequences with >> >> trimEnds(rfq, "V") >> >> which you can see in action as >> >> > quality(trimEnds(rfq, "V")) >> class: SFastqQuality >> quality: >> A BStringSet instance of length 256 >> width seq >> [1] 27 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]] >> [2] 31 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][**PZ >> [3] 32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]**TMJEUX >> [4] 31 ]]]]]]]]]]]]]]]]]]]]]]T]]]]**RJRZ >> [5] 28 ]]]]]]]]]]]]]]]]]T]]]]]]]]]] >> ... ... ... >> [252] 28 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR] >> [253] 27 ]]]]]]]]]]]]]]]]]]]]]]]Y]]] >> [254] 31 ]]]]Y]Y]]]]]]]OYY]]]Y]]]**YYVVTSZ >> [255] 32 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[**OVX >> [256] 24 ]]]]]]]]]]]Y]]T]T]]]]TRY >> >> The following might be helpful to transform a large file -- >> >> setMethod(trimEnds, "character", >> function(object, a, left = TRUE, right = TRUE, >> relation = c("<=", "=="), ..., >> destination, yieldSize=1000000, ranges = FALSE) >> { >> if (missing('destination')) >> stop("'destination' required") >> strm <- FastqStreamer(object, yieldSize) >> tot <- nNuc <- nTrimNuc <- 0L >> while (length(fq <- yield(strm))) { >> tot <- tot + length(fq) >> nNuc <- nNuc + sum(width(fq)) >> fq <- trimEnds(fq, a, left, right, relation, ..., ranges=ranges) >> nTrimNuc <- nTrimNuc + sum(width(fq)) >> writeFastq(fq, destination, "a") >> } >> list(destination=destination, >> c(TotalReads=tot, TotalNucleotides=nNuc, >> TrimmedNucleotides=nNuc - nTrimNuc)) >> }) >> >> provide the first argument to trimEnds as a simple character vector, and >> provide a 'destination' file name. For example >> >> ## just a convenient path to a fastq file >> fl <- file.path(analysisPath(sp), "s_1_sequence.txt") >> trimEnds(fl, "V", destination=tempfile()) >> >> A variation of this will be added to the 'devel' version of ShortRead, so >> if there are suggestions for improvements please let me know. >> >> Martin >> >> >> >> >> Error in .Call(.NAME, ..., PACKAGE = PACKAGE) : >>> negative length vectors are not allowed >>> >>> as(qual, matrix) gives the same error, and as.matrix(qual) does as well, >>> not sure how those two commands are different, but tried anyway. >>> >>> If I run the exact same commands, instead of the full 39145018 reads, I >>> use >>> the first 100, it works like a charm. I am working on a linux cluster, >>> if >>> that is important >>> >>> Does anyone have an idea? >>> >>> Cheers! >>> >>> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 >> > > > > -- > Sam McInturf > -- Sam McInturf [[alternative HTML version deleted]]

ADD REPLY • link 10.9 years ago Sam McInturf ▴ 300

0

Entering edit mode

On 05/17/2013 12:55 PM, Sam McInturf wrote: > A few follow up questions. > I wan to trim my reads to keep everything up until the first base with a quality > below 20. Is trimEnds() the right function for this, or is trimTail(k = 1, a = > "see below") the way to go? > > In your reply, you used "V" as the cut off, what is that Q score? My reads use > an offset of 33 for the qual scores, so ! (ascii = 33) is the lowest score? Is > that consistent with what ShortRead is doing? I would like a cutoff of 20, so > 33+20 = 53, which is "5", is this what I should be telling trimEnds? Sorry not to have answered sooner. Yes, it sounds like trimTail is what you're after. The letter / quality score encoding depends on class(quality(reads)), with FastqQuality: ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ; < = > ? @ A B C D E F G H I J 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 SFastqQuality: ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 U V W X Y Z [ \\ ] ^ _ ` a b c d e f g h i 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 In response to your question, I made it so that in the 'devel' version of Bioconductor / ShortRead encoding(FastqQuality()) returns this information. I also added a function filterFastq that allows space-efficient filtering from one fastq file to another; there are also trimTails and friends that use this to transform a character vector of 'object's to new files given by the argument 'destinations'. Again this is available in the devel branch of ShortRead. Thanks for the questions... Martin > > Best > > > On Fri, May 17, 2013 at 1:38 PM, Sam McInturf <smcinturf at="" gmail.com=""> <mailto:smcinturf at="" gmail.com="">> wrote: > > Martin, > I was unaware of the trimEnds() function, it worked wonderfully without > having to use your modified trimEnds(). I imagine it is because the cluster > has a lot of memory available. I ran it a second time using the > modification, but it was much slower. > > Thanks again! > > > On Fri, May 17, 2013 at 12:20 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> <mailto:mtmorgan at="" fhcrc.org="">> wrote: > > On 05/17/2013 09:19 AM, Sam McInturf wrote: > > Hello everyone, > I am trying to do some quality trimming on some RNA seq reads > (Illumina, > 100 bp, single end). I have been following the pipeline from the UCR > manuals<http: manuals.__bioinformatics.ucr.edu="" home="" __ht-seq#toc-trimming-low-__quality-3-tails-from-r=""> <http: manuals.bioinformatics.ucr.edu="" home="" ht-seq#toc-="" trimming-low-quality-3-tails-from-r="">> > > but > I have run into an error I can't seem to resolve. > > library(ShortRead) > reads <- readFastq("myFile.fastq") > > qualityCutoff <- 10 # remove read tails with quality lower than this > > seqs <- sread(reads) # get sequence list > > qual <- PhredQuality(quality(quality(__reads))) # get quality score > list as > PhredQuality > > length(qual) #my length is positive > > [1] 39145018 > > myqual_mat <- matrix(charToRaw(as.character(__unlist(qual))), > > nrow=length(qual), byrow=TRUE) > > > I'm guessing that what's happening is that as.character(unlist(qual)) is > making a very large vector sum(width(qual)), larger than can be > represented in (your) version of R (the output of sessionInfo() would be > helpful...). > > But I don't think you need to go this route; have you looked at > ShortRead::trimEnds & friends? For instance, after running > > exmample(trimEnds) > > I have an object 'rfq' with some qualities > > > quality(rfq) > class: SFastqQuality > quality: > A BStringSet instance of length 256 > width seq > [1] 36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]__VCHVMPLAS > [2] 36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][__PZPICCK > [3] 36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]__TMJEUXEFLA > [4] 36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]__RJRZTQLOA > [5] 36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]__MJUJVLSS > ... ... ... > [252] 36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]__MJQNSAOC > [253] 36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]__VTVVRVMSM > [254] 36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]__YYVVTSZUOOHH > [255] 36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[__OVXEJSJ > [256] 36 ]]]]]]]]]]]Y]]T]T]]]]__TRYVMEVVRSRHHNH > > And I can simultaneously trim reads and sequences with > > trimEnds(rfq, "V") > > which you can see in action as > > > quality(trimEnds(rfq, "V")) > class: SFastqQuality > quality: > A BStringSet instance of length 256 > width seq > [1] 27 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]] > [2] 31 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][__PZ > [3] 32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]__TMJEUX > [4] 31 ]]]]]]]]]]]]]]]]]]]]]]T]]]]__RJRZ > [5] 28 ]]]]]]]]]]]]]]]]]T]]]]]]]]]] > ... ... ... > [252] 28 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR] > [253] 27 ]]]]]]]]]]]]]]]]]]]]]]]Y]]] > [254] 31 ]]]]Y]Y]]]]]]]OYY]]]Y]]]__YYVVTSZ > [255] 32 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[__OVX > [256] 24 ]]]]]]]]]]]Y]]T]T]]]]TRY > > The following might be helpful to transform a large file -- > > setMethod(trimEnds, "character", > function(object, a, left = TRUE, right = TRUE, > relation = c("<=", "=="), ..., > destination, yieldSize=1000000, ranges = FALSE) > { > if (missing('destination')) > stop("'destination' required") > strm <- FastqStreamer(object, yieldSize) > tot <- nNuc <- nTrimNuc <- 0L > while (length(fq <- yield(strm))) { > tot <- tot + length(fq) > nNuc <- nNuc + sum(width(fq)) > fq <- trimEnds(fq, a, left, right, relation, ..., ranges=ranges) > nTrimNuc <- nTrimNuc + sum(width(fq)) > writeFastq(fq, destination, "a") > } > list(destination=destination, > c(TotalReads=tot, TotalNucleotides=nNuc, > TrimmedNucleotides=nNuc - nTrimNuc)) > }) > > provide the first argument to trimEnds as a simple character vector, and > provide a 'destination' file name. For example > > ## just a convenient path to a fastq file > fl <- file.path(analysisPath(sp), "s_1_sequence.txt") > trimEnds(fl, "V", destination=tempfile()) > > A variation of this will be added to the 'devel' version of ShortRead, > so if there are suggestions for improvements please let me know. > > Martin > > > > > Error in .Call(.NAME, ..., PACKAGE = PACKAGE) : > negative length vectors are not allowed > > as(qual, matrix) gives the same error, and as.matrix(qual) does as well, > not sure how those two commands are different, but tried anyway. > > If I run the exact same commands, instead of the full 39145018 > reads, I use > the first 100, it works like a charm. I am working on a linux > cluster, if > that is important > > Does anyone have an idea? > > Cheers! > > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 <tel:%28206%29%20667-2793> > > > > > -- > Sam McInturf > > > > > -- > Sam McInturf -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 10.9 years ago Martin Morgan 25k

0

Entering edit mode

Hi Sam, I picked up on your original email that you are loading in an entire Fastq file into memory, you might want to also look at example(FastqStreamer). Any operation that does not require information between segments or reads can be iterated through saving memory, you can then adjust how many reads to process at a time and append the trimmed/filtered output to file. require(ShortRead) sp <- SolexaPath(system.file('extdata', package='ShortRead')) fl <- file.path(analysisPath(sp), "s_1_sequence.txt") ## iterating over an entire file 50 reads at a time f <- FastqStreamer(fl, 50) fileName <- tempfile() ## Remove if file already exists for append mode with writeFastq() unlink(fileName) i <- 1 while (length(fq <- yield(f))) { ## do work here cat("[ Current iteration:", i, "with", length(fq), " reads ]\n") ## Trimming using trimEnds or trimTails trimmed.fq <- trimTails(fq, k=1, a="T") print(sread(trimmed.fq)) ## Using append mode with writeFastq writeFastq(trimmed.fq, fileName, mode="a") i <- i+1 } close(f) ## Checking output file system(paste("head -n10", fileName)) Marcus On Wed, May 22, 2013 at 1:44 AM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > On 05/17/2013 12:55 PM, Sam McInturf wrote: > >> A few follow up questions. >> I wan to trim my reads to keep everything up until the first base with a >> quality >> below 20. Is trimEnds() the right function for this, or is trimTail(k = >> 1, a = >> "see below") the way to go? >> >> In your reply, you used "V" as the cut off, what is that Q score? My >> reads use >> an offset of 33 for the qual scores, so ! (ascii = 33) is the lowest >> score? Is >> that consistent with what ShortRead is doing? I would like a cutoff of >> 20, so >> 33+20 = 53, which is "5", is this what I should be telling trimEnds? >> > > Sorry not to have answered sooner. Yes, it sounds like trimTail is what > you're after. The letter / quality score encoding depends on > class(quality(reads)), with > > FastqQuality: > > ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 > : > 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 > 24 25 > > ; < = > ? @ A B C D E F G H I J > 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 > > SFastqQuality: > > ; < = > ? @ A B C D E F G H I J K L M N O P Q R S > T > -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 > 20 > > U V W X Y Z [ \\ ] ^ _ ` a b c d e f g h i > 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 > > In response to your question, I made it so that in the 'devel' version of > Bioconductor / ShortRead > > encoding(FastqQuality()) > > returns this information. I also added a function filterFastq that allows > space-efficient filtering from one fastq file to another; there are also > trimTails and friends that use this to transform a character vector of > 'object's to new files given by the argument 'destinations'. Again this is > available in the devel branch of ShortRead. > > Thanks for the questions... > > Martin > > > >> Best >> >> >> On Fri, May 17, 2013 at 1:38 PM, Sam McInturf <smcinturf@gmail.com>> <mailto:smcinturf@gmail.com>> wrote: >> >> Martin, >> I was unaware of the trimEnds() function, it worked wonderfully >> without >> having to use your modified trimEnds(). I imagine it is because the >> cluster >> has a lot of memory available. I ran it a second time using the >> modification, but it was much slower. >> >> Thanks again! >> >> >> On Fri, May 17, 2013 at 12:20 PM, Martin Morgan <mtmorgan@fhcrc.org>> <mailto:mtmorgan@fhcrc.org>> wrote: >> >> On 05/17/2013 09:19 AM, Sam McInturf wrote: >> >> Hello everyone, >> I am trying to do some quality trimming on some RNA seq >> reads >> (Illumina, >> 100 bp, single end). I have been following the pipeline from >> the UCR >> manuals<http: manuals.__bioin**formatics.ucr.edu="" home="" __ht-*="">> *seq#TOC-Trimming-Low-__**Quality-3-Tails- from-R<http: bioinformatics.ucr.edu="" home="" __ht-seq#toc-trimming-low-="" __quality-3-tails-from-r=""> >> >> <http: manuals.**bioinformatics.ucr.edu="" home="" **="">> ht-seq#TOC-Trimming-Low-**Quality-3-Tails- from-R<http: manuals.bioinformatics.ucr.edu="" home="" ht-seq#toc-trimming-="" low-quality-3-tails-from-r=""> >> >> >> >> but >> I have run into an error I can't seem to resolve. >> >> library(ShortRead) >> reads <- readFastq("myFile.fastq") >> >> qualityCutoff <- 10 # remove read tails with quality lower >> than this >> >> seqs <- sread(reads) # get sequence list >> >> qual <- PhredQuality(quality(quality(_**_reads))) # get >> quality score >> >> list as >> PhredQuality >> >> length(qual) #my length is positive >> >> [1] 39145018 >> >> myqual_mat <- matrix(charToRaw(as.character(** >> __unlist(qual))), >> >> >> nrow=length(qual), byrow=TRUE) >> >> >> I'm guessing that what's happening is that >> as.character(unlist(qual)) is >> making a very large vector sum(width(qual)), larger than can be >> represented in (your) version of R (the output of sessionInfo() >> would be >> helpful...). >> >> But I don't think you need to go this route; have you looked at >> ShortRead::trimEnds & friends? For instance, after running >> >> exmample(trimEnds) >> >> I have an object 'rfq' with some qualities >> >> > quality(rfq) >> class: SFastqQuality >> quality: >> A BStringSet instance of length 256 >> width seq >> [1] 36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]__**VCHVMPLAS >> [2] 36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][_**_PZPICCK >> [3] 36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]__**TMJEUXEFLA >> [4] 36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]__**RJRZTQLOA >> [5] 36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]__**MJUJVLSS >> ... ... ... >> [252] 36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]__**MJQNSAOC >> [253] 36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]__**VTVVRVMSM >> [254] 36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]__**YYVVTSZUOOHH >> [255] 36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[_**_OVXEJSJ >> [256] 36 ]]]]]]]]]]]Y]]T]T]]]]__**TRYVMEVVRSRHHNH >> >> >> And I can simultaneously trim reads and sequences with >> >> trimEnds(rfq, "V") >> >> which you can see in action as >> >> > quality(trimEnds(rfq, "V")) >> class: SFastqQuality >> quality: >> A BStringSet instance of length 256 >> width seq >> [1] 27 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]] >> [2] 31 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][_**_PZ >> [3] 32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]__**TMJEUX >> [4] 31 ]]]]]]]]]]]]]]]]]]]]]]T]]]]__**RJRZ >> >> [5] 28 ]]]]]]]]]]]]]]]]]T]]]]]]]]]] >> ... ... ... >> [252] 28 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR] >> [253] 27 ]]]]]]]]]]]]]]]]]]]]]]]Y]]] >> [254] 31 ]]]]Y]Y]]]]]]]OYY]]]Y]]]__**YYVVTSZ >> [255] 32 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[_**_OVX >> >> [256] 24 ]]]]]]]]]]]Y]]T]T]]]]TRY >> >> The following might be helpful to transform a large file -- >> >> setMethod(trimEnds, "character", >> function(object, a, left = TRUE, right = TRUE, >> relation = c("<=", "=="), ..., >> destination, yieldSize=1000000, ranges = FALSE) >> { >> if (missing('destination')) >> stop("'destination' required") >> strm <- FastqStreamer(object, yieldSize) >> tot <- nNuc <- nTrimNuc <- 0L >> while (length(fq <- yield(strm))) { >> tot <- tot + length(fq) >> nNuc <- nNuc + sum(width(fq)) >> fq <- trimEnds(fq, a, left, right, relation, ..., >> ranges=ranges) >> nTrimNuc <- nTrimNuc + sum(width(fq)) >> writeFastq(fq, destination, "a") >> } >> list(destination=destination, >> c(TotalReads=tot, TotalNucleotides=nNuc, >> TrimmedNucleotides=nNuc - nTrimNuc)) >> }) >> >> provide the first argument to trimEnds as a simple character >> vector, and >> provide a 'destination' file name. For example >> >> ## just a convenient path to a fastq file >> fl <- file.path(analysisPath(sp), "s_1_sequence.txt") >> trimEnds(fl, "V", destination=tempfile()) >> >> A variation of this will be added to the 'devel' version of >> ShortRead, >> so if there are suggestions for improvements please let me know. >> >> Martin >> >> >> >> >> Error in .Call(.NAME, ..., PACKAGE = PACKAGE) : >> negative length vectors are not allowed >> >> as(qual, matrix) gives the same error, and as.matrix(qual) >> does as well, >> not sure how those two commands are different, but tried >> anyway. >> >> If I run the exact same commands, instead of the full 39145018 >> reads, I use >> the first 100, it works like a charm. I am working on a linux >> cluster, if >> that is important >> >> Does anyone have an idea? >> >> Cheers! >> >> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 <tel:%28206%29%20667-2793> >> >> >> >> >> >> -- >> Sam McInturf >> >> >> >> >> -- >> Sam McInturf >> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]

ADD REPLY • link 10.9 years ago Marcus Davy ▴ 390