Possible bug in ShortRead pileup function

0

Entering edit mode

David Rossell ▴ 100

@david-rossell-3353

Last seen 9.6 years ago

Hi, the pileup function seems to stop doing the pileup after the first few sequences. For instance, if I test the function with my first 10 sequences it works fine: > start <- position(aln1)[sel][1:10]; fraglength <- width(aln1)[sel][1:10]; dir1 <- str1[1:10] > puprova <- pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"], dir=dir1) > sum(puprova) [1] 193 I manually checked that the result is correct. Then I repeat the process with the first 50 sequences > sel <- chromosome(aln1)=="chr2L" > str1 <- factor(strand(aln1)[sel],levels=c('-','+'),labels=c('-','+')) > start <- position(aln1)[sel][1:50]; fraglength <- width(aln1)[sel][1:50]; dir1 <- str1[1:50] > puprova <- pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"], dir=dir1) > sum(puprova) [1] 754 So far so good, but now if I repeat the process with the first 100 sequences the pileup has not incorporated any new information. > start <- position(aln1)[sel][1:100]; fraglength <- width(aln1)[sel][1:100]; dir1 <- str1[1:100] > puprova <- pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"], dir=dir1) > sum(puprova) [1] 754 Same if I use the full chromosome chr2L > start <- position(aln1)[sel]; fraglength <- width(aln1)[sel]; dir1 <- str1 > puprova <- pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"], dir=dir1) > sum(puprova) [1] 754 As you can see below there are many more sequences in chromosome chr2L > table(chromosome(aln1)) chr2L chr2LHet chr2R chr2RHet chr3L chr3LHet chr3R chr3RHet 903343 526 978985 3898 200033 8135 639261 6096 chr4 chrM chrU chrUextra chrX chrXHet chrYHet 6246 3241 11115 15711 106300 1028 111 There's also a minor bug that returns an error if the argument 'dir' to 'pileup' is a factor with levels(dir)=c('+','-'), the routine requires the levels to be ('-','+') in this exact order. Simple fixing the levels(dir)== c('-','+') for an levels(dir) %in% c('-','+') should fix the issue. Also, notice that readAligned typically returns a factor with levels ('-','+','*') and this causes an error in pileup. Thank you, take care, David [[alternative HTML version deleted]]

PROcess PROcess • 1.1k views

ADD COMMENT • link updated 15.1 years ago by Martin Morgan 25k • written 15.1 years ago by David Rossell ▴ 100

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 5 days ago

United States

Hi again -- David Rossell <david.rossell at="" irbbarcelona.org=""> writes: > Hi, the pileup function seems to stop doing the pileup after the first few > sequences. For instance, if I test the function with my first 10 sequences > it works fine: > >> start <- position(aln1)[sel][1:10]; fraglength <- width(aln1)[sel][1:10]; > dir1 <- str1[1:10] >> puprova <- > pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L" ],dir=dir1) >> sum(puprova) > [1] 193 > > I manually checked that the result is correct. Then I repeat the process > with the first 50 sequences > >> sel <- chromosome(aln1)=="chr2L" >> str1 <- factor(strand(aln1)[sel],levels=c('-','+'),labels=c('-','+')) >> start <- position(aln1)[sel][1:50]; fraglength <- width(aln1)[sel][1:50]; > dir1 <- str1[1:50] >> puprova <- > pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L" ],dir=dir1) >> sum(puprova) > [1] 754 > > So far so good, but now if I repeat the process with the first 100 sequences > the pileup has not incorporated any new information. > >> start <- position(aln1)[sel][1:100]; fraglength <- > width(aln1)[sel][1:100]; dir1 <- str1[1:100] >> puprova <- > pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L" ],dir=dir1) >> sum(puprova) > [1] 754 > > Same if I use the full chromosome chr2L > >> start <- position(aln1)[sel]; fraglength <- width(aln1)[sel]; dir1 <- str1 >> puprova <- > pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L" ],dir=dir1) >> sum(puprova) > [1] 754 If I run example(readAligned), and then do the following (to get some data that is reproducible, though not meaningful) > example(readAligned) [snip] > aln <- readAligned(ap, "s_2_export.txt", "SolexaExport") > aln1 = aln[!is.na(position(aln))] I end up with 406 reads, which pile up to > pup = pileup(position(aln1), 35, max(position(aln1))+35, + dir=factor(strand(aln), levels=c("-", "+"))) and > sum(pup) / width(aln2) [1] 406 so something non-trivial is going on. Can you provide a reproducible example (privately or otherwise) and your sessionInfo()? also... > As you can see below there are many more sequences in chromosome chr2L > >> table(chromosome(aln1)) > > chr2L chr2LHet chr2R chr2RHet chr3L chr3LHet chr3R > chr3RHet > 903343 526 978985 3898 200033 8135 639261 > 6096 > chr4 chrM chrU chrUextra chrX chrXHet chrYHet > 6246 3241 11115 15711 106300 1028 111 > > There's also a minor bug that returns an error if the argument 'dir' to > 'pileup' is a factor with levels(dir)=c('+','-'), the routine requires the > levels to be ('-','+') in this exact order. Simple fixing the levels(dir)== > c('-','+') for an levels(dir) %in% c('-','+') should fix the issue. Also, > notice that readAligned typically returns a factor with levels ('-','+','*') > and this causes an error in pileup. The code requires factors with levels in the specified order, but the development (soon to be released) version has tidied this up. One small facility is a method for strand() that takes a character vector and creates a factor with the levels ordered consistently. In general the code in ShortRead and supporting packages has changed a lot, and using the 'devel' version of R and Bioconductor for these applications is a good idea. Here's my sessionInfo() > sessionInfo() R version 2.8.1 Patched (2009-02-25 r48007) x86_64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US .UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_N AME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTI FICATION=C attached base packages: [1] tools stats graphics utils datasets grDevices methods [8] base other attached packages: [1] ShortRead_1.0.7 lattice_0.17-15 Biobase_2.2.2 Biostrings_2.10.21 [5] IRanges_1.0.14 loaded via a namespace (and not attached): [1] grid_2.8.1 Matrix_0.999375-22 Martin > Thank you, take care, > > David > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793

ADD COMMENT • link 15.1 years ago Martin Morgan 25k

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 5 days ago

United States

David Rossell <david.rossell at="" irbbarcelona.org=""> writes: > It turns out the error is due to the width method for "AlignedRead" objects. > "width" works fine on DNAStringSet objects but not on "AlignedRead" objects, > as it returns NA after the first few reads (see below) > >> width(aln1)[1:40] > [1] 25 22 20 19 18 21 17 16 12 23 14 13 40 39 34 38 37 35 15 26 33 29 28 27 > 24 > [26] 30 32 31 36 NA NA NA NA NA NA NA NA NA NA NA width(aln) is meant to return the unique widths, and you're indexing past the end of those. The documentation is ambiguous about what is really being returned. >> width(aln1 at sread)[1:40] > [1] 25 22 20 20 19 20 19 18 19 21 21 19 19 19 19 19 19 20 20 20 21 21 20 21 > 19 > [26] 21 19 20 20 21 17 16 19 19 19 20 12 12 12 12 use the accessor width(sread(aln1)) Martin > David > > > On Fri, Mar 20, 2009 at 11:16 AM, David Rossell < > david.rossell at irbbarcelona.org> wrote: > >> Hi, the pileup function seems to stop doing the pileup after the first few >> sequences. For instance, if I test the function with my first 10 sequences >> it works fine: >> >> > start <- position(aln1)[sel][1:10]; fraglength <- width(aln1)[sel][1:10]; >> dir1 <- str1[1:10] >> > puprova <- >> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L "],dir=dir1) >> > sum(puprova) >> [1] 193 >> >> I manually checked that the result is correct. Then I repeat the process >> with the first 50 sequences >> >> > sel <- chromosome(aln1)=="chr2L" >> > str1 <- factor(strand(aln1)[sel],levels=c('-','+'),labels=c('-','+')) >> > start <- position(aln1)[sel][1:50]; fraglength <- width(aln1)[sel][1:50]; >> dir1 <- str1[1:50] >> > puprova <- >> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L "],dir=dir1) >> > sum(puprova) >> [1] 754 >> >> So far so good, but now if I repeat the process with the first 100 >> sequences the pileup has not incorporated any new information. >> >> > start <- position(aln1)[sel][1:100]; fraglength <- >> width(aln1)[sel][1:100]; dir1 <- str1[1:100] >> > puprova <- >> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L "],dir=dir1) >> > sum(puprova) >> [1] 754 >> >> Same if I use the full chromosome chr2L >> >> > start <- position(aln1)[sel]; fraglength <- width(aln1)[sel]; dir1 <- >> str1 >> > puprova <- >> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L "],dir=dir1) >> > sum(puprova) >> [1] 754 >> >> As you can see below there are many more sequences in chromosome chr2L >> >> > table(chromosome(aln1)) >> >> chr2L chr2LHet chr2R chr2RHet chr3L chr3LHet chr3R >> chr3RHet >> 903343 526 978985 3898 200033 8135 639261 >> 6096 >> chr4 chrM chrU chrUextra chrX chrXHet chrYHet >> 6246 3241 11115 15711 106300 1028 111 >> >> There's also a minor bug that returns an error if the argument 'dir' to >> 'pileup' is a factor with levels(dir)=c('+','-'), the routine requires the >> levels to be ('-','+') in this exact order. Simple fixing the levels(dir)== >> c('-','+') for an levels(dir) %in% c('-','+') should fix the issue. Also, >> notice that readAligned typically returns a factor with levels ('-','+','*') >> and this causes an error in pileup. >> >> Thank you, take care, >> >> David >> >> >> -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793

ADD COMMENT • link 15.1 years ago Martin Morgan 25k

0

Entering edit mode

David Rossell ▴ 100

@david-rossell-3353

Last seen 9.6 years ago

It turns out the error is due to the width method for "AlignedRead" objects. "width" works fine on DNAStringSet objects but not on "AlignedRead" objects, as it returns NA after the first few reads (see below) > width(aln1)[1:40] [1] 25 22 20 19 18 21 17 16 12 23 14 13 40 39 34 38 37 35 15 26 33 29 28 27 24 [26] 30 32 31 36 NA NA NA NA NA NA NA NA NA NA NA > width(aln1@sread)[1:40] [1] 25 22 20 20 19 20 19 18 19 21 21 19 19 19 19 19 19 20 20 20 21 21 20 21 19 [26] 21 19 20 20 21 17 16 19 19 19 20 12 12 12 12 David On Fri, Mar 20, 2009 at 11:16 AM, David Rossell < david.rossell@irbbarcelona.org> wrote: > Hi, the pileup function seems to stop doing the pileup after the first few > sequences. For instance, if I test the function with my first 10 sequences > it works fine: > > > start <- position(aln1)[sel][1:10]; fraglength <- width(aln1)[sel][1:10]; > dir1 <- str1[1:10] > > puprova <- > pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L" ],dir=dir1) > > sum(puprova) > [1] 193 > > I manually checked that the result is correct. Then I repeat the process > with the first 50 sequences > > > sel <- chromosome(aln1)=="chr2L" > > str1 <- factor(strand(aln1)[sel],levels=c('-','+'),labels=c('-','+')) > > start <- position(aln1)[sel][1:50]; fraglength <- width(aln1)[sel][1:50]; > dir1 <- str1[1:50] > > puprova <- > pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L" ],dir=dir1) > > sum(puprova) > [1] 754 > > So far so good, but now if I repeat the process with the first 100 > sequences the pileup has not incorporated any new information. > > > start <- position(aln1)[sel][1:100]; fraglength <- > width(aln1)[sel][1:100]; dir1 <- str1[1:100] > > puprova <- > pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L" ],dir=dir1) > > sum(puprova) > [1] 754 > > Same if I use the full chromosome chr2L > > > start <- position(aln1)[sel]; fraglength <- width(aln1)[sel]; dir1 <- > str1 > > puprova <- > pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L" ],dir=dir1) > > sum(puprova) > [1] 754 > > As you can see below there are many more sequences in chromosome chr2L > > > table(chromosome(aln1)) > > chr2L chr2LHet chr2R chr2RHet chr3L chr3LHet chr3R > chr3RHet > 903343 526 978985 3898 200033 8135 639261 > 6096 > chr4 chrM chrU chrUextra chrX chrXHet chrYHet > 6246 3241 11115 15711 106300 1028 111 > > There's also a minor bug that returns an error if the argument 'dir' to > 'pileup' is a factor with levels(dir)=c('+','-'), the routine requires the > levels to be ('-','+') in this exact order. Simple fixing the levels(dir)== > c('-','+') for an levels(dir) %in% c('-','+') should fix the issue. Also, > notice that readAligned typically returns a factor with levels ('-','+','*') > and this causes an error in pileup. > > Thank you, take care, > > David > > > -- David Rossell Manager, Bioinformatics and Biostatistics unit IRB Barcelona Tel (+34) 93 402 0217 Fax (+34) 93 402 0257 [[alternative HTML version deleted]]

ADD COMMENT • link 15.1 years ago David Rossell ▴ 100

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 5 days ago

United States

Martin Morgan <mtmorgan at="" fhcrc.org=""> writes: > David Rossell <david.rossell at="" irbbarcelona.org=""> writes: > >> It turns out the error is due to the width method for "AlignedRead" objects. >> "width" works fine on DNAStringSet objects but not on "AlignedRead" objects, >> as it returns NA after the first few reads (see below) >> >>> width(aln1)[1:40] >> [1] 25 22 20 19 18 21 17 16 12 23 14 13 40 39 34 38 37 35 15 26 33 29 28 27 >> 24 >> [26] 30 32 31 36 NA NA NA NA NA NA NA NA NA NA NA > > width(aln) is meant to return the unique widths, and you're indexing > past the end of those. The documentation is ambiguous about what > is really being returned. For the record, the behavior of width() has been changed in the development (soon to be release!) version of the ShortRead package to return a vector of widths, with the length of the vector equal to the length of the object. Thanks for bringing this inconsistency to the fore. Martin >>> width(aln1 at sread)[1:40] >> [1] 25 22 20 20 19 20 19 18 19 21 21 19 19 19 19 19 19 20 20 20 21 21 20 21 >> 19 >> [26] 21 19 20 20 21 17 16 19 19 19 20 12 12 12 12 > > use the accessor width(sread(aln1)) > > Martin > >> David >> >> >> On Fri, Mar 20, 2009 at 11:16 AM, David Rossell < >> david.rossell at irbbarcelona.org> wrote: >> >>> Hi, the pileup function seems to stop doing the pileup after the first few >>> sequences. For instance, if I test the function with my first 10 sequences >>> it works fine: >>> >>> > start <- position(aln1)[sel][1:10]; fraglength <- width(aln1)[sel][1:10]; >>> dir1 <- str1[1:10] >>> > puprova <- >>> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2 L"],dir=dir1) >>> > sum(puprova) >>> [1] 193 >>> >>> I manually checked that the result is correct. Then I repeat the process >>> with the first 50 sequences >>> >>> > sel <- chromosome(aln1)=="chr2L" >>> > str1 <- factor(strand(aln1)[sel],levels=c('-','+'),labels=c('-','+')) >>> > start <- position(aln1)[sel][1:50]; fraglength <- width(aln1)[sel][1:50]; >>> dir1 <- str1[1:50] >>> > puprova <- >>> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2 L"],dir=dir1) >>> > sum(puprova) >>> [1] 754 >>> >>> So far so good, but now if I repeat the process with the first 100 >>> sequences the pileup has not incorporated any new information. >>> >>> > start <- position(aln1)[sel][1:100]; fraglength <- >>> width(aln1)[sel][1:100]; dir1 <- str1[1:100] >>> > puprova <- >>> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2 L"],dir=dir1) >>> > sum(puprova) >>> [1] 754 >>> >>> Same if I use the full chromosome chr2L >>> >>> > start <- position(aln1)[sel]; fraglength <- width(aln1)[sel]; dir1 <- >>> str1 >>> > puprova <- >>> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2 L"],dir=dir1) >>> > sum(puprova) >>> [1] 754 >>> >>> As you can see below there are many more sequences in chromosome chr2L >>> >>> > table(chromosome(aln1)) >>> >>> chr2L chr2LHet chr2R chr2RHet chr3L chr3LHet chr3R >>> chr3RHet >>> 903343 526 978985 3898 200033 8135 639261 >>> 6096 >>> chr4 chrM chrU chrUextra chrX chrXHet chrYHet >>> 6246 3241 11115 15711 106300 1028 111 >>> >>> There's also a minor bug that returns an error if the argument 'dir' to >>> 'pileup' is a factor with levels(dir)=c('+','-'), the routine requires the >>> levels to be ('-','+') in this exact order. Simple fixing the levels(dir)== >>> c('-','+') for an levels(dir) %in% c('-','+') should fix the issue. Also, >>> notice that readAligned typically returns a factor with levels ('-','+','*') >>> and this causes an error in pileup. >>> >>> Thank you, take care, >>> >>> David >>> >>> >>> -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793

ADD COMMENT • link 15.1 years ago Martin Morgan 25k

Login before adding your answer.