Question

ShortRead QA


Alex Gutteridge ▴ 650

@alex-gutteridge-2935

Last seen 9.6 years ago

United States

I'm dealing with some Solexa/Illumina data with ShortRead for the first time and had a couple of questions relating to QA:

1. Memory requirements: My data comprises 7 s_N_export.txt files. Each one comprises 10-20 million aligned reads. If I try to run qa() over the whole directory my machine rapidly grinds to a halt. Tackling each file individually keeps my machine running, but takes >1 hour for each one. The ShortRead vignette says evaluating a single lane can take 'several minutes', so I'm wondering if anyone can offer any clues as to why I'm struggling so much? The machine in question has 6GB of RAM - do I just need more?

2. Read distribution: The QA results I'm getting for the 'read distribution' section don't quite look like those presented in the example ShortRead Solexa QA report. My interpretation is that this is because my data is actually rather high quality, but I'd appreciate a second opinion.

To quote from the ShortRead QA report:

'Ideally, the cumulative proportion of reads will transition sharply from low to high. Portions to the left of the transition might correspond roughly to sequencing or sample processing errors, and correspond to reads that are represented relatively infrequently [...]. Portions to the right of the transition represent reads that are over-represented compared to expectation.'

Typically the read distribution plots I'm seeing look like this: http://dl.dropbox.com/u/419878/readOccurences.jpg

There is a sharp transition, but no portion to the left. I interpret this as a good sign: most of the reads are seen a small number of times (<10), and there are relatively few over-represented reads. Is there anything there that would worry more experienced heads?

-- Alex Gutteridge
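For readers unfamiliar with the plot being described, here is a minimal sketch of how a cumulative read-occurrence curve could be computed. It is a plain Python illustration based on my reading of the quoted report text, not ShortRead's own code; the function name `cumulative_read_proportion` and the toy reads are made up for the example.

```python
from collections import Counter


def cumulative_read_proportion(reads):
    """Return [(occurrences, cumulative proportion of reads)], sorted by occurrences.

    `reads` is a list of read sequences (strings). Illustrative only.
    """
    per_sequence = Counter(reads)      # sequence -> number of times it was seen
    by_occurrence = Counter()          # occurrence count -> total reads at that count
    for n in per_sequence.values():
        by_occurrence[n] += n
    total = len(reads)
    curve, running = [], 0
    for n in sorted(by_occurrence):
        running += by_occurrence[n]
        curve.append((n, running / total))
    return curve


# Toy data (invented): one singleton, two sequences seen twice, one seen five times.
toy_reads = ["ACGT"] * 1 + ["TTGA"] * 2 + ["GGCC"] * 2 + ["CATG"] * 5
curve = cumulative_read_proportion(toy_reads)
for occurrences, proportion in curve:
    print(occurrences, proportion)
```

In this toy case half of the reads come from sequences seen at most twice, and the remaining half from a single sequence seen five times, which would show up as the over-represented portion on the right of such a plot.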

Sequencing ShortRead • 1.3k views

updated 13.8 years ago by Martin Morgan 25k • written 13.8 years ago by Alex Gutteridge ▴ 650

score 0 · Answer 1 · 2010-07-22

Alex Gutteridge <alexg at ruggedtextile.com> writes:

> I'm dealing with some Solexa/Illumina data with ShortRead for the first
> time and had a couple of questions relating to QA:
>
> 1. Memory requirements: My data comprises 7 s_N_export.txt files. Each one
> comprises 10-20 million aligned reads. If I try to run qa() over the whole
> directory my machine rapidly grinds to a halt. Tackling each file
> individually keeps my machine running, but takes >1 hour for each one. The
> ShortRead vignette says evaluating a single lane can take 'several
> minutes', so I'm wondering if anyone can offer any clues as to why I'm
> struggling so much? The machine in question has 6GB of RAM - do I just need
> more?

It's total # of bases that'll be important, but if these are 'long' reads then yes, likely memory is limiting (we're hoping to take a better approach to qa and other input functions over the next release, though that doesn't help you at the moment).

> 2. Read distribution: The QA results I'm getting for the 'read
> distribution' section don't quite look like those presented in the example
> ShortRead Solexa QA report. My interpretation is that this is because my
> data is actually rather high quality, but I'd appreciate a second opinion.
>
> To quote from the ShortRead QA report:
>
> 'Ideally, the cumulative proportion of reads will transition sharply from
> low to high. Portions to the left of the transition might correspond
> roughly to sequencing or sample processing errors, and correspond to reads
> that are represented relatively infrequently [...]. Portions to the right
> of the transition represent reads that are over-represented compared to
> expectation.'
>
> Typically the read distribution plots I'm seeing look like this:
> http://dl.dropbox.com/u/419878/readOccurences.jpg
>
> There is a sharp transition, but no portion to the left. I interpret this
> as a good sign: most of the reads are seen a small number of times (<10),
> and there are relatively few over-represented reads. Is there anything
> there that would worry more experienced heads?

It depends a bit on what the data is for, but your interpretation above is accurate so if consistent with your expectations then that's good.

Martin

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
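To make the "total # of bases" point concrete, here is a rough back-of-envelope sketch in Python. The read length is not given anywhere in this thread, so the 36 used below is purely an assumption, as is the simplification of one byte per base plus one byte per quality score. Real in-memory objects carry extra overhead (read identifiers, alignment columns, temporary copies), so treat the result as a floor rather than a prediction.

```python
def min_bytes(n_reads, read_length, bytes_per_base=1, bytes_per_quality=1):
    """Lower bound on bytes needed to hold bases plus per-base qualities."""
    return n_reads * read_length * (bytes_per_base + bytes_per_quality)


GIB = 1024 ** 3
ASSUMED_READ_LENGTH = 36   # assumption: read length is not stated in the thread
READS_PER_LANE = 20_000_000  # upper end of the "10-20 million" in the question
N_LANES = 7

one_lane = min_bytes(READS_PER_LANE, ASSUMED_READ_LENGTH)
all_lanes = N_LANES * one_lane

print(f"one lane:  {one_lane / GIB:.2f} GiB")
print(f"all lanes: {all_lanes / GIB:.2f} GiB")
print("all lanes exceed 6 GiB of RAM:", all_lanes > 6 * GIB)
```

Under these assumptions, bases and qualities alone for all seven lanes would not fit in 6GB, which is consistent with the whole-directory run grinding to a halt while single files stay workable. Longer reads scale the estimate up proportionally.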

score 0 · Answer 2 · 2010-07-23

Steve Lianoglou <mailinglist.honeypot at gmail.com> writes:

> Hi Martin,
>
> On Fri, Jul 23, 2010 at 6:23 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> <snip>
>> It's total # of bases that'll be important, but if these are 'long'
>> reads then yes, likely memory is limiting (we're hoping to take a
>> better approach to qa and other input functions over the next release,
>> though that doesn't help you at the moment).
>
> Are you at liberty to discuss what approaches you folks are
> considering? I'm curious. Is there some wiki page or something?

These are still being explored, with the main approaches a streaming / block processing model or memory mapping. Streaming / block processing seems more likely in the short term.

Our development plans are documented, in an internal-but-public way, at http://wiki.fhcrc.org/bioc/DevPlans. Input welcome.

Martin

> Thanks,
> -steve

--
Martin Morgan
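As a generic illustration of what a streaming / block processing model means, here is a minimal Python sketch that reads a file a fixed number of records at a time and keeps only running tallies, so memory use depends on the block size rather than the file size. This is not the ShortRead design (which the thread says was still being explored). The helper names and the toy tab-delimited layout with the sequence in the first column are assumptions for the example, and real export files have a different column layout.

```python
import io
from collections import Counter
from itertools import islice


def iter_blocks(handle, block_size):
    """Yield lists of at most `block_size` lines until the handle is exhausted."""
    while True:
        block = list(islice(handle, block_size))
        if not block:
            return
        yield block


def tally_base_counts(handle, block_size=2, seq_column=0):
    """Accumulate read and per-base counts one block at a time.

    Only the current block and the running totals are held in memory.
    `seq_column` is a hypothetical column index for the read sequence.
    """
    totals = Counter()
    n_reads = 0
    for block in iter_blocks(handle, block_size):
        for line in block:
            seq = line.rstrip("\n").split("\t")[seq_column]
            totals.update(seq)   # count each base character in the read
            n_reads += 1
    return n_reads, totals


# Toy input (invented): sequence in column 0, an arbitrary second column.
toy = io.StringIO("ACGT\t1\nAACC\t2\nGGGT\t3\n")
n_reads, totals = tally_base_counts(toy, block_size=2)
print(n_reads, dict(totals))
```

Summaries that can be updated incrementally (counts, sums, histograms) fit this pattern naturally. Statistics that need every read at once, such as exact duplicate counts across a whole lane, need extra care or a different strategy like the memory mapping mentioned above.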