ShortRead QA
@alex-gutteridge-2935
I'm dealing with some Solexa/Illumina data with ShortRead for the first time and had a couple of questions relating to QA:

1. Memory requirements: My data comprises 7 s_N_export.txt files. Each one comprises 10-20 million aligned reads. If I try to run qa() over the whole directory my machine rapidly grinds to a halt. Tackling each file individually keeps my machine running, but takes >1 hour for each one. The ShortRead vignette says evaluating a single lane can take 'several minutes', so I'm wondering if anyone can offer any clues as to why I'm struggling so much? The machine in question has 6GB of RAM - do I just need more?

2. Read distribution: The QA results I'm getting for the 'read distribution' section don't quite look like those presented in the example ShortRead Solexa QA report. My interpretation is that this is because my data is actually rather high quality, but I'd appreciate a second opinion.

To quote from the ShortRead QA report:

'Ideally, the cumulative proportion of reads will transition sharply from low to high. Portions to the left of the transition might correspond roughly to sequencing or sample processing errors, and correspond to reads that are represented relatively infrequently [...]. Portions to the right of the transition represent reads that are over-represented compared to expectation.'

Typically the read distribution plots I'm seeing look like this:
http://dl.dropbox.com/u/419878/readOccurences.jpg

There is a sharp transition, but no portion to the left. I interpret this as a good sign: most of the reads are seen a small number of times (<10), and there are relatively few over-represented reads. Is there anything there that would worry more experienced heads?

--
Alex Gutteridge
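For readers unfamiliar with the plot being described, here is a minimal sketch of how a "cumulative proportion of reads versus number of occurrences" curve could be computed. It follows the wording of the quoted report only; it is not ShortRead's code, and the function name `read_occurrence_curve` and the toy reads are made up for illustration.

```python
from collections import Counter


def read_occurrence_curve(reads):
    """Return (occurrences, cumulative proportion of reads) pairs.

    Illustrative only: an assumption about what the plot shows,
    not the ShortRead implementation.
    """
    per_sequence = Counter(reads)  # distinct sequence -> times it was seen
    reads_at_count = Counter()
    for n in per_sequence.values():
        # A sequence seen n times contributes n reads at x = n.
        reads_at_count[n] += n
    total = sum(reads_at_count.values())
    curve, running = [], 0
    for n in sorted(reads_at_count):
        running += reads_at_count[n]
        curve.append((n, running / total))
    return curve


# Hypothetical toy data: one singleton, two sequences seen twice, one seen five times.
toy = ["ACGT"] * 1 + ["TTGA"] * 2 + ["GGCC"] * 2 + ["AAAA"] * 5
curve = read_occurrence_curve(toy)
print(curve)
```

In this toy case half of the reads come from sequences seen at most twice, and the remaining half from a single sequence seen five times, which is the kind of right-hand tail the report calls over-represented.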
Sequencing ShortRead • 1.3k views
@martin-morgan-1513
Alex Gutteridge <alexg at ruggedtextile.com> writes:

> I'm dealing with some Solexa/Illumina data with ShortRead for the first
> time and had a couple of questions relating to QA:
>
> 1. Memory requirements: My data comprises 7 s_N_export.txt files. Each one
> comprises 10-20 million aligned reads. If I try to run qa() over the whole
> directory my machine rapidly grinds to a halt. Tackling each file
> individually keeps my machine running, but takes >1 hour for each one. The
> ShortRead vignette says evaluating a single lane can take 'several
> minutes', so I'm wondering if anyone can offer any clues as to why I'm
> struggling so much? The machine in question has 6GB of RAM - do I just need
> more?

It's total # of bases that'll be important, but if these are 'long' reads then yes, likely memory is limiting (we're hoping to take a better approach to qa and other input functions over the next release, though that doesn't help you at the moment).

> 2. Read distribution: The QA results I'm getting for the 'read
> distribution' section don't quite look like those presented in the example
> ShortRead Solexa QA report. My interpretation is that this is because my
> data is actually rather high quality, but I'd appreciate a second opinion.
>
> To quote from the ShortRead QA report:
>
> 'Ideally, the cumulative proportion of reads will transition sharply from
> low to high. Portions to the left of the transition might correspond
> roughly to sequencing or sample processing errors, and correspond to reads
> that are represented relatively infrequently [...]. Portions to the right
> of the transition represent reads that are over-represented compared to
> expectation.'
>
> Typically the read distribution plots I'm seeing look like this:
> http://dl.dropbox.com/u/419878/readOccurences.jpg
>
> There is a sharp transition, but no portion to the left. I interpret this
> as a good sign: most of the reads are seen a small number of times (<10),
> and there are relatively few over-represented reads. Is there anything
> there that would worry more experienced heads?

It depends a bit on what the data is for, but your interpretation above is accurate so if consistent with your expectations then that's good.

Martin

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
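To put rough numbers on the point that total bases drive memory use, a back-of-envelope sketch follows. Everything in it is an assumption made for illustration: the thread does not state the read length (75 bp is a placeholder), and one byte per base plus one byte per quality score is a simplification, not a measurement of ShortRead's objects.

```python
def estimate_lane_bytes(n_reads, read_length, bytes_per_base=1, bytes_per_quality=1):
    """Lower-bound estimate of bytes needed to hold one lane's bases and qualities.

    Hypothetical helper: assumes one byte per base and per quality score.
    """
    total_bases = n_reads * read_length
    return total_bases * (bytes_per_base + bytes_per_quality)


GIB = 1024 ** 3

# 10-20 million reads per lane comes from the question; 75 bp is assumed.
for n_reads in (10_000_000, 20_000_000):
    gib = estimate_lane_bytes(n_reads, read_length=75) / GIB
    print(f"{n_reads:,} reads x 75 bp: about {gib:.2f} GiB for bases + qualities")
```

This counts only the raw bases and qualities. Alignment columns from the export file and any intermediate copies made during processing would add to it, so a figure in this range for a single lane is at least consistent with a 6GB machine struggling.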
Hi Martin,

On Fri, Jul 23, 2010 at 6:23 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
<snip>
> It's total # of bases that'll be important, but if these are 'long'
> reads then yes, likely memory is limiting (we're hoping to take a
> better approach to qa and other input functions over the next release,
> though that doesn't help you at the moment).

Are you at liberty to discuss what approaches you folks are considering? I'm curious. Is there some wiki page or something?

Thanks,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
@martin-morgan-1513
Steve Lianoglou <mailinglist.honeypot at gmail.com> writes:

> Hi Martin,
>
> On Fri, Jul 23, 2010 at 6:23 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> <snip>
>> It's total # of bases that'll be important, but if these are 'long'
>> reads then yes, likely memory is limiting (we're hoping to take a
>> better approach to qa and other input functions over the next release,
>> though that doesn't help you at the moment).
>
> Are you at liberty to discuss what approaches you folks are
> considering? I'm curious. Is there some wiki page or something?

These are still being explored, with the main approaches a streaming / block processing model or memory mapping. Streaming / block processing seems more likely in the short term. Our development plans are documented, in an internal-but-public way, at http://wiki.fhcrc.org/bioc/DevPlans. Input welcome.

Martin

> Thanks,
> -steve
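As a generic illustration of the streaming / block processing idea mentioned above, the sketch below reads a file a fixed number of records at a time and keeps only running summaries, so memory use depends on the block size, not the file size. It is not ShortRead's design. The names `iter_blocks` and `summarize` are hypothetical, and the one-sequence-per-line input is a simplification and not the real export format.

```python
import io
from itertools import islice


def iter_blocks(handle, block_size):
    """Yield lists of at most block_size lines from an open text handle."""
    while True:
        block = [line.rstrip("\n") for line in islice(handle, block_size)]
        if not block:
            return
        yield block


def summarize(handle, block_size=1_000_000):
    """Accumulate read and base counts while holding one block in memory."""
    n_reads = 0
    n_bases = 0
    for block in iter_blocks(handle, block_size):
        n_reads += len(block)
        n_bases += sum(len(seq) for seq in block)
    return n_reads, n_bases


# Hypothetical stand-in for a file: one read sequence per line.
demo = io.StringIO("ACGT\nTTGACA\nGG\nAAAAA\nCCC\n")
result = summarize(demo, block_size=2)
print(result)
```

Summaries that can be updated incrementally (counts, per-cycle base tallies, quality sums) fit this pattern directly. Statistics that need every distinct read at once, such as read-occurrence counts, need more care because the table of distinct sequences can itself grow large.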