Course material for "Using Bioconductor for ChIP-seq experiments"
Patrick Aboyoun
The course material for "Using Bioconductor for ChIP-seq experiments", which was held at the Fred Hutchinson Cancer Research Center from 12-14 Nov 2008, is now available on-line at http://bioconductor.org/workshops/2008/SeattleNov08/

Topics for this course include:

1. Overview of Bioconductor and high-throughput sequence data
2. Sequence Data I/O and QA using ShortRead
3. Sequence Data Exploration using rtracklayer
4. Sequence Data Annotations from org.* packages and the biomaRt package
5. Biostrings and BSgenome Basics
6. Sequence Matching and Aligning using Biostrings
7. Ambiguous Motif Resolution
8. Example ChIP-seq Analysis Workflow
9. RNA-seq

This course was designed to highlight what BioC 2.3, along with R 2.8, has to offer experimenters working with high-throughput sequence data. Course participants used laptop computers to read, examine, annotate, align, and analyze high-throughput sequence data on the BioC 2.3 ShortRead/Biostrings/BSgenome/IRanges/rtracklayer/biomaRt backbone (a minimal sketch of the read/QA step appears below). Members of the Bioconductor Core Team who attended the course are thankful for all the feedback they received from these participants, and they are actively converting this feedback into enhancements within the BioC 2.4 code line.

Please direct all comments on this course material, as well as BioC 2.4 enhancement requests in the sequencing realm, to the Bioc-sig-sequencing group (https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing), since it provides a forum for more detailed discussion by interested parties.

Sincerely,
The Bioconductor Core Team
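As a rough illustration of the read/QA step covered in topics 2, 5 and 6, here is a minimal sketch using ShortRead. The directory "/path/to/run" and the file "s_1_sequence.txt" are placeholders rather than part of the course material, and argument names may differ slightly between ShortRead versions.

library(ShortRead)

## Read an Illumina FASTQ file into a ShortReadQ object.
## "/path/to/run" and "s_1_sequence.txt" are placeholders for a real
## run directory and export file.
fq <- readFastq("/path/to/run", pattern = "s_1_sequence.txt")

## Quick quality-assessment report written to an HTML directory.
qaSummary <- qa("/path/to/run", "s_1_sequence.txt", type = "fastq")
report(qaSummary, dest = "qa_report")

## Extract the raw reads as a DNAStringSet for downstream matching
## with Biostrings/BSgenome.
reads <- sread(fq)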
Sequencing Cancer BSgenome annotate Biostrings • 1.2k views
Hello,

Just out of pure curiosity: if I have around 60,000,000 short fragments, a typical output of an Illumina GAII experiment, can your package align these to a reference genome, such as the Human genome, and if so, how much memory is required to complete this process? How much time is necessary on a typical GAII server, that is, 16 GB of memory and 4 Intel quad-core processors?

Could you compare the speed of your processing against something like, for instance, Eland?

Wkr,

Werner,-
--
Dr. Werner Van Belle
http://werner.sigtrans.org/
Hi Werner,

Werner Van Belle wrote:
> Just out of pure curiosity: if I have around 60,000,000 short
> fragments, a typical output of an Illumina GAII experiment, can your
> package align these to a reference genome, such as the Human genome,
> and if so, how much memory is required to complete this process? How
> much time is necessary on a typical GAII server, that is, 16 GB of
> memory and 4 Intel quad-core processors?

What is the length of your fragments?

The PDict/matchPDict tool in the Biostrings package uses an approach that consists of preprocessing the short fragments. The result of this preprocessing is a PDict object that currently takes a lot of memory: around 7 GB for 10 million 36-mers. Also, 10M 36-mers / 15M 25-mers is close to the maximum number of short fragments that you can store in a PDict object, so you'll have to split your original set of 60M fragments.

Using the PDict object to match the 10M (or 15M) fragments against the Human genome (+ and - strands) should take about 1 hour on a Linux server with 16 GB of RAM. That's for exact matching. PDict/matchPDict also supports inexact matching (a small number of mismatches per read, let's say 1 or 2), but this will increase the time by a factor of 12 for 1 mismatch and a factor of 240 for 2 mismatches!

> Could you compare the speed of your processing against something like,
> for instance, Eland?

PDict/matchPDict is very fast for exact matching. If you want to allow up to 2 mismatches, a tool like bowtie will be much faster. The speed of PDict/matchPDict will be comparable to that of MAQ, but PDict/matchPDict uses more memory. I'm not sure how it compares with Eland, but I think MAQ is faster than Eland. Note that bowtie, MAQ and Eland do quality-based alignments; PDict/matchPDict doesn't use the quality at all.

Another difference is that PDict/matchPDict will return all the matches for all the reads, whereas bowtie, MAQ and Eland return at most 1 match per read. With PDict/matchPDict it's up to you to decide what to do with the reads that have multiple matches. Also, for now we offer no facilities to write the output of matchPDict() to a file (this will be added soon).

PDict/matchPDict is still a very young tool and there is still room for improvements, like making the PDict object more compact in memory, allowing it to store 50M or more short reads, supporting indels, limiting the number of matches per read returned by matchPDict(), providing I/O facilities, etc. User feedback will help us to set priorities.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
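To make the workflow described above concrete, here is a minimal sketch of exact matching with PDict/countPDict. It assumes a fixed-width DNAStringSet called "reads" and the hg18 BSgenome package; the 10M chunk size follows the numbers quoted in the reply, the object names are placeholders, and exact argument names may vary between Biostrings versions.

library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg18)   # hg18 was the current human build at the time

## 'reads' is assumed to be a fixed-width DNAStringSet of short reads,
## e.g. reads <- sread(readFastq("/path/to/run", pattern = "s_1_sequence.txt"))

chunk.size <- 10e6                     # ~10M reads per PDict, as suggested above
chunk.id   <- ceiling(seq_along(reads) / chunk.size)
hits.per.read <- integer(length(reads))

for (i in unique(chunk.id)) {
    chunk <- reads[chunk.id == i]
    pdict <- PDict(chunk)              # preprocessing step; this is the memory-hungry part
    counts <- integer(length(chunk))
    for (chr in seqnames(Hsapiens)) {
        subject <- Hsapiens[[chr]]
        ## exact matches on the + strand ...
        counts <- counts + countPDict(pdict, subject)
        ## ... and on the - strand, via the reverse complement of the chromosome
        counts <- counts + countPDict(pdict, reverseComplement(subject))
    }
    hits.per.read[chunk.id == i] <- counts
}

## hits.per.read now holds the number of exact genomic matches per read;
## reads with 0 or >1 hits can then be handled however you see fit.

Using matchPDict() in place of countPDict() would return the actual match positions (an MIndex object) rather than just per-read counts.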
Hi Herve,

Hervé Pagès wrote:
<snip>
> bowtie, MAQ and Eland return at most 1 match
> per read.

This isn't true. The default is one, but you can have as many as you want for any of these tools. Of course there is a penalty for more matches.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-0646
734-936-8662
On Fri, Nov 21, 2008 at 2:17 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
>
> PDict/matchPDict uses more memory. I'm not sure how it compares with
> Eland but I think MAQ is faster than Eland.
>

Actually, Eland is generally faster.

Sean
