Search
Question: DESeq on CCAT identified chipseq peaks
0
4.1 years ago by
Guest User12k
Guest User12k wrote:
I plan on using DESeq downstream of CCAT identified peaks on 5 tumor and 5 normal samples and I was unsure of how to best create a unified list of peaks and corresponding read counts - CCAT outputs different peak regions from each sample. Thus to create a unified list of peak regions and their read counts would you suggest - A. Taking a union of all the CCAT called peaks and calculating read count in each biological replicate OR B. Calculating the read count for each peak in each replicate whether or not it has been called in the replicate or not I saw both being suggested earlier online and I am not sure which is appropriate. 2. Since this is chipseq and not rna seq data, do you agree that using coverageBed ( coverageBed -abam $bamfile -b$CCATpeaks > countdata) would work as good as HTseqcount ? Thanks ! -- output of sessionInfo(): - -- Sent via the guest posting facility at bioconductor.org.
modified 4.1 years ago by Rory Stark2.5k • written 4.1 years ago by Guest User12k
0
4.1 years ago by
Rory Stark2.5k
CRUK, Cambridge, UK
Rory Stark2.5k wrote:
You do indeed want to form a consensus peakset from the replicates. How you do this depends on exactly what question you are trying to ask. You can take the union of all peak and count the reads for each peak in each replicate, or you use more stringent criteria in determining the consensus peakset, such as peaks that appear in at least 2 (or 3) replicates, or perhaps the union of peaks that appear in a majority of each condition (ie peaks identified in at least 3 of 5 tumors OR in at least 3 of 5 normals). The DiffBind package provides tools to do exactly this, and the user guide/vignette walks through an example in some detail. Besides assembling consensus peaksets, DiffBind will handle the counting (with various options) and differential analysis using edgeR, DESeq, and/or DESeq2, and has convenient tools for reporting and plotting results. Cheers- Rory on Wed May 14 18:16:36 CEST 2014 Aditi [guest] guest at bioconductor.org wrote: > I plan on using DESeq downstream of CCAT identified peaks on 5 tumor and 5 normal samples and I was unsure of how to best create a > unified list of peaks and corresponding read counts - > CCAT outputs different peak regions from each sample. Thus to create a unified list of peak regions and their read counts would you suggest - > > A. Taking a union of all the CCAT called peaks and calculating read count in each biological replicate OR > > B. Calculating the read count for each peak in each replicate whether or not it has been called in the replicate or not > > I saw both being suggested earlier online and I am not sure which is appropriate. > > 2. Since this is chipseq and not rna seq data, do you agree that using coverageBed ( coverageBed -abam $bamfile -b$CCATpeaks > countdata) would work as > good as HTseqcount ?
Hi Aditi on Wed May 14 18:16:36 CEST 2014 Aditi [guest] guest at bioconductor.org wrote: [...] Thus to create a unified list of peak regions and their read counts would you suggest - >> >> A. Taking a union of all the CCAT called peaks and calculating read >> count in each biological replicate OR >> >> B. Calculating the read count for each peak in each replicate >> whether or not it has been called in the replicate or not >> >> I saw both being suggested earlier online and I am not sure which >> is appropriate. Both approaches will give statistically valid results. I don't have much experience with ChIP-Seq myself, so I suggest to follow Rory's advice (and the DiffBind vignette) to get most inferential power. >> 2. Since this is chipseq and not rna seq data, do you agree that >> using coverageBed ( coverageBed -abam $bamfile -b$CCATpeaks > >> countdata) would work as > good as HTseqcount ? Yes. As the features do not overlap, this should not make a difference. BEDtools might be easier to use here, as you probably have the peaks in BED format, anyway. Or you use Rory's DiffBind package. Simon
Hi Dr. Rory, Thanks a lot for pointing this out. I wanted to confirm one thing while using diffbind - If my sample sheet looks like - SampleID Tissue Factor Condition Treatment Replicate bamReads bamControl Peaks PeakCaller PeakFormat ScoreCol LowerBetter 1 T h3k4me3 tumor none 1 PATH PATH PATH raw raw 4 FALSE 2 N h3k4me3 normal none 1 PATH PATH PATH raw raw 4 FALSE 3 T h3k4me3 tumor none 2 PATH PATH PATH raw raw 4 FALSE 4 N h3k4me3 normal none 2 PATH PATH PATH raw raw 4 FALSE 5 T h3k4me3 tumor none 3 PATH PATH PATH raw raw 4 FALSE 6 N h3k4me3 normal none 3 PATH PATH PATH raw raw 4 FALSE 7 T h3k4me4 tumor none 4 PATH PATH PATH raw raw 5 FALSE 8 N h3k4me5 normal none 4 PATH PATH PATH raw raw 6 FALSE 9 T h3k4me6 tumor none 5 PATH PATH PATH raw raw 7 FALSE 10 N h3k4me7 normal none 5 PATH PATH PATH raw raw 8 FALSE Then to create a consensus peakset from the union of peaks that appear in atleast 3 of 5 samples of each condition, the commandline would be - h3k4me3_peakset = dba.peakset(h3k4me3_readin,consensus = DBA_CONDITION, minOverlap=0.6) I am not too clear on how to use this command and thus wanted to confirm. Thanks ! Aditi -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor- bounces@r-project.org] On Behalf Of Rory Stark Sent: Thursday, May 15, 2014 12:59 AM To: bioconductor@r-project.org Subject: Re: [BioC] DESeq on CCAT identified chipseq peaks You do indeed want to form a consensus peakset from the replicates. How you do this depends on exactly what question you are trying to ask. You can take the union of all peak and count the reads for each peak in each replicate, or you use more stringent criteria in determining the consensus peakset, such as peaks that appear in at least 2 (or 3) replicates, or perhaps the union of peaks that appear in a majority of each condition (ie peaks identified in at least 3 of 5 tumors OR in at least 3 of 5 normals). The DiffBind package provides tools to do exactly this, and the user guide/vignette walks through an example in some detail. Besides assembling consensus peaksets, DiffBind will handle the counting (with various options) and differential analysis using edgeR, DESeq, and/or DESeq2, and has convenient tools for reporting and plotting results. Cheers- Rory on Wed May 14 18:16:36 CEST 2014 Aditi [guest] guest at bioconductor.org wrote: > I plan on using DESeq downstream of CCAT identified peaks on 5 tumor > and 5 normal samples and I was unsure of how to best create a > > unified list of peaks and corresponding read counts - CCAT outputs > different peak regions from each sample. Thus to create a unified list > of peak regions and their read counts would you suggest - > > A. Taking a union of all the CCAT called peaks and calculating read > count in each biological replicate OR > > B. Calculating the read count for each peak in each replicate whether > or not it has been called in the replicate or not > > I saw both being suggested earlier online and I am not sure which is appropriate. > > 2. Since this is chipseq and not rna seq data, do you agree that using coverageBed ( coverageBed -abam $bamfile -b$CCATpeaks > countdata) would work as > good as HTseqcount ? _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------- This e-mail and any attachments are only for the use of the intended recipient and may be confidential and/or privileged. If you are not the recipient, please delete it or notify the sender immediately. Please do not copy or use it for any purpose or disclose the contents to any other person as it may be an offence under the Official Secrets Act. ------------------------------- [[alternative HTML version deleted]]
Hello Aditi- What you want is to use a "matched" design. There is a good explanation of this design (in a differential expression context) in the edgeR vignette. Basically, the matched tumour-normal pairs are going to have certain similarities to each other as they each come from the same patient. A matched design will model this to detect consistent differences in enrichment between tumor and normal that is independent of individual patients. You can analyze a matched design by setting up the contrast with block=DBA_REPLICATE: > h3k4me3_counts = dba.contrast(h3k4me3_counts, categories=DBA_CONDITION, >block=DBA_REPLICATE) > h3k4me3_counts = dba.analyze(h3k4me3_counts, method=DBA_DESEQ2) You'll see that two analyses are run (unmatched and matched): > h3k4me3_counts Is is useful to look at the MA plot: > dba.plotMA(h3k4me3_counts, method=DBA_DESEQ2_BLOCK) You can get the list of all the sites with statistics relating to how confidently they can be identified as being differentially enriched: > matchedReport = dba.report(h3k4me3_counts, method=DBA_DESEQ2_BLOCK, th=1) Cheers- Rory On 16/05/2014 06:54, "QAMRA Aditi (GIS)" <qamraa99 at="" gis.a-star.edu.sg=""> wrote: >Hi Dr. Rory, > >I understand now. Thank you ! > >A last question (hopefully) - Can you explain a little more on how the >use of a blocking factor works in the case of matched normal tumor pairs >? Does it mean that using the DBA_REPLICATE condition as a blocking >factor in such a case adjusts (?) and removes any sort of batch effects >between replicates ? > >Thanks ! >Aditi >________________________________________