Diffbind: Binding Affinity Heatmap
1
0
Entering edit mode
Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 5 weeks ago
Cambridge, UK
Hi Giuseppe- The standard heatmap plots the read densities of the differentially bound sites. The x-axis clusters the samples, and the y-axis clusters the sites based on the score for each site in each sample. The major clusters should show similar patterns of binding levels. In the example plot you sent, there are roughly three main clusters of differentially bound sites. The bottom cluster has higher binding rates in the first (red) sample group (group one gain/group two loss). The middle cluster includes sites with higher binding rates in the second sample group (group one loss/group two gain). The top cluster includes sites with substantial binding in both groups, but that nonetheless exhibit a significant change in binding intensity at these sites; it looks in general like these go from moderate binding in the first group to very strong binding in the second group (especially in the sample cluster on the far right). You can get a bit more contrast in these plots by using the "maxval" parameter to clip the highly-boud sites (the long tail to the right in the Color Key). For example, in this case setting maxval=6 could probably give a clearer picture of what patterns are driving the clustering of binding sites. Cheers- Rory On 14/08/2013 11:37, "Giuseppe Gallone" <giuseppe.gallone at="" dpag.ox.ac.uk=""> wrote: >Hi Rory > >I have a further question about DiffBind. Could you tell me something >more about the clustering visualisation obtained with >dba.plotHeatmap(....correlations=FALSE)? I've carried out a differential >analysis on my samples and I observe some interesting clustering using >both EDGER and DESEQ. I then plotted the heatmap using correlation=FALSE. > >I understand that the clustering obtained with dba.visualise is >reproduced on the x axis (columns are grouped by clustering). > >What is shown instead on the y axis? Are these the individual >differentially bound sites across the genome? What is the clustering >described on the left? > >Thanks once again. > >Best >Giuseppe > >On 07/23/13 18:22, Rory Stark wrote: >> Hi Giuseppe- >> >> I'm glad to sorted the column thing out, that was what I suspected. >> >> There shouldn't be much problem doing the analysis without a control >> track, particularly if the samples come from the same tissue. The main >> role of the control tracks is for peak calling. The reason the control >> track is less important for differential analysis is that youy are >>looking >> at the relative differences in read density at the same genomic >>intervals >> across samples, and not comparing read densities across intervals. So if >> the control track were similar at that location for all samples, it will >> not affect the differential analysis. The main issue would be if there >> were something like big copy number differences between samples. Then >>you >> could get sites that show as differentially bound when the real >>difference >> was the copy number. But the difference would be real regardless. >> >> Regarding sequencing depth, this should be taken care of by the >> normalisation step. It takes the library size (either full library size, >> which is the total number of reads, or the default effective library >>size, >> the number of reads within peaks for each sample) and adjusts the read >> counts. You can can an idea of how this is working by using the >> dba.plotBox (with bAll=TRUE) comparing bNormalized=TRUE and >> bNormalized=FALSE to see if things even out. Also, after counting, you >>can >> look at the clustering (dba.plotPCA and dba.plotHeatmap) to see if >>samples >> are grouping by sequencing depth -- try doing the same plots with >> different score, eg score=DBA_SCORE_READS, score=DBA_SCORE_RPKM, and >> score=DBA_SCORE_TMM_READS_EFFECTIVE or score=DBA_SCORE_TMM_READS_FULL to >> see which gives to the best clustering. >> >> Hope this helps! >> >> Cheers- >> Rory
Sequencing GO Clustering GLAD edgeR DESeq DiffBind Sequencing GO Clustering GLAD edgeR • 2.1k views
ADD COMMENT
0
Entering edit mode
Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 5 weeks ago
Cambridge, UK
Hi Giuseppe- Two compare different peak callers on the same replicate, you can get the clustering/correlation at the peak level but it doesn't make sense at the count level, as all the peaks are merged into a single consensus set at that point. You did this correctly in the first case by including a line for each peak caller with the same read (bam) files. At that point you can get a correlation heatmap, PCA plot, etc, as well as look at overlaps (e.g. by using dba.plotVenn and/or dba.overlap). One you create a binding matrix, as it done when you run dba.count, you are using a single "consensus" set of peaks for all the samples, and getting the number of reads in these peaks for each sample. So it no longer makes sense to have a different sets of counts for each original peakset. This is a result of the peaks being "merged" (by default, all the peaks that appear in at least two peaksets are merged into a single set of peaks for the rest of the analysis). If try what you suggest, and use symbolic links, you should get exactly the same result for each virtual replicate -- that is, the three entries should have correlation values of 1.0, as the same reads are being counted within the same (global, merged) consensus peakset. Cheers- Rory On 15/08/2013 11:57, "Giuseppe Gallone" <giuseppe.gallone at="" dpag.ox.ac.uk=""> wrote: >hm looks like I found the reason. DiffBind wants the bam files named by >replicate, even if they're the same file. 3 symbolic links per bam did >it. Still I wonder if this kind of analysis has any sense in your opinion. > >Thanks >Giuseppe > >On 08/15/13 11:48, Giuseppe Gallone wrote: >> Hi Rory >> >> >> thanks for your explanation, it makes sense. I have another question if >> you don't mind. I have a trio of samples. I don't have replicates so I >> wanted to check the agreement of 3 different peak callers, using each >> peak caller as a replicate. so my sampleSheet looks a bit like this >> >> >> >>SampleID,Tissue,Factor,Condition,Treatment,Replicate,bamReads,bamCon trol, >>Peaks,PeakFormat,ScoreCol,LowerBetter >> >> >>ID1.1,ID1,TF,mother,stimulated,1,ID1_reads.bam,,PeakCaller1/ID1_peak s.bed >>,raw,5,F >> >> >>ID1.2,ID1,TF,mother,stimulated,2,ID1_reads.bam,,PeakCaller2/ID1_peak s.bed >>,raw,5,F >> >> >>ID1.3,ID1,TF,mother,stimulated,3,ID1_reads.bam,,PeakCaller3/ID1_peak s.bed >>,raw,5,F >> >> >>ID2.1,ID2,TF,father,stimulated,1,ID2_reads.bam,,PeakCaller1/ID2_peak s.bed >>,raw,5,F >> >> >>ID2.2,ID2,TF,father,stimulated,2,ID2_reads.bam,,PeakCaller2/ID2_peak s.bed >>,raw,5,F >> >> >>ID2.3,ID2,TF,father,stimulated,3,ID2_reads.bam,,PeakCaller3/ID2_peak s.bed >>,raw,5,F >> >> >>ID3.1,ID3,TF,child,stimulated,1,ID3_reads.bam,,PeakCaller1/ID3_peaks .bed, >>raw,5,F >> >> >>ID3.2,ID3,TF,child,stimulated,2,ID3_reads.bam,,PeakCaller2/ID3_peaks .bed, >>raw,5,F >> >> >>ID3.3,ID3,TF,child,stimulated,3,ID3_reads.bam,,PeakCaller3/ID3_peaks .bed, >>raw,5,F >> >> >> >> The samples are loaded ok with dba, and I do see some clustering by >> replicate via a simple plot. However, I have problems with the dba.count >> call. Basically, the correlation plot produced by dba.count only shows >> three datapoints, corresponding to the 1st replicate (=peak caller n.1) >> for each of the three samples. Why is that? Am I doing something wrong. >> The only difference with your example is that, in your examples, the >> files for the aligned reads are different - one for each replicate. Am I >> "cheating" in using the same reads.bam file for each "replicate"? >>Thanks! >> >> Giuseppe >> >> >> >> On 08/14/13 14:38, Rory Stark wrote: >>> Hi Giuseppe- >>> >>> The standard heatmap plots the read densities of the differentially >>>bound >>> sites. The x-axis clusters the samples, and the y-axis clusters the >>>sites >>> based on the score for each site in each sample. The major clusters >>> should >>> show similar patterns of binding levels. >>> >>> In the example plot you sent, there are roughly three main clusters of >>> differentially bound sites. The bottom cluster has higher binding >>> rates in >>> the first (red) sample group (group one gain/group two loss). The >>>middle >>> cluster includes sites with higher binding rates in the second sample >>> group (group one loss/group two gain). The top cluster includes sites >>> with >>> substantial binding in both groups, but that nonetheless exhibit a >>> significant change in binding intensity at these sites; it looks in >>> general like these go from moderate binding in the first group to very >>> strong binding in the second group (especially in the sample cluster on >>> the far right). >>> >>> You can get a bit more contrast in these plots by using the "maxval" >>> parameter to clip the highly-boud sites (the long tail to the right in >>> the >>> Color Key). For example, in this case setting maxval=6 could probably >>> give >>> a clearer picture of what patterns are driving the clustering of >>>binding >>> sites. >>> >>> Cheers- >>> Rory >>> >>> On 14/08/2013 11:37, "Giuseppe Gallone" >>><giuseppe.gallone at="" dpag.ox.ac.uk=""> >>> wrote: >>> >>>> Hi Rory >>>> >>>> I have a further question about DiffBind. Could you tell me something >>>> more about the clustering visualisation obtained with >>>> dba.plotHeatmap(....correlations=FALSE)? I've carried out a >>>>differential >>>> analysis on my samples and I observe some interesting clustering using >>>> both EDGER and DESEQ. I then plotted the heatmap using >>>> correlation=FALSE. >>>> >>>> I understand that the clustering obtained with dba.visualise is >>>> reproduced on the x axis (columns are grouped by clustering). >>>> >>>> What is shown instead on the y axis? Are these the individual >>>> differentially bound sites across the genome? What is the clustering >>>> described on the left? >>>> >>>> Thanks once again. >>>> >>>> Best >>>> Giuseppe >>>> >>>> On 07/23/13 18:22, Rory Stark wrote: >>>>> Hi Giuseppe- >>>>> >>>>> I'm glad to sorted the column thing out, that was what I suspected. >>>>> >>>>> There shouldn't be much problem doing the analysis without a control >>>>> track, particularly if the samples come from the same tissue. The >>>>>main >>>>> role of the control tracks is for peak calling. The reason the >>>>>control >>>>> track is less important for differential analysis is that youy are >>>>> looking >>>>> at the relative differences in read density at the same genomic >>>>> intervals >>>>> across samples, and not comparing read densities across intervals. >>>>> So if >>>>> the control track were similar at that location for all samples, it >>>>> will >>>>> not affect the differential analysis. The main issue would be if >>>>>there >>>>> were something like big copy number differences between samples. Then >>>>> you >>>>> could get sites that show as differentially bound when the real >>>>> difference >>>>> was the copy number. But the difference would be real regardless. >>>>> >>>>> Regarding sequencing depth, this should be taken care of by the >>>>> normalisation step. It takes the library size (either full library >>>>> size, >>>>> which is the total number of reads, or the default effective library >>>>> size, >>>>> the number of reads within peaks for each sample) and adjusts the >>>>>read >>>>> counts. You can can an idea of how this is working by using the >>>>> dba.plotBox (with bAll=TRUE) comparing bNormalized=TRUE and >>>>> bNormalized=FALSE to see if things even out. Also, after counting, >>>>>you >>>>> can >>>>> look at the clustering (dba.plotPCA and dba.plotHeatmap) to see if >>>>> samples >>>>> are grouping by sequencing depth -- try doing the same plots with >>>>> different score, eg score=DBA_SCORE_READS, score=DBA_SCORE_RPKM, and >>>>> score=DBA_SCORE_TMM_READS_EFFECTIVE or >>>>> score=DBA_SCORE_TMM_READS_FULL to >>>>> see which gives to the best clustering. >>>>> >>>>> Hope this helps! >>>>> >>>>> Cheers- >>>>> Rory >>> >> > >-- >Dr Giuseppe Gallone >MRC career development fellow >MRC Functional Genomics Unit - DPAG >University of Oxford, UK
ADD COMMENT
0
Entering edit mode
Hi Rory, thanks. Yes that's exactly what happens. Cheers Giuseppe On 08/15/13 13:56, Rory Stark wrote: > Hi Giuseppe- > > > Two compare different peak callers on the same replicate, you can get the > clustering/correlation at the peak level but it doesn't make sense at the > count level, as all the peaks are merged into a single consensus set at > that point. > > You did this correctly in the first case by including a line for each peak > caller with the same read (bam) files. At that point you can get a > correlation heatmap, PCA plot, etc, as well as look at overlaps (e.g. by > using dba.plotVenn and/or dba.overlap). > > One you create a binding matrix, as it done when you run dba.count, you > are using a single "consensus" set of peaks for all the samples, and > getting the number of reads in these peaks for each sample. So it no > longer makes sense to have a different sets of counts for each original > peakset. This is a result of the peaks being "merged" (by default, all the > peaks that appear in at least two peaksets are merged into a single set of > peaks for the rest of the analysis). > > If try what you suggest, and use symbolic links, you should get exactly > the same result for each virtual replicate -- that is, the three entries > should have correlation values of 1.0, as the same reads are being counted > within the same (global, merged) consensus peakset. > > Cheers- > Rory >
ADD REPLY
0
Entering edit mode
Hello Rory I've been trying to do a differential analysis between my chip-exo samples. These, as we discussed before, don't have biological replicates. Given my father-mother-child trio, I'd like to assess differential peaks using the (child, !child) contrast. The problem is, as I've found out, that edgeR requires at least two replicates, and this is also a requirement for dba.contrast. I know this is an inherent problem with my dataset, however I was wondering if there's anything you'd attempt doing before going back to lab people to tell them replicates are needed for differential analysis. For instance, I was considering the idea of using self-pseudo-replicates. Split each bam into 2 bam of equal number reads, randomly. Call peaks on each and call these rep1 and rep2. Of course this is not representative of real biological variation for that sample. Anything else you'd suggest trying? Alternatively, I have data for another trio (different Hapmap samples. but same ethnicity). Could I attempt using these as replicates for trio 1? The differential sites obtained would be based on more global patterns of variability within this population, however. Thanks Giuseppe On 08/15/13 13:56, Rory Stark wrote: > Hi Giuseppe- > > > Two compare different peak callers on the same replicate, you can get the > clustering/correlation at the peak level but it doesn't make sense at the > count level, as all the peaks are merged into a single consensus set at > that point. > > You did this correctly in the first case by including a line for each peak > caller with the same read (bam) files. At that point you can get a > correlation heatmap, PCA plot, etc, as well as look at overlaps (e.g. by > using dba.plotVenn and/or dba.overlap). > > One you create a binding matrix, as it done when you run dba.count, you > are using a single "consensus" set of peaks for all the samples, and > getting the number of reads in these peaks for each sample. So it no > longer makes sense to have a different sets of counts for each original > peakset. This is a result of the peaks being "merged" (by default, all the > peaks that appear in at least two peaksets are merged into a single set of > peaks for the rest of the analysis). > > If try what you suggest, and use symbolic links, you should get exactly > the same result for each virtual replicate -- that is, the three entries > should have correlation values of 1.0, as the same reads are being counted > within the same (global, merged) consensus peakset. > > Cheers- > Rory >
ADD REPLY

Login before adding your answer.

Traffic: 629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6