Question

Diffbind: Binding Affinity Heatmap

0

Entering edit mode

Rory Stark ★ 5.1k

@rory-stark-5741

Last seen 7 days ago

Cambridge, UK

Hi Giuseppe- The standard heatmap plots the read densities of the differentially bound sites. The x-axis clusters the samples, and the y-axis clusters the sites based on the score for each site in each sample. The major clusters should show similar patterns of binding levels. In the example plot you sent, there are roughly three main clusters of differentially bound sites. The bottom cluster has higher binding rates in the first (red) sample group (group one gain/group two loss). The middle cluster includes sites with higher binding rates in the second sample group (group one loss/group two gain). The top cluster includes sites with substantial binding in both groups, but that nonetheless exhibit a significant change in binding intensity at these sites; it looks in general like these go from moderate binding in the first group to very strong binding in the second group (especially in the sample cluster on the far right). You can get a bit more contrast in these plots by using the "maxval" parameter to clip the highly-boud sites (the long tail to the right in the Color Key). For example, in this case setting maxval=6 could probably give a clearer picture of what patterns are driving the clustering of binding sites. Cheers- Rory On 14/08/2013 11:37, "Giuseppe Gallone" <giuseppe.gallone at="" dpag.ox.ac.uk=""> wrote: >Hi Rory > >I have a further question about DiffBind. Could you tell me something >more about the clustering visualisation obtained with >dba.plotHeatmap(....correlations=FALSE)? I've carried out a differential >analysis on my samples and I observe some interesting clustering using >both EDGER and DESEQ. I then plotted the heatmap using correlation=FALSE. > >I understand that the clustering obtained with dba.visualise is >reproduced on the x axis (columns are grouped by clustering). > >What is shown instead on the y axis? Are these the individual >differentially bound sites across the genome? What is the clustering >described on the left? > >Thanks once again. > >Best >Giuseppe > >On 07/23/13 18:22, Rory Stark wrote: >> Hi Giuseppe- >> >> I'm glad to sorted the column thing out, that was what I suspected. >> >> There shouldn't be much problem doing the analysis without a control >> track, particularly if the samples come from the same tissue. The main >> role of the control tracks is for peak calling. The reason the control >> track is less important for differential analysis is that youy are >>looking >> at the relative differences in read density at the same genomic >>intervals >> across samples, and not comparing read densities across intervals. So if >> the control track were similar at that location for all samples, it will >> not affect the differential analysis. The main issue would be if there >> were something like big copy number differences between samples. Then >>you >> could get sites that show as differentially bound when the real >>difference >> was the copy number. But the difference would be real regardless. >> >> Regarding sequencing depth, this should be taken care of by the >> normalisation step. It takes the library size (either full library size, >> which is the total number of reads, or the default effective library >>size, >> the number of reads within peaks for each sample) and adjusts the read >> counts. You can can an idea of how this is working by using the >> dba.plotBox (with bAll=TRUE) comparing bNormalized=TRUE and >> bNormalized=FALSE to see if things even out. Also, after counting, you >>can >> look at the clustering (dba.plotPCA and dba.plotHeatmap) to see if >>samples >> are grouping by sequencing depth -- try doing the same plots with >> different score, eg score=DBA_SCORE_READS, score=DBA_SCORE_RPKM, and >> score=DBA_SCORE_TMM_READS_EFFECTIVE or score=DBA_SCORE_TMM_READS_FULL to >> see which gives to the best clustering. >> >> Hope this helps! >> >> Cheers- >> Rory

Sequencing GO Clustering GLAD edgeR DESeq DiffBind Sequencing GO Clustering GLAD edgeR • 1.9k views

ADD COMMENT • link 10.7 years ago Rory Stark ★ 5.1k

score 0 · Answer 1 · 2013-08-15

Hi Giuseppe- Two compare different peak callers on the same replicate, you can get the clustering/correlation at the peak level but it doesn't make sense at the count level, as all the peaks are merged into a single consensus set at that point. You did this correctly in the first case by including a line for each peak caller with the same read (bam) files. At that point you can get a correlation heatmap, PCA plot, etc, as well as look at overlaps (e.g. by using dba.plotVenn and/or dba.overlap). One you create a binding matrix, as it done when you run dba.count, you are using a single "consensus" set of peaks for all the samples, and getting the number of reads in these peaks for each sample. So it no longer makes sense to have a different sets of counts for each original peakset. This is a result of the peaks being "merged" (by default, all the peaks that appear in at least two peaksets are merged into a single set of peaks for the rest of the analysis). If try what you suggest, and use symbolic links, you should get exactly the same result for each virtual replicate -- that is, the three entries should have correlation values of 1.0, as the same reads are being counted within the same (global, merged) consensus peakset. Cheers- Rory On 15/08/2013 11:57, "Giuseppe Gallone" <giuseppe.gallone at="" dpag.ox.ac.uk=""> wrote: >hm looks like I found the reason. DiffBind wants the bam files named by >replicate, even if they're the same file. 3 symbolic links per bam did >it. Still I wonder if this kind of analysis has any sense in your opinion. > >Thanks >Giuseppe > >On 08/15/13 11:48, Giuseppe Gallone wrote: >> Hi Rory >> >> >> thanks for your explanation, it makes sense. I have another question if >> you don't mind. I have a trio of samples. I don't have replicates so I >> wanted to check the agreement of 3 different peak callers, using each >> peak caller as a replicate. so my sampleSheet looks a bit like this >> >> >> >>SampleID,Tissue,Factor,Condition,Treatment,Replicate,bamReads,bamCon trol, >>Peaks,PeakFormat,ScoreCol,LowerBetter >> >> >>ID1.1,ID1,TF,mother,stimulated,1,ID1_reads.bam,,PeakCaller1/ID1_peak s.bed >>,raw,5,F >> >> >>ID1.2,ID1,TF,mother,stimulated,2,ID1_reads.bam,,PeakCaller2/ID1_peak s.bed >>,raw,5,F >> >> >>ID1.3,ID1,TF,mother,stimulated,3,ID1_reads.bam,,PeakCaller3/ID1_peak s.bed >>,raw,5,F >> >> >>ID2.1,ID2,TF,father,stimulated,1,ID2_reads.bam,,PeakCaller1/ID2_peak s.bed >>,raw,5,F >> >> >>ID2.2,ID2,TF,father,stimulated,2,ID2_reads.bam,,PeakCaller2/ID2_peak s.bed >>,raw,5,F >> >> >>ID2.3,ID2,TF,father,stimulated,3,ID2_reads.bam,,PeakCaller3/ID2_peak s.bed >>,raw,5,F >> >> >>ID3.1,ID3,TF,child,stimulated,1,ID3_reads.bam,,PeakCaller1/ID3_peaks .bed, >>raw,5,F >> >> >>ID3.2,ID3,TF,child,stimulated,2,ID3_reads.bam,,PeakCaller2/ID3_peaks .bed, >>raw,5,F >> >> >>ID3.3,ID3,TF,child,stimulated,3,ID3_reads.bam,,PeakCaller3/ID3_peaks .bed, >>raw,5,F >> >> >> >> The samples are loaded ok with dba, and I do see some clustering by >> replicate via a simple plot. However, I have problems with the dba.count >> call. Basically, the correlation plot produced by dba.count only shows >> three datapoints, corresponding to the 1st replicate (=peak caller n.1) >> for each of the three samples. Why is that? Am I doing something wrong. >> The only difference with your example is that, in your examples, the >> files for the aligned reads are different - one for each replicate. Am I >> "cheating" in using the same reads.bam file for each "replicate"? >>Thanks! >> >> Giuseppe >> >> >> >> On 08/14/13 14:38, Rory Stark wrote: >>> Hi Giuseppe- >>> >>> The standard heatmap plots the read densities of the differentially >>>bound >>> sites. The x-axis clusters the samples, and the y-axis clusters the >>>sites >>> based on the score for each site in each sample. The major clusters >>> should >>> show similar patterns of binding levels. >>> >>> In the example plot you sent, there are roughly three main clusters of >>> differentially bound sites. The bottom cluster has higher binding >>> rates in >>> the first (red) sample group (group one gain/group two loss). The >>>middle >>> cluster includes sites with higher binding rates in the second sample >>> group (group one loss/group two gain). The top cluster includes sites >>> with >>> substantial binding in both groups, but that nonetheless exhibit a >>> significant change in binding intensity at these sites; it looks in >>> general like these go from moderate binding in the first group to very >>> strong binding in the second group (especially in the sample cluster on >>> the far right). >>> >>> You can get a bit more contrast in these plots by using the "maxval" >>> parameter to clip the highly-boud sites (the long tail to the right in >>> the >>> Color Key). For example, in this case setting maxval=6 could probably >>> give >>> a clearer picture of what patterns are driving the clustering of >>>binding >>> sites. >>> >>> Cheers- >>> Rory >>> >>> On 14/08/2013 11:37, "Giuseppe Gallone" >>><giuseppe.gallone at="" dpag.ox.ac.uk=""> >>> wrote: >>> >>>> Hi Rory >>>> >>>> I have a further question about DiffBind. Could you tell me something >>>> more about the clustering visualisation obtained with >>>> dba.plotHeatmap(....correlations=FALSE)? I've carried out a >>>>differential >>>> analysis on my samples and I observe some interesting clustering using >>>> both EDGER and DESEQ. I then plotted the heatmap using >>>> correlation=FALSE. >>>> >>>> I understand that the clustering obtained with dba.visualise is >>>> reproduced on the x axis (columns are grouped by clustering). >>>> >>>> What is shown instead on the y axis? Are these the individual >>>> differentially bound sites across the genome? What is the clustering >>>> described on the left? >>>> >>>> Thanks once again. >>>> >>>> Best >>>> Giuseppe >>>> >>>> On 07/23/13 18:22, Rory Stark wrote: >>>>> Hi Giuseppe- >>>>> >>>>> I'm glad to sorted the column thing out, that was what I suspected. >>>>> >>>>> There shouldn't be much problem doing the analysis without a control >>>>> track, particularly if the samples come from the same tissue. The >>>>>main >>>>> role of the control tracks is for peak calling. The reason the >>>>>control >>>>> track is less important for differential analysis is that youy are >>>>> looking >>>>> at the relative differences in read density at the same genomic >>>>> intervals >>>>> across samples, and not comparing read densities across intervals. >>>>> So if >>>>> the control track were similar at that location for all samples, it >>>>> will >>>>> not affect the differential analysis. The main issue would be if >>>>>there >>>>> were something like big copy number differences between samples. Then >>>>> you >>>>> could get sites that show as differentially bound when the real >>>>> difference >>>>> was the copy number. But the difference would be real regardless. >>>>> >>>>> Regarding sequencing depth, this should be taken care of by the >>>>> normalisation step. It takes the library size (either full library >>>>> size, >>>>> which is the total number of reads, or the default effective library >>>>> size, >>>>> the number of reads within peaks for each sample) and adjusts the >>>>>read >>>>> counts. You can can an idea of how this is working by using the >>>>> dba.plotBox (with bAll=TRUE) comparing bNormalized=TRUE and >>>>> bNormalized=FALSE to see if things even out. Also, after counting, >>>>>you >>>>> can >>>>> look at the clustering (dba.plotPCA and dba.plotHeatmap) to see if >>>>> samples >>>>> are grouping by sequencing depth -- try doing the same plots with >>>>> different score, eg score=DBA_SCORE_READS, score=DBA_SCORE_RPKM, and >>>>> score=DBA_SCORE_TMM_READS_EFFECTIVE or >>>>> score=DBA_SCORE_TMM_READS_FULL to >>>>> see which gives to the best clustering. >>>>> >>>>> Hope this helps! >>>>> >>>>> Cheers- >>>>> Rory >>> >> > >-- >Dr Giuseppe Gallone >MRC career development fellow >MRC Functional Genomics Unit - DPAG >University of Oxford, UK