Hi Giuseppe-
Two compare different peak callers on the same replicate, you can get
the
clustering/correlation at the peak level but it doesn't make sense at
the
count level, as all the peaks are merged into a single consensus set
at
that point.
You did this correctly in the first case by including a line for each
peak
caller with the same read (bam) files. At that point you can get a
correlation heatmap, PCA plot, etc, as well as look at overlaps (e.g.
by
using dba.plotVenn and/or dba.overlap).
One you create a binding matrix, as it done when you run dba.count,
you
are using a single "consensus" set of peaks for all the samples, and
getting the number of reads in these peaks for each sample. So it no
longer makes sense to have a different sets of counts for each
original
peakset. This is a result of the peaks being "merged" (by default, all
the
peaks that appear in at least two peaksets are merged into a single
set of
peaks for the rest of the analysis).
If try what you suggest, and use symbolic links, you should get
exactly
the same result for each virtual replicate -- that is, the three
entries
should have correlation values of 1.0, as the same reads are being
counted
within the same (global, merged) consensus peakset.
Cheers-
Rory
On 15/08/2013 11:57, "Giuseppe Gallone" <giuseppe.gallone at="" dpag.ox.ac.uk="">
wrote:
>hm looks like I found the reason. DiffBind wants the bam files named
by
>replicate, even if they're the same file. 3 symbolic links per bam
did
>it. Still I wonder if this kind of analysis has any sense in your
opinion.
>
>Thanks
>Giuseppe
>
>On 08/15/13 11:48, Giuseppe Gallone wrote:
>> Hi Rory
>>
>>
>> thanks for your explanation, it makes sense. I have another
question if
>> you don't mind. I have a trio of samples. I don't have replicates
so I
>> wanted to check the agreement of 3 different peak callers, using
each
>> peak caller as a replicate. so my sampleSheet looks a bit like this
>>
>>
>>
>>SampleID,Tissue,Factor,Condition,Treatment,Replicate,bamReads,bamCon
trol,
>>Peaks,PeakFormat,ScoreCol,LowerBetter
>>
>>
>>ID1.1,ID1,TF,mother,stimulated,1,ID1_reads.bam,,PeakCaller1/ID1_peak
s.bed
>>,raw,5,F
>>
>>
>>ID1.2,ID1,TF,mother,stimulated,2,ID1_reads.bam,,PeakCaller2/ID1_peak
s.bed
>>,raw,5,F
>>
>>
>>ID1.3,ID1,TF,mother,stimulated,3,ID1_reads.bam,,PeakCaller3/ID1_peak
s.bed
>>,raw,5,F
>>
>>
>>ID2.1,ID2,TF,father,stimulated,1,ID2_reads.bam,,PeakCaller1/ID2_peak
s.bed
>>,raw,5,F
>>
>>
>>ID2.2,ID2,TF,father,stimulated,2,ID2_reads.bam,,PeakCaller2/ID2_peak
s.bed
>>,raw,5,F
>>
>>
>>ID2.3,ID2,TF,father,stimulated,3,ID2_reads.bam,,PeakCaller3/ID2_peak
s.bed
>>,raw,5,F
>>
>>
>>ID3.1,ID3,TF,child,stimulated,1,ID3_reads.bam,,PeakCaller1/ID3_peaks
.bed,
>>raw,5,F
>>
>>
>>ID3.2,ID3,TF,child,stimulated,2,ID3_reads.bam,,PeakCaller2/ID3_peaks
.bed,
>>raw,5,F
>>
>>
>>ID3.3,ID3,TF,child,stimulated,3,ID3_reads.bam,,PeakCaller3/ID3_peaks
.bed,
>>raw,5,F
>>
>>
>>
>> The samples are loaded ok with dba, and I do see some clustering by
>> replicate via a simple plot. However, I have problems with the
dba.count
>> call. Basically, the correlation plot produced by dba.count only
shows
>> three datapoints, corresponding to the 1st replicate (=peak caller
n.1)
>> for each of the three samples. Why is that? Am I doing something
wrong.
>> The only difference with your example is that, in your examples,
the
>> files for the aligned reads are different - one for each replicate.
Am I
>> "cheating" in using the same reads.bam file for each "replicate"?
>>Thanks!
>>
>> Giuseppe
>>
>>
>>
>> On 08/14/13 14:38, Rory Stark wrote:
>>> Hi Giuseppe-
>>>
>>> The standard heatmap plots the read densities of the
differentially
>>>bound
>>> sites. The x-axis clusters the samples, and the y-axis clusters
the
>>>sites
>>> based on the score for each site in each sample. The major
clusters
>>> should
>>> show similar patterns of binding levels.
>>>
>>> In the example plot you sent, there are roughly three main
clusters of
>>> differentially bound sites. The bottom cluster has higher binding
>>> rates in
>>> the first (red) sample group (group one gain/group two loss). The
>>>middle
>>> cluster includes sites with higher binding rates in the second
sample
>>> group (group one loss/group two gain). The top cluster includes
sites
>>> with
>>> substantial binding in both groups, but that nonetheless exhibit a
>>> significant change in binding intensity at these sites; it looks
in
>>> general like these go from moderate binding in the first group to
very
>>> strong binding in the second group (especially in the sample
cluster on
>>> the far right).
>>>
>>> You can get a bit more contrast in these plots by using the
"maxval"
>>> parameter to clip the highly-boud sites (the long tail to the
right in
>>> the
>>> Color Key). For example, in this case setting maxval=6 could
probably
>>> give
>>> a clearer picture of what patterns are driving the clustering of
>>>binding
>>> sites.
>>>
>>> Cheers-
>>> Rory
>>>
>>> On 14/08/2013 11:37, "Giuseppe Gallone"
>>><giuseppe.gallone at="" dpag.ox.ac.uk="">
>>> wrote:
>>>
>>>> Hi Rory
>>>>
>>>> I have a further question about DiffBind. Could you tell me
something
>>>> more about the clustering visualisation obtained with
>>>> dba.plotHeatmap(....correlations=FALSE)? I've carried out a
>>>>differential
>>>> analysis on my samples and I observe some interesting clustering
using
>>>> both EDGER and DESEQ. I then plotted the heatmap using
>>>> correlation=FALSE.
>>>>
>>>> I understand that the clustering obtained with dba.visualise is
>>>> reproduced on the x axis (columns are grouped by clustering).
>>>>
>>>> What is shown instead on the y axis? Are these the individual
>>>> differentially bound sites across the genome? What is the
clustering
>>>> described on the left?
>>>>
>>>> Thanks once again.
>>>>
>>>> Best
>>>> Giuseppe
>>>>
>>>> On 07/23/13 18:22, Rory Stark wrote:
>>>>> Hi Giuseppe-
>>>>>
>>>>> I'm glad to sorted the column thing out, that was what I
suspected.
>>>>>
>>>>> There shouldn't be much problem doing the analysis without a
control
>>>>> track, particularly if the samples come from the same tissue.
The
>>>>>main
>>>>> role of the control tracks is for peak calling. The reason the
>>>>>control
>>>>> track is less important for differential analysis is that youy
are
>>>>> looking
>>>>> at the relative differences in read density at the same genomic
>>>>> intervals
>>>>> across samples, and not comparing read densities across
intervals.
>>>>> So if
>>>>> the control track were similar at that location for all samples,
it
>>>>> will
>>>>> not affect the differential analysis. The main issue would be if
>>>>>there
>>>>> were something like big copy number differences between samples.
Then
>>>>> you
>>>>> could get sites that show as differentially bound when the real
>>>>> difference
>>>>> was the copy number. But the difference would be real
regardless.
>>>>>
>>>>> Regarding sequencing depth, this should be taken care of by the
>>>>> normalisation step. It takes the library size (either full
library
>>>>> size,
>>>>> which is the total number of reads, or the default effective
library
>>>>> size,
>>>>> the number of reads within peaks for each sample) and adjusts
the
>>>>>read
>>>>> counts. You can can an idea of how this is working by using the
>>>>> dba.plotBox (with bAll=TRUE) comparing bNormalized=TRUE and
>>>>> bNormalized=FALSE to see if things even out. Also, after
counting,
>>>>>you
>>>>> can
>>>>> look at the clustering (dba.plotPCA and dba.plotHeatmap) to see
if
>>>>> samples
>>>>> are grouping by sequencing depth -- try doing the same plots
with
>>>>> different score, eg score=DBA_SCORE_READS, score=DBA_SCORE_RPKM,
and
>>>>> score=DBA_SCORE_TMM_READS_EFFECTIVE or
>>>>> score=DBA_SCORE_TMM_READS_FULL to
>>>>> see which gives to the best clustering.
>>>>>
>>>>> Hope this helps!
>>>>>
>>>>> Cheers-
>>>>> Rory
>>>
>>
>
>--
>Dr Giuseppe Gallone
>MRC career development fellow
>MRC Functional Genomics Unit - DPAG
>University of Oxford, UK