Entering edit mode
Hello-
1. There are two ways to use less memory in dba.count. The best way is
to update to the latest version of DiffBind in Bioconductor 2.12.
dba.count now has a parameter to do the counting with less memory (but
a bit slower): bLowMem = TRUE.
The other option is to run dba.count serially so it uses all available
memory for each file. You can set bParallel=FALSE to do this -- it is
a lot slower though! You can also try to lower the number of cores
being used when it runs in parallel, e.g.
> DBA$config$cores = 2
2. To get the numbers of peaks in each overlap, you need to set
bCorOnly=FALSE. If there is a place in the documentation where I
missed this, could you point it out to me so I can fix it?
3. In the case of plotting peaks (without counts), DiffBind first
merges all the overlapping peaks to form a master peak list. To
construct the vector for each peakset, it assigns a value of -1 for
each peak that was not called in that peakset, and the score specified
by the peak caller (normalized to be between 0 and 1) for peaks that
are present. (If more than one peak got merged together for the
peakset, it uses the highest score). As you say, it then computes the
correlation between each pair of vectors for the heatmap.
Once you have run dba.count, it uses scores based on the counts.
Counts are all set to a minimum of 1 (in the case where there are no
reads or more reads in the control than the experiment). If there are
negative correlation values, this is not because there are negative
counts, but because the peak vectors are negatively correlated. If you
have peaks with low values for all the samples, you can filter them
out using the filter option in dba.count (you can do this without re-
counting by setting peaks=NULL). The latest version has more advanced
filtering options than the version you are using.
The issue with the merge procedure making the peaks very wide,
especially in the case where you are looking at a lot of different
marks simultaneously, is a legitimate one. You may want to compare the
distribution of peak widths before and after the merge.. We have run
successful analyses looking at very wide regions of enrichment (e.g.
500kb windows in Chandra et. al. Molecular Cell 2012). If you really
want to look at different patterns of marks within a region (e.g. a
promoter), you may want to take a look at the MMDiff package -- you
can use your existing DiffBind objects in MMDiff.
4. Currently DiffBind will merge together any peaks that overlap by at
least one base, and you can not change it. Actually, internally, we do
have a parameter controlling how many bases the overlap should be
(which can also be used to merge peaks that are close but not actually
overlapping), but we have not exposed it in the interface. We are
looking at features that enable you to have better control the merging
function. I could probably expose the "gap" parameter pretty easily in
the development version and make that available to you quickly if you
are interested.
Do say hello to Mike for me!
Cheers-
Rory
________________________________
From: Huayun Hou [huayunhou@gmail.com]
Sent: 03 May 2013 14:00
To: rory.stark@cancer.org.uk
Subject: DiffBind questions
Dear Dr. Stark,
I'm a master student in Dr. Michael Wilson's lab. We've been using
DiffBind to analyze our ChIP-seq data. Thanks for this very handy
tools. But I have a few questions regarding some of the functions.
1. Memory allocation error when using dba.count()
I have a dataset of 22 samples, each around 10-15 million reads. When
I use dba.count() I always get error: cannot allocate vector of size
...
I'm using R on cluster, sessionInfo()
sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] DiffBind_1.2.4 Biobase_2.16.0 GenomicRanges_1.8.13
[4] IRanges_1.14.4 BiocGenerics_0.2.0
loaded via a namespace (and not attached):
[1] amap_0.8-7 edgeR_2.6.12 gdata_2.12.0
gplots_2.11.0
[5] gtools_2.7.0 limma_3.12.3 RColorBrewer_1.0-5
stats4_2.15.0
[9] zlibbioc_1.2.0
I've asked for the newest version of DiffBind but I wonder if you
could give any suggestions on working with large dataset with
DiffBind. Do I need do some settin g in R or I just need more local
memory, for example, run on a super computer.
2. dba.overlap() doesn't give peak numbers.
I did: dba.overlap(test,c(1,2),mode=DBA_OLAP_ALL)
the result: A B onlyA onlyB inAll Cor
Overlap
1.0000000 2.0000000 0.0000000 0.0000000 0.0000000 0.9999698 0.0000000
It doesn't give the number of peaks as described in the manual.
3. How is the correlation calculated?
When plotting the correlation heatmap using
dba(sampleSheet)
plot()
My understanding is the package merges all the peak regions from all
the samples to generate a master peakset, and ask if each peak region
in each sample is in that master peakset or not. Then calculate the
correlation between 2 vectors of 0 and 1 and plot the heatmap with
this correlation matrix.
One concern for me is I'm comparing different factors which may or may
not have different binding profiles. Will the merging step generate
too broad peak regions which just cover everything? Or if the master
peaks region has too many peaks and for all the samples there are too
many 0s in their peaks vector. Will they be correlated because of
regions where there are no peaks instead of where there are peaks?
Also, is the dba.count() and dba.analyze() using the same approach?
Then there may be some regions for some samples with a negative RPKM-
cRPKM because those regions are not called "peaks" originally. Then
the result correlation matrix may have some negative correlation
values in it. I wonder if that comparison is fair in my case (I pooled
all biological replicates before loading datasets to DiffBind, so I'm
comparing biding between different vectors without replicates) ?
4. Can the overlap threshold be changed?
Currently I think the package uses minoverlap 1bp as the threshold for
called 2 peaks overlap. Can I change this threshold for my own
convenience ?
Thanks again for the great tool! I'm looking forward for your reply.
Best Regards,
Huayun Hou
--
Huayun Hou
Department of Molecular Genetics
University of Toronto
B.Sc. Life Science and Psychology
Peking University
NOTICE AND DISCLAIMER
This e-mail (including any attachments) is intended for
...{{dropped:20}}