Question

DiffBind questions

0

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 11 weeks ago

Cambridge, UK

Hello- 1. There are two ways to use less memory in dba.count. The best way is to update to the latest version of DiffBind in Bioconductor 2.12. dba.count now has a parameter to do the counting with less memory (but a bit slower): bLowMem = TRUE. The other option is to run dba.count serially so it uses all available memory for each file. You can set bParallel=FALSE to do this -- it is a lot slower though! You can also try to lower the number of cores being used when it runs in parallel, e.g. > DBA$config$cores = 2 2. To get the numbers of peaks in each overlap, you need to set bCorOnly=FALSE. If there is a place in the documentation where I missed this, could you point it out to me so I can fix it? 3. In the case of plotting peaks (without counts), DiffBind first merges all the overlapping peaks to form a master peak list. To construct the vector for each peakset, it assigns a value of -1 for each peak that was not called in that peakset, and the score specified by the peak caller (normalized to be between 0 and 1) for peaks that are present. (If more than one peak got merged together for the peakset, it uses the highest score). As you say, it then computes the correlation between each pair of vectors for the heatmap. Once you have run dba.count, it uses scores based on the counts. Counts are all set to a minimum of 1 (in the case where there are no reads or more reads in the control than the experiment). If there are negative correlation values, this is not because there are negative counts, but because the peak vectors are negatively correlated. If you have peaks with low values for all the samples, you can filter them out using the filter option in dba.count (you can do this without re- counting by setting peaks=NULL). The latest version has more advanced filtering options than the version you are using. The issue with the merge procedure making the peaks very wide, especially in the case where you are looking at a lot of different marks simultaneously, is a legitimate one. You may want to compare the distribution of peak widths before and after the merge.. We have run successful analyses looking at very wide regions of enrichment (e.g. 500kb windows in Chandra et. al. Molecular Cell 2012). If you really want to look at different patterns of marks within a region (e.g. a promoter), you may want to take a look at the MMDiff package -- you can use your existing DiffBind objects in MMDiff. 4. Currently DiffBind will merge together any peaks that overlap by at least one base, and you can not change it. Actually, internally, we do have a parameter controlling how many bases the overlap should be (which can also be used to merge peaks that are close but not actually overlapping), but we have not exposed it in the interface. We are looking at features that enable you to have better control the merging function. I could probably expose the "gap" parameter pretty easily in the development version and make that available to you quickly if you are interested. Do say hello to Mike for me! Cheers- Rory ________________________________ From: Huayun Hou [huayunhou@gmail.com] Sent: 03 May 2013 14:00 To: rory.stark@cancer.org.uk Subject: DiffBind questions Dear Dr. Stark, I'm a master student in Dr. Michael Wilson's lab. We've been using DiffBind to analyze our ChIP-seq data. Thanks for this very handy tools. But I have a few questions regarding some of the functions. 1. Memory allocation error when using dba.count() I have a dataset of 22 samples, each around 10-15 million reads. When I use dba.count() I always get error: cannot allocate vector of size ... I'm using R on cluster, sessionInfo() sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] DiffBind_1.2.4 Biobase_2.16.0 GenomicRanges_1.8.13 [4] IRanges_1.14.4 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] amap_0.8-7 edgeR_2.6.12 gdata_2.12.0 gplots_2.11.0 [5] gtools_2.7.0 limma_3.12.3 RColorBrewer_1.0-5 stats4_2.15.0 [9] zlibbioc_1.2.0 I've asked for the newest version of DiffBind but I wonder if you could give any suggestions on working with large dataset with DiffBind. Do I need do some settin g in R or I just need more local memory, for example, run on a super computer. 2. dba.overlap() doesn't give peak numbers. I did: dba.overlap(test,c(1,2),mode=DBA_OLAP_ALL) the result: A B onlyA onlyB inAll Cor Overlap 1.0000000 2.0000000 0.0000000 0.0000000 0.0000000 0.9999698 0.0000000 It doesn't give the number of peaks as described in the manual. 3. How is the correlation calculated? When plotting the correlation heatmap using dba(sampleSheet) plot() My understanding is the package merges all the peak regions from all the samples to generate a master peakset, and ask if each peak region in each sample is in that master peakset or not. Then calculate the correlation between 2 vectors of 0 and 1 and plot the heatmap with this correlation matrix. One concern for me is I'm comparing different factors which may or may not have different binding profiles. Will the merging step generate too broad peak regions which just cover everything? Or if the master peaks region has too many peaks and for all the samples there are too many 0s in their peaks vector. Will they be correlated because of regions where there are no peaks instead of where there are peaks? Also, is the dba.count() and dba.analyze() using the same approach? Then there may be some regions for some samples with a negative RPKM- cRPKM because those regions are not called "peaks" originally. Then the result correlation matrix may have some negative correlation values in it. I wonder if that comparison is fair in my case (I pooled all biological replicates before loading datasets to DiffBind, so I'm comparing biding between different vectors without replicates) ? 4. Can the overlap threshold be changed? Currently I think the package uses minoverlap 1bp as the threshold for called 2 peaks overlap. Can I change this threshold for my own convenience ? Thanks again for the great tool! I'm looking forward for your reply. Best Regards, Huayun Hou -- Huayun Hou Department of Molecular Genetics University of Toronto B.Sc. Life Science and Psychology Peking University NOTICE AND DISCLAIMER This e-mail (including any attachments) is intended for ...{{dropped:20}}

DiffBind MMDiff DiffBind MMDiff • 2.3k views

ADD COMMENT • link 11.9 years ago Rory Stark ★ 5.2k