Question: DiffBind questions
0
6.3 years ago by
Rory Stark3.0k
CRUK, Cambridge, UK
Rory Stark3.0k wrote:
In answer to your first question, for the initial occupancy (peak caller) matrix, DiffBind has always used the scores form the peak caller, normalized to be between 0 and 1, with a score of ?1 for peaks not called for a sample. These scores are used for the correlations. Often the clustering at this level is driven by overall signal levels: if many of the samples have few peaks, and some have many peaks (or vice-versa), the ones with a many/few peaks can form their own cluster even if they don;t really have that much in common. As for your analysis, one observation I can make is that overall there is lower correlation between the samples than most of the more successful analyses I have done. This makes sense as you are comparing some quite different transcription factors (HNF6, FOXA1, CTCF, etc) and histone marks (H3K27ac, H34me2, H3K36me3 etc.). It is not surprising that different normalization scores are producing different results in this case! While I have also found that DBA_SCORE_RPKM_FOLD frequently gives a clustering I find intuitive, the default score, DBA_SCORE_TMM_MINUS_EFFECTIVE, only really works if a key assumption is met: namely that the samples have roughly similar overall levels of binding activity. In many cases, this is not the case -- especially when you are comparing so many different factors/marks. If for example you save a high variable in the SN scores after counting, that can be an indication of this. I am planning on changing the default score to be DBA_SCORE_TMM_MINUS_FULL, which normalizes to the full depth of sequencing for each library, rather than the number of reads that overlap the consensus peaks. I suggest you try that score and see how it compares to RPKM_FOLD. Also, from the labels I think perhaps you have combines replicates of specific factors/marks? You may want to break them out so you have replicates of each ChIP, if only as a control to see that the replicates cluster together (with high correlation values). Cheers- Rory From: Huayun Hou <huayunhou@gmail.com> Date: Wed, 28 Aug 2013 16:46:38 -0400 To: Rory Stark <rory.stark at="" cruk.cam.ac.uk=""> Subject: Re: DiffBind questions Hi Dr. Stark, I've been using DiffBind for my analysis and I have a question about the occupancy analysis in DiffBind. After reading in the dba object from sample sheet, a correlation heatmap can be generated base on peaks set data given. Command as follows: >Data <- dba(sampleSheet="sampleSheet.csv") >plot(Data) To my understanding, in the previous version, only the existence of peaks are taken into account, which means essentially the heatmap was made by pair-wise correlation of vectors consist of 0 and 1. But in the new version (1.6.2) I found that the vectors get from peak sets contains also a number (rather than 0 and 1). I might lost when reading the manual but i was wondering if peak score given in the peak sets are also taken into account when the correlation is made? I was trying to understand the global binding correlations between all the factors I'm interested in so I did the correlation heatmap base on the read count. After dba.count, similar correlation heatmaps are made, but by using the following commands, that is using different scores I got very different correlations. dba.plotHeatmap(data_count, ColAttributes=NULL, correlations=T, socre=RPKM_FOLD) dba.plotHeatmap(data_count, ColAttributes=NULL, correlations=T) (default socre method) I attached the heatmap generated in each step. Files labelled with "TMM" and "RPKMfold" correspond to default score method and RPKM_FOLD. The one uses RPKM_FOLD makes more sense to me. The mouse liver chip-seq data was taken from Fuare et. al (2012) and also generated from Odam lab. I wonder if you could given me any suggestions or commands on why the results look different and which one makes more sense in terms of my analysis. Thanks a lot and please let me know if anything is not clear. Best Regards, Huayun Hou 2013/5/4 Rory Stark <rory.stark at="" cruk.cam.ac.uk=""> Hello- 1. There are two ways to use less memory in dba.count. The best way is to update to the latest version of DiffBind in Bioconductor 2.12. dba.count now has a parameter to do the counting with less memory (but a bit slower): bLowMem = TRUE. The other option is to run dba.count serially so it uses all available memory for each file. You can set bParallel=FALSE to do this -- it is a lot slower though! You can also try to lower the number of cores being used when it runs in parallel, e.g. > DBA$config$cores = 2 2. To get the numbers of peaks in each overlap, you need to set bCorOnly=FALSE. If there is a place in the documentation where I missed this, could you point it out to me so I can fix it? 3. In the case of plotting peaks (without counts), DiffBind first merges all the overlapping peaks to form a master peak list. To construct the vector for each peakset, it assigns a value of -1 for each peak that was not called in that peakset, and the score specified by the peak caller (normalized to be between 0 and 1) for peaks that are present. (If more than one peak got merged together for the peakset, it uses the highest score). As you say, it then computes the correlation between each pair of vectors for the heatmap. Once you have run dba.count, it uses scores based on the counts. Counts are all set to a minimum of 1 (in the case where there are no reads or more reads in the control than the experiment). If there are negative correlation values, this is not because there are negative counts, but because the peak vectors are negatively correlated. If you have peaks with low values for all the samples, you can filter them out using the filter option in dba.count (you can do this without re-counting by setting peaks=NULL). The latest version has more advanced filtering options than the version you are using. The issue with the merge procedure making the peaks very wide, especially in the case where you are looking at a lot of different marks simultaneously, is a legitimate one. You may want to compare the distribution of peak widths before and after the merge.. We have run successful analyses looking at very wide regions of enrichment (e.g. 500kb windows in Chandra et. al. Molecular Cell 2012). If you really want to look at different patterns of marks within a region (e.g. a promoter), you may want to take a look at the MMDiff package -- you can use your existing DiffBind objects in MMDiff. 4. Currently DiffBind will merge together any peaks that overlap by at least one base, and you can not change it. Actually, internally, we do have a parameter controlling how many bases the overlap should be (which can also be used to merge peaks that are close but not actually overlapping), but we have not exposed it in the interface. We are looking at features that enable you to have better control the merging function. I could probably expose the "gap" parameter pretty easily in the development version and make that available to you quickly if you are interested. Do say hello to Mike for me! Cheers- Rory ________________________________________ From: Huayun Hou [huayunhou@gmail.com] Sent: 03 May 2013 14:00 To: rory.stark at cancer.org.uk Subject: DiffBind questions Dear Dr. Stark, I'm a master student in Dr. Michael Wilson's lab. We've been using DiffBind to analyze our ChIP-seq data. Thanks for this very handy tools. But I have a few questions regarding some of the functions. 1. Memory allocation error when using dba.count() I have a dataset of 22 samples, each around 10-15 million reads. When I use dba.count() I always get error: cannot allocate vector of size ... I'm using R on cluster, sessionInfo() sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] DiffBind_1.2.4 Biobase_2.16.0 GenomicRanges_1.8.13 [4] IRanges_1.14.4 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] amap_0.8-7 edgeR_2.6.12 gdata_2.12.0 gplots_2.11.0 [5] gtools_2.7.0 limma_3.12.3 RColorBrewer_1.0-5 stats4_2.15.0 [9] zlibbioc_1.2.0 I've asked for the newest version of DiffBind but I wonder if you could give any suggestions on working with large dataset with DiffBind. Do I need do some settin g in R or I just need more local memory, for example, run on a super computer. 2. dba.overlap() doesn't give peak numbers. I did: dba.overlap(test,c(1,2),mode=DBA_OLAP_ALL) the result: A B onlyA onlyB inAll Cor Overlap 1.0000000 2.0000000 0.0000000 0.0000000 0.0000000 0.9999698 0.0000000 It doesn't give the number of peaks as described in the manual. 3. How is the correlation calculated? When plotting the correlation heatmap using dba(sampleSheet) plot() My understanding is the package merges all the peak regions from all the samples to generate a master peakset, and ask if each peak region in each sample is in that master peakset or not. Then calculate the correlation between 2 vectors of 0 and 1 and plot the heatmap with this correlation matrix. One concern for me is I'm comparing different factors which may or may not have different binding profiles. Will the merging step generate too broad peak regions which just cover everything? Or if the master peaks region has too many peaks and for all the samples there are too many 0s in their peaks vector. Will they be correlated because of regions where there are no peaks instead of where there are peaks? Also, is the dba.count() and dba.analyze() using the same approach? Then there may be some regions for some samples with a negative RPKM-cRPKM because those regions are not called "peaks" originally. Then the result correlation matrix may have some negative correlation values in it. I wonder if that comparison is fair in my case (I pooled all biological replicates before loading datasets to DiffBind, so I'm comparing biding between different vectors without replicates) ? 4. Can the overlap threshold be changed? Currently I think the package uses minoverlap 1bp as the threshold for called 2 peaks overlap. Can I change this threshold for my own convenience ? Thanks again for the great tool! I'm looking forward for your reply. Best Regards, Huayun Hou -- Huayun Hou Department of Molecular Genetics University of Toronto B.Sc. Life Science and Psychology Peking University NOTICE AND DISCLAIMER This e-mail (including any attachments) is intended for the above- named person(s). If you are not the intended recipient, notify the sender immediately, delete this email from your system and do not disclose or use for any purpose. We may monitor all incoming and outgoing emails in line with current legislation. We have taken steps to ensure that this email and attachments are free from any virus, but it remains your responsibility to ensure that viruses do not adversely affect you. Cancer Research UK Registered charity in England and Wales (1089464), Scotland (SC041666) and the Isle of Man (1103) A company limited by guarantee. Registered company in England and Wales (4325234) and the Isle of Man (5713F). Registered Office Address: Angel Building, 407 St John Street, London EC1V 4AD. -- Huayun Hou Department of Molecular Genetics University of Toronto B.Sc. Life Science and Psychology Peking University