DiffBind: edgeR or DESeq2?
@floriannoack-9934
Last seen 6.8 years ago

Hi everybody,

Right now I am trying to analyse my MeDIP data set using DiffBind, and I am quite confused about whether to use edgeR or DESeq2 for the analysis (I am a biologist with zero training in statistics and bioinformatics, trying to teach himself all this). I have already tried both options, and DESeq2 gives me ~15% fewer differentially methylated sites. However, I am not sure how to judge whether I got rid of false positives or simply lost true peaks. Is there a rule of thumb for when to use edgeR or DESeq2?


Thanks for any suggestion :)

diffbind medip-seq edger deseq2 • 7.6k views
@gordon-smyth
Last seen just now
WEHI, Melbourne, Australia

No, there's no rule of thumb. It is probably a matter of personal preference.

For what it's worth, the DiffBind authors themselves used the edgeR option in their original publications applying DiffBind to ChIP-seq data, but DESeq2 didn't exist then.

You've actually asked a loaded question. edgeR and DESeq2 are two of the most popular Bioconductor packages. They have a lot of similarities, have comparable performance, and are in direct competition with one another. They are developed by two prominent bioinformatics groups, both of whom are active in the Bioconductor project. (Note that I am the senior author of the edgeR project.) Etiquette requires that we leave the judgement to third-party comparison studies rather than recommend our own software over that of the other group.

I just did a Google Scholar search for "edgeR DESeq2" to find third-party comparison studies. The first listed study ( http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1767-y ) decided that edgeR was best. The second listed study ( http://rnajournal.cshlp.org/content/20/11/1684.short ) recommended edgeR and DESeq2 as equal top. The third study ( https://peerj.com/articles/576 ) liked QuasiSeq best, with edgeR and DESeq2 second and third. And so it goes. I am sure there are and will be other studies making different recommendations. And these comparisons are for RNA-seq rather than MeDIP anyway.


Multiple Testing Correction: DESeq2 automatically corrects for multiple testing using the Benjamini-Hochberg method. Adjust the threshold (e.g., padj < 0.05) as needed.

Visualization: Use functions like plotMA, plotCounts, or heatmap to visualize the differential expression results.

Complex Contrasts: If you need more complex comparisons, you can use the listValues argument or relevel your factors to focus on specific comparisons.

This approach allows you to systematically compare all categories in your dataset using DESeq2.
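As a rough sketch of the points above (assuming a DESeqDataSet called `dds` has already been built and run; `dds` and the `condition` column are placeholders for your own object and design):

```r
library(DESeq2)

## Extract results; padj is already Benjamini-Hochberg adjusted
res <- results(dds, alpha = 0.05)

## Count regions passing the adjusted threshold
sum(res$padj < 0.05, na.rm = TRUE)

## MA plot and per-region counts for a quick visual check
plotMA(res)
plotCounts(dds, gene = which.min(res$padj), intgroup = "condition")
```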

Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 28 days ago
Cambridge, UK

From DiffBind's perspective, the main difference between edgeR and DESeq2 relates to the way they normalise the data. DiffBind uses edgeR's TMM normalization method in a way that assumes that most of the sites are not differentially bound. If this assumption is violated -- for example, we have experiments where we knock down the factor we are ChIPing -- the result as normalized using method=DBA_EDGER can be (quite) incorrect. It is for this reason that in the development version of DiffBind, the default has been changed to method=DBA_DESEQ2, at least until we can add some more advanced normalization features.

In your case, if one of your conditions is hypo-methylated and the other hyper-methylated, the normalization may be doing the wrong thing. It is worth checking whether this is what is going on. One way to do so is to use dba.plotMA() to make three plots and compare them:

> par(mfrow=c(3,1))
> dba.plotMA(myDBA, method=DBA_EDGER,  bNormalized=FALSE)
> dba.plotMA(myDBA, method=DBA_EDGER,  bNormalized=TRUE)
> dba.plotMA(myDBA, method=DBA_DESEQ2, bNormalized=TRUE)

Comparing the non-normalized data to the version normalized using edgeR and DESeq2 may shed light on what is going on.

Besides that, a ~15% difference is actually pretty decent agreement. After all, the threshold is somewhat arbitrary (the default is changing from 0.10 to 0.05 in future releases). It may be interesting to see how well the different methods agree on the specific sites they identify. You can get an idea of how consistent things are by plotting a Venn diagram:

> dbsites <- dba.plotVenn(myDBA,contrast=1,method=DBA_ALL_METHODS)

Note that the three peaksets in the Venn diagram (peaks identified by only edgeR, only DESeq2, and both methods) are returned in dbsites.

-Rory

Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 28 days ago
Cambridge, UK

The read score (e.g. DBA_SCORE_TMM_MINUS_EFFECTIVE) is only used for plotting global data, and is not used for analysis (contrasts). The normalisation is controlled by two parameters in dba.analyze(): bFullLibrarySize controls whether the total number of aligned reads is used for normalization (TRUE, the default), or only the number of reads that overlap peaks (FALSE); bSubControl will subtract the control reads (scaled, if that was specified in dba.count()).

It is worth reading the commentary regarding the wisdom of doing this. The danger is that the differential analysis methods assume they are working with true, "raw" read counts, and subtracting the control reads alters the count distribution (the actual effect of this alteration is not well studied). The benefit is that there may be some places where a high control signal can be used to eliminate what could become a false positive. We are now recommending using the GreyListChIP package to generate "blacklists" from the control tracks, and filtering overlapping reads prior to peak calling and read counting.
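For concreteness, a minimal sketch of where these parameters are passed (`myDBA` is a hypothetical DBA object; this follows the parameter names described above, so check your DiffBind version's documentation):

```r
library(DiffBind)

## Count reads, scaling the control to the ChIP library size
myDBA <- dba.count(myDBA, bScaleControl = TRUE)

## Normalize against the full library size and subtract (scaled) control reads
myDBA <- dba.analyze(myDBA, method = DBA_DESEQ2,
                     bFullLibrarySize = TRUE,  # total aligned reads (default)
                     bSubControl = TRUE)       # subtract control read counts

myDBA.report <- dba.report(myDBA)
```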

Calculating fold changes between ChIP and control, in my experience, doesn't yield very useful numbers because there are generally so many fewer reads in the control, and the fold change becomes overly sensitive to differences of one or two reads.

Given the large number of "peaks" you have, and the nature of MeDip, it may make more sense to try something like Gordon's csaw package for this analysis, which uses a windowing approach to identify enriched regions and skips a separate peak calling step.
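As an illustrative sketch of the windowing idea (the BAM file names and parameter values are placeholders; see the csaw user's guide for a proper workflow):

```r
library(csaw)

bam.files <- c("medip_1.bam", "medip_2.bam",
               "input_1.bam", "input_2.bam")

## Count reads in sliding windows across the genome;
## width can be matched to the expected enrichment size
param <- readParam(minq = 20)
counts <- windowCounts(bam.files, width = 150, param = param)
```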

Cheers-

Rory


Hi Rory,

Thanks for your suggestion about using GreyListChIP to handle false-positive signal. I'm now trying this approach.

I have some questions about applying the greylist.

If filtering before peak calling:

1) How should the filtering be done?

2) Do you know if this step is compatible with MACS2? Or is there no need to apply the greylist when using MACS2 (since MACS2 considers the input signal during peak calling)?

If filtering after peak calling:

1) To prepare the files (.bed and .bam) for DiffBind, what is the procedure? (Should I just remove the peaks that overlap the greylist from the bed file, or do I also need to do something with the bam file, since DiffBind mainly uses the bam files for counting?)

Thanks a lot! 

Yours Sincerely,

Kylie


Hi,

I recommend filtering before peak calling.  Peak callers such as MACS2 do use the input to filter out peaks that are also present in the input, but they do not cope particularly well in regions where the input is particularly noisy, and tend to call a lot of spurious peaks.  (That's essentially the whole reason for GreyListChIP.)  A tool such as bedtools ( https://github.com/arq5x/bedtools2 ) can be used to do the filtering.
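For example, the filtering might look something like this with bedtools (file names are placeholders; `-v` keeps only entries that do not overlap the greylist):

```shell
# Filter reads before peak calling
bedtools intersect -v -abam sample.bam -b greylist.bed > sample.filtered.bam

# Or, if filtering peaks after calling instead
bedtools intersect -v -a sample_peaks.bed -b greylist.bed > sample_peaks.filtered.bed
```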

@floriannoack-9934
Last seen 6.8 years ago

In other words, you recommend not using the subtraction method, but instead filtering out the peaks falling in the mentioned blacklist and setting bSubControl=FALSE?

About the csaw package: I am not sure a windowed approach is useful in my case, since I have quite "sharp", well-separated peaks (mostly around 125-250 bp) rather than a broad pileup of reads over a huge region. The pileups also get detected quite nicely by MACS2.


Optimally, you would blacklist prior to peak calling, as removing the reads from the problematic regions helps the accuracy of the modelling used to make the peak calls. Once the blacklist is applied (either before or after peak calling), you can set bSubControl=FALSE.

-R


If you are comparing expression data using DESeq2, you can integrate this approach by using the peak regions detected by MACS2 as input regions for differential analysis.

Convert peak regions to a format suitable for DESeq2. Use the windowed peak counts as input to DESeq2 to identify differentially enriched regions. This windowed strategy should complement the sharp peak detection nicely, allowing you to capture and analyze the precise regions of interest in your data.
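As a rough sketch of that idea (the count matrix `window.counts` and sample table `sample.table` are placeholders you would build from your own region counts):

```r
library(DESeq2)

## countData: rows = peak/window regions, columns = samples
## colData:   one row per sample, with a 'condition' column
dds <- DESeqDataSetFromMatrix(countData = window.counts,
                              colData   = sample.table,
                              design    = ~ condition)
dds <- DESeq(dds)
res <- results(dds)
```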

@floriannoack-9934
Last seen 6.8 years ago

Thanks a lot for the fast responses from both of you. The samples I am comparing are lineage-related cell types, and I already know from other analyses that I have only a few sites showing differential methylation (3k out of 600k called peaks in the other approach I used), so I think edgeR will be the better approach. However, I will use both tools and compare them visually, plus the Venn diagram.

However, I have another question: with the DBA_SCORE_TMM_MINUS_EFFECTIVE option (bScaleControl has to be set to TRUE to use this option, right?) it is possible to subtract the background (input control) signal from the sample signal. From my naive biological point of view this seems okay, since we do something similar in other kinds of analysis (Western blot analysis, imaging, ...). But I have read quite a few comments where people warn against subtracting the normalized background signal in ChIP experiments, so I am a bit confused. In the other approach I used, I calculated a fold change between the normalized read number of the sample and the normalized read number of the input (control).

Thanks again, you're a big help :).


DBA_SCORE_TMM_MINUS_EFFECTIVE:

Purpose: This scoring method leverages the Trimmed Mean of M-values (TMM) normalization from the edgeR package. It adjusts for differences in library size and composition between samples.

Functionality: By subtracting the (scaled) control reads and normalizing to the effective library size, it provides a more robust score, which is especially useful when samples have varying sequencing depths or composition biases.

Scale Control:

Purpose: Scaling controls adjust for differences in sequencing depth and other systematic variation between samples.

When to set to TRUE: When performing normalization methods that require scaling factors to be applied to the control samples, so that the normalization accounts for these systematic differences.

Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 28 days ago
Cambridge, UK

While there are indeed some theoretical issues with altering the read counts by subtracting the Input reads from the ChIP reads, empirically we have found that doing so represents a fairly conservative use of the Input to limit false positives.

A more theoretically sound method is to use blacklists and greylists to simply eliminate regions where the Input control is showing an unusually high signal. These should be applied prior to peak calling if there is a peak calling step. The blacklists can be obtained from UCSC or ENCODE, and the GreyListChIP package can be used to build a custom blacklist based on your Inputs. 
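As an illustrative sketch of building a custom greylist (the karyotype file and input BAM names are placeholders; check the GreyListChIP vignette for the exact workflow and tuning parameters):

```r
library(GreyListChIP)

## Build a greylist from an input/control BAM
gl <- new("GreyList", karyoFile = "karyotype.txt")
gl <- countReads(gl, "input.bam")
gl <- calcThreshold(gl)
gl <- makeGreyList(gl)

## Export for use with e.g. bedtools filtering
export(gl, con = "greylist.bed")
```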

It is not a requirement that the bScaleControl option in dba.count() be set to TRUE, but it is certainly recommended if you set bSubControl=TRUE in a subsequent call to dba.analyze(). Setting score=DBA_SCORE_TMM_MINUS_EFFECTIVE in dba.count() actually has no effect on how counts are calculated and normalized for the analysis. These scores are only used for plots of global data. When you call dba.analyze(), the options included there are used to adjust the read counts (specifically, bSubControl=TRUE). 

Cheers-

Rory
