Question: DiffBind: edgeR or DESeq2?
florian.noack wrote (20 months ago):

Hi everybody,

Right now I am trying to analyse my MeDIP data set using DiffBind, and I am quite confused about whether to use edgeR or DESeq2 for the analysis (I am a biologist with zero training in statistics and bioinformatics, trying to teach myself all this). I have already tried both options, and DESeq2 gives me ~15% fewer differentially methylated sites. However, I am not sure how to judge whether I am getting rid of false positives or simply losing true positive peaks. Is there a rule of thumb for when to use edgeR versus DESeq2?

 

Thanks for any suggestion :)

— modified 18 days ago by Rory Stark • written 20 months ago by florian.noack
Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote (20 months ago):

No, there's no rule of thumb. It is probably a matter of personal preference.

For what it's worth, the DiffBind authors themselves used the edgeR option of DiffBind in their original publications analysing ChIP-seq data, but DESeq2 didn't exist then.

You've actually asked a loaded question. edgeR and DESeq2 are two of the most popular Bioconductor packages. They have a lot of similarities, have comparable performance, and are in direct competition with one another. They are developed by two prominent bioinformatics groups, both of whom are active in the Bioconductor project. (Note that I am the senior author of the edgeR project.) Etiquette requires that we leave the judgement to third-party comparison studies rather than answering your question by recommending our software over that of the other group.

I just did a Google Scholar search for "edgeR DESeq2" to find third-party comparison studies. The first listed study ( http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1767-y ) decided that edgeR was best. The second listed study ( http://rnajournal.cshlp.org/content/20/11/1684.short ) recommended edgeR and DESeq2 as equal top. The third study ( https://peerj.com/articles/576 ) liked QuasiSeq best, with edgeR and DESeq2 second and third. And so it goes. I am sure there are and will be other studies making different recommendations. And these comparisons are for RNA-seq rather than MeDIP anyway.

 

Rory Stark (CRUK, Cambridge, UK) wrote (20 months ago):

From DiffBind's perspective, the main difference between edgeR and DESeq2 lies in the way they normalise the data. DiffBind uses edgeR's TMM normalization method in a way that assumes most of the sites are not differentially bound. If this assumption is violated -- for example, in experiments where we knock down the factor we are ChIPing -- the results normalized using method=DBA_EDGER can be quite incorrect. For this reason, in the development version of DiffBind, the default has been changed to method=DBA_DESEQ2, at least until we can add some more advanced normalization features.

In your case, if one of your conditions is hypo-methylated and the other hyper-methylated, the normalization may be doing the wrong thing. It is worth checking whether this is what is going on. One way to do so is to use dba.plotMA() to make three plots and compare them:

> par(mfrow=c(3,1))                                        # stack three MA plots
> dba.plotMA(myDBA, method=DBA_EDGER,  bNormalized=FALSE)  # non-normalized counts
> dba.plotMA(myDBA, method=DBA_EDGER,  bNormalized=TRUE)   # edgeR (TMM) normalization
> dba.plotMA(myDBA, method=DBA_DESEQ2, bNormalized=TRUE)   # DESeq2 normalization

Comparing the non-normalized data to the version normalized using edgeR and DESeq2 may shed light on what is going on.

Besides that, a ~15% difference is actually pretty decent agreement. After all, the threshold is somewhat arbitrary (the default is changing from 0.10 to 0.05 in future releases). It may be interesting to see how well the different methods agree on the specific sites they identify. You can get an idea of how consistent things are by plotting a Venn diagram:

> dbsites <- dba.plotVenn(myDBA,contrast=1,method=DBA_ALL_METHODS)

Note that the three peaksets in the Venn diagram (peaks identified by only edgeR, only DESeq2, and both methods) are returned in dbsites.

-Rory

Rory Stark (CRUK, Cambridge, UK) wrote (20 months ago):

The read score (e.g. DBA_SCORE_TMM_MINUS_EFFECTIVE) is only used for plotting global data; it is not used for the analysis (contrasts). The normalisation is controlled by two parameters in dba.analyze(). bFullLibrarySize controls whether the total number of aligned reads is used for normalization (TRUE, the default), or only the number of reads that overlap peaks (FALSE). bSubControl will subtract the control reads (scaled, if that was specified in dba.count()).

It is worth reading the commentary regarding the wisdom of doing this. The danger is that the differential analysis methods assume they are working with true, "raw" read counts, and subtracting the control reads alters the count distribution (the actual effect of this alteration is not well studied). The benefit is that there may be some places where a high control signal can be used to eliminate what would otherwise become a false positive.

We are now recommending using the GreyListChIP package to generate "blacklists" from the control tracks, and to filter overlapping reads prior to peak calling and read counting.
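As an illustrative sketch (the myDBA object name follows the earlier examples in this thread, and the dba.count()/dba.contrast() steps are assumptions about a typical workflow), the two parameters might be set explicitly like this:

```r
library(DiffBind)

# myDBA: a DBA object that has already been through dba.count()
# (object and category names are illustrative)
myDBA <- dba.contrast(myDBA, categories = DBA_CONDITION)

# bFullLibrarySize = TRUE (default): normalize against total aligned reads
# bSubControl = FALSE: do not subtract control reads from the counts;
#                      rely on blacklist/greylist filtering instead
myDBA <- dba.analyze(myDBA,
                     method           = DBA_DESEQ2,
                     bFullLibrarySize = TRUE,
                     bSubControl      = FALSE)
```

This leaves the raw counts untouched, which keeps the count distribution in the form the differential methods expect.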

Calculating fold changes between ChIP and control, in my experience, doesn't yield very useful numbers because there are generally so many fewer reads in the control, and the fold change becomes overly sensitive to differences of one or two reads.

Given the large number of "peaks" you have, and the nature of MeDIP, it may make more sense to try something like Gordon's csaw package for this analysis, which uses a windowing approach to identify enriched regions and skips the separate peak-calling step.

Cheers-

Rory


Hi Rory,

Thanks for your suggestion about using GreyListChIP to handle false positive signals. I'm now trying this approach.

I've got some questions about applying the greylist.

 

If doing the filtering before peak calling:

1) How do I do the filtering?

2) Do you know if this step is compatible with MACS2? Or is there no need to apply the greylist when using MACS2 (since MACS2 considers the input signal during peak calling)?

 

If doing the filtering after peak calling:

1) To prepare files (.bed & .bam) for DiffBind, what is the procedure? (Should I just remove the peaks that overlap the greylist from the bed file, or do I also need to do something with the bam file, since DiffBind mainly uses the bam files for its calculations?)

 

Thanks a lot! 

Yours Sincerely,

Kylie

— modified 4 weeks ago • written 4 weeks ago by kyliecode

Hi,

I recommend filtering before peak calling. Peak callers such as MACS2 do use the input to filter out peaks that are also present in the input, but they do not cope particularly well in regions where the input is especially noisy, and tend to call a lot of spurious peaks. (That's sort of the whole reason for GreyListChIP.) A package like, say, bedtools ( https://github.com/arq5x/bedtools2 ) can be used to do the filtering.
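For example, reads overlapping greylisted regions can be dropped with bedtools intersect -v before peak calling (file names here are illustrative):

```shell
# Remove reads that overlap greylisted regions from both ChIP and input BAMs
# (ChIP.bam, Input.bam, and greylist.bed are illustrative file names)
bedtools intersect -v -abam ChIP.bam  -b greylist.bed > ChIP.filtered.bam
bedtools intersect -v -abam Input.bam -b greylist.bed > Input.filtered.bam
```

The filtered BAMs are then what you pass to the peak caller and, later, to DiffBind.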

— written 4 weeks ago by Gord Brown
florian.noack wrote (20 months ago):

In other words, you recommend not using the subtraction method, but instead filtering out peaks that fall in the mentioned blacklist and setting bSubControl=FALSE?

About the csaw package: I am not sure a windowed approach is useful in my case, since I have quite "sharp", well-separated peaks (mostly around 125-250 bp) rather than a broad pileup of reads over a huge region. The pileups also get detected quite nicely by MACS2.


Optimally, you would blacklist prior to peak calling, as removing the reads from the problematic regions helps the accuracy of the modelling used to make the peak calls. Once the blacklist is applied (either before or after peak calling), you can set bSubControl=FALSE.

-R

— written 19 months ago by Rory Stark
florian.noack wrote (20 months ago):

Thanks a lot for the fast response from both of you. The samples I am comparing are lineage-related cell types, and I already know from other analyses that only a few sites show differential methylation (3k out of 600k called peaks in the other approach I used), so I think edgeR will be the better approach. However, I will use both tools and compare them visually plus with the Venn diagram.

However, I have another question: with the DBA_SCORE_TMM_MINUS_EFFECTIVE option (bScaleControl has to be set to TRUE to use this option, right?) it is possible to subtract the background (input control) signal from the sample signal. In my naive biological view this seems okay, since we do something similar in other kinds of analysis (Western blot analysis, imaging, ...). But I have read quite a few comments where people warn against subtracting the normalized background signal in ChIP experiments, so I am a bit confused. In the other approach I used, I calculated a fold change between the normalized read number of the sample and the normalized read number of the input (control).

 

Thanks again, you're a big help :).

Rory Stark (CRUK, Cambridge, UK) wrote (18 days ago):

While there are indeed some theoretical issues with altering the read counts by subtracting the Input reads from the ChIP reads, empirically we have found that doing so represents a fairly conservative use of the Input to limit false positives.

A more theoretically sound method is to use blacklists and greylists to simply eliminate regions where the input control shows an unusually high signal. These should be applied prior to peak calling, if there is a peak-calling step. The blacklists can be obtained from UCSC or ENCODE, and the GreyListChIP package can be used to build a custom blacklist based on your inputs.
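A minimal GreyListChIP sketch, following the workflow in the package vignette (file names, genome, and parameter values are illustrative, not prescriptive):

```r
library(GreyListChIP)

# Build a greylist from an input/control BAM
# ("hg38.chrom.sizes" and "Input.bam" are illustrative file names)
gl <- new("GreyList", karyoFile = "hg38.chrom.sizes")  # chromosome sizes
gl <- countReads(gl, "Input.bam")                      # count reads in tiled windows
gl <- calcThreshold(gl, reps = 100,                    # bootstrap a count threshold
                    sampleSize = 30000, p = 0.99)
gl <- makeGreyList(gl)                                 # call high-signal regions
export(gl, con = "greylist.bed")                       # write regions to BED
```

The resulting BED file can then be used to filter reads (or peaks) before the DiffBind analysis.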

It is not a requirement that the bScaleControl option in dba.count() be set to TRUE, but it is certainly recommended if you set bSubControl=TRUE in a subsequent call to dba.analyze(). Setting score=DBA_SCORE_TMM_MINUS_EFFECTIVE in dba.count() actually has no effect on how counts are calculated and normalized for the analysis. These scores are only used for plots of global data. When you call dba.analyze(), the options included there are used to adjust the read counts (specifically, bSubControl=TRUE). 

Cheers-

Rory
