Normalized data from DESeq2 look suspicious
tonja.r (United Kingdom)

I ran into some complications while analyzing/normalizing my data, and it would be kind of you to help me.

We have ChIP-seq data for 5 histone marks from treated and untreated mice. Each mark has two biological replicates, plus one technical replicate of one of the biological replicates. Before performing any differential testing, I wanted to check whether normalization with the size factors from DESeq2 improves my data or reveals possible experimental errors. I also intend to use the counts normalized by size factors outside of DESeq2 for differential analysis, and the VST- and rlog-transformed data as input for PEER.
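
For context, the commands I am using are roughly the following (a minimal sketch; count_matrix, sample_table, and the columns condition, biorep, and lane are placeholders for my actual objects):

library(DESeq2)

# Build the dataset; the design is a placeholder for the real comparison
dds <- DESeqDataSetFromMatrix(countData = count_matrix,
                              colData   = sample_table,
                              design    = ~ condition)

# Sum the technical replicate into its biological replicate rather than
# treating it as an independent sample
dds <- collapseReplicates(dds, groupby = dds$biorep, run = dds$lane)

# Median-of-ratios size factors and normalized counts
dds <- estimateSizeFactors(dds)
norm_counts <- counts(dds, normalized = TRUE)

# Transformed values intended for EDA and as input for PEER
vsd <- vst(dds, blind = TRUE)     # variance-stabilizing transformation
rld <- rlog(dds, blind = TRUE)    # regularized log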

[Figure: density plots of the raw counts and of the square root of the counts for histone K4me3]

After normalization with the sizeFactors:
cds <- estimateSizeFactors(dds)         # median-of-ratios size factors
o   <- counts(cds, normalized = TRUE)   # counts divided by the size factors
[Figure: density plots of the normalized counts (first panel) and of the square root of the normalized counts (second panel)]

VST and rlog transformations:

[Figure: density plots of the VST- and rlog-transformed counts]
It seems that something strange is happening here. The same plots for the other datasets also show strange-looking curves after normalization.

Ryan C. Thompson, Icahn School of Medicine at Mount Sinai:

Normalization of ChIP-seq samples is a complicated question. The correct normalization to use depends on how you did your counts, how you did your experiment, what questions you want to ask, the nature of the histone marks being studied, and the consistency of the antibodies used for the IP. The csaw package's user guide has an entire section on ChIP-seq normalization, which you should read in its entirety before continuing. Broadly speaking, though, histone mark data generally have a bimodal distribution: a low-abundance peak consisting of counts for unmarked regions, and a higher-abundance peak consisting of counts in regions carrying the histone mark, and the relationship between the two peaks depends on many factors. Because of this, methods for estimating size factors in RNA-seq may not be appropriate for ChIP-seq. (It may also be difficult to normalize different histone marks together, depending on how similar their binding profiles are.)
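
As a rough sketch of the kind of background-based normalization the csaw guide describes (bam.files here is a hypothetical vector of BAM paths, and the bin width and read filter are placeholders you would adapt):

library(csaw)
library(edgeR)

param  <- readParam(minq = 20)              # placeholder read-quality filter
# Count reads in large genomic bins to capture background coverage
binned <- windowCounts(bam.files, bin = TRUE, width = 10000, param = param)

# TMM normalization factors computed on the background bins; these can
# behave very differently from RNA-seq-style size factors estimated on
# peak or window counts
dge <- asDGEList(binned)
dge <- calcNormFactors(dge)
dge$samples$norm.factors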

Once you have chosen suitable size factors, though, both the VST and rlog transformations are probably reasonable, as Michael Love has explained.


Yeah, to echo Ryan, you should take a look at Aaron and Gordon's workflow, which covers normalization extensively:

http://master.bioconductor.org/help/workflows/chipseqDB/

also published on F1000:

http://f1000research.com/articles/4-1080/v1


Just a small question that I am still not sure about: why would I choose the VST or rlog transformations rather than just taking the data normalized with suitable size factors, if my goal is to find DE genes (not using DESeq2)?


In DESeq2 we use the transformations for making EDA plots, such as PCA, hierarchical clustering, etc.

If you are using some other software, that software will probably have its own internal correction for sequencing depth.

It's up to you to read the manual of that other software or ask the maintainer of that software what is expected as input: raw counts, counts normalized for sequencing depth, or some other expected input.
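
For example, a quick sketch of what we use the transformations for (assuming dds is your DESeqDataSet and condition is a column of its colData):

vsd <- vst(dds, blind = TRUE)

# PCA on the transformed values
plotPCA(vsd, intgroup = "condition")

# Hierarchical clustering on sample-to-sample distances
plot(hclust(dist(t(assay(vsd)))))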

Michael Love (@mikelove):

I'm not sure exactly what you are referring to as strange, but here are my thoughts:

The VST and rlog are similar to log2(K_ij / s_j). You can read the DESeq2 paper to see what kind of model the rlog is built on; it works well for many RNA-seq experiments, but we didn't try it on ChIP-seq experiments. The VST is a more general transformation, which estimates overdispersion and then applies a transformation that should approximately stabilize the variance for the null "genes". You can read the DESeq paper for more details on this.

From looking at these plots, I would go with the VST or log2(normalized counts + pseudocount), but in general it is hard to say for a given experiment what is noise and what is signal, especially without replicates.
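
One way to compare the options on your own data is a mean-SD plot of each transformation (a sketch; meanSdPlot comes from the vsn package):

library(vsn)   # for meanSdPlot

ntd <- normTransform(dds)      # log2(normalized counts + 1)
vsd <- vst(dds, blind = TRUE)

meanSdPlot(assay(ntd))   # shifted log: SD often inflated at low counts
meanSdPlot(assay(vsd))   # VST: SD should be roughly flat across the mean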

You could also try quantile normalization if you want the distributions to align.
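
For instance (a sketch using limma; norm_counts stands for the size-factor-normalized count matrix):

library(limma)

# Quantile-normalize the log-scale values so per-sample distributions match
logq <- normalizeBetweenArrays(log2(norm_counts + 1), method = "quantile")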


What I mean is that after normalization the density curves should look quite similar (as in the plot of the normalized counts here), but the plots under "After normalization with the sizeFactors:" show some samples as outliers, and they are also not similar across samples that we know may carry a batch effect (the 1s and 2s were processed together, whereas the 3s were done later). So either the normalization does not suit this kind of data, or there are problems with the data itself, and I am not sure whether I can use the size-factor-normalized data for downstream analysis that does not involve DESeq2.
The manual also mentions that the VST and rlog transformations are not suitable for DE analysis, only for exploratory tools such as PCA or other machine-learning methods, because the variance no longer depends on the mean. Why are they a poor fit for DE analysis?


It's hard to say what's the right normalization for a given dataset. Best is to look at boxplots, densities, PCA, etc. and convince yourself which one makes sense to you.

re: why DESeq() for DE testing: it's just that the negative binomial test usually has more power than a naive row-wise test (e.g. t-test) of log(x + pc) transformed data. See slide 22 in these teaching materials.
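
To make the contrast concrete, a sketch (condition is assumed to be a two-level factor in colData(dds); the t-test route is shown only for illustration):

library(DESeq2)
library(genefilter)   # for rowttests

# Negative binomial GLM test on the raw counts
dds <- DESeq(dds)
res <- results(dds)

# Naive alternative: row-wise t-tests on log2(normalized counts + 1);
# this ignores the count mean-variance relationship and, with only a
# couple of replicates per group, typically has much less power
logc <- log2(counts(dds, normalized = TRUE) + 1)
tt   <- rowttests(logc, dds$condition)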
