Hello,
I need advice on the RNA-seq analysis results that I did using DESeq2. I have 4 experiments each with 3 biological replicates. At first I analyzed them all together. I put in the batch effect in the DESeq2 model.
PCA plot of all the experiments together:
and the raw p-values distribution:
Since p-values distribution for group 3 vs. group 2 is "hill-shaped", and since group 2 and group 3 clustering together separately from the other two groups, I decided to analyze group 2 and 3 separately.
The PCA shows two outliers, one for each group. However, even if I removed the outliers, the batch effect still contributed to the first PCA. So, I decided to not remove the outliers and run SVA to estimate the surrogate variables.
I used SVA to estimate two surrogate variables. These are the plot of the two SVA variables against the batch effect.
Although only the first SVA strongly correlate with the batch effect, I decided to include both variables in DESeq2 model. Not sure if I should just include 1 sva variable. So, deseq2 design is set as follows:
DESeq2Table.sva$SV1 <- svseq$sv[,1] DESeq2Table.sva$SV2 <- svseq$sv[,2] design(DESeq2Table.sva) <- ~ SV1 + SV2 + group
My p-value distribution after applying sva is still hill-shaped.
So, I went ahead and applied fdrtool. This is my p-value distribution after fdrtool.
The p-value distribution now looks good.
But, the thing is because I thought the estimated dispersion does not exactly follow a normal distribution (is it supposed to follow a normal distribution?), I decided to try fittype='local' instead of fittype='parametric' that was used in the figures below. Bear in mind, the following figures are done on the analysis using all samples from 4 experiments.
dispersion plots using fittype="parametric"
and these are the dispersion plots using fittype="local"
The median absolute log residual (i.e. median(abs(log(dispGeneEst)-log(dispFit))) for the parametric is 13.93 and for local is 13.37. So, it does show that local fit is slightly better.
And now all the p-values distribution look good, including the group 3 vs. group 2 that was hill-shaped. Note, I did not apply fdrtool for this one.
So, now I'm confused as to which method I should use.
As a comparison these are the results using the different methods.
All samples using parametric
Test vs Ref | min baseMean | # down | # up | # total |
---|---|---|---|---|
group 2 vs. group 1 | 10.15 | 2650 | 1905 | 4555 |
group 3 vs. group 1 | 4.5 | 3042 | 2472 | 5514 |
group 4 vs. group 1 | 2.38 | 3281 | 2765 | 6046 |
group 3 vs. group 2 | 5.8 | 44 | 83 | 127 |
group 4 vs. group 2 | 5.42 | 2178 | 2225 | 4403 |
group 3 vs. group 4 | 8.13 | 1775 | 1769 | 3544 |
group 3 vs. group2 with SVA and fdrtool adjustment
Test vs Ref |
min baseMean | # down |
# up |
# total |
group 3 vs. group 2 | 10.22 | 231 | 261 | 492 |
All samples with local fit
Test vs Ref | min baseMean | # down | # up | # total |
---|---|---|---|---|
group 2 vs. group 1 | 3.27 | 3062 | 2308 | 5370 |
group 3 vs. group 1 | 2.3 | 3453 | 2987 | 6440 |
group 4 vs. group 1 | 1.74 | 3738 | 3304 | 7042 |
group 3 vs. group 2 | 7.58 | 85 | 113 | 198 |
group 4 vs. group 2 | 3.25 | 2681 | 2668 | 5349 |
group 3 vs. group 4 | 4.5 | 2233 | 2153 | 4386 |
As you can see, especially for group 3 vs. group 2, the number of differential genes varies a lot using different methods (parametric or local or sva + fdrtool). Now I'm not sure which gene list I should use for downstream analysis.
Your feedback is highly appreciated. Please let me know if you need other figures.
Thank you in advance for your time.