I am using DESeq2 (R version 3.5.2, DESeq2_1.22.2) on RNA-Seq read counts to find DE genes between two conditions, with a few samples each. I used these commands:
[
dds <- DESeqDataSetFromMatrix(countData = rc_data, colData = colData,
                              design = ~ TimePoint + PhenotypeTag)

# Total reads (colSums) per sample before normalization
dds_tmp <- dds
unnorm_cnt_colSums <- colSums(counts(dds_tmp))
unnorm_cnt_colSums_mean_std <- c(mean(unnorm_cnt_colSums), sd(unnorm_cnt_colSums))

# ... and after median-of-ratios normalization
dds_tmp <- estimateSizeFactors(dds_tmp)
cnt_norm <- counts(dds_tmp, normalized = TRUE)
norm_cnt_colSums <- colSums(cnt_norm)
norm_cnt_colSums_mean_std <- c(mean(norm_cnt_colSums), sd(norm_cnt_colSums))

result <- list(unnorm_cnt_colSums = unnorm_cnt_colSums,
               unnorm_cnt_colSums_mean_std = unnorm_cnt_colSums_mean_std,
               dds_tmp = dds_tmp,
               norm_cnt_colSums = norm_cnt_colSums,
               norm_cnt_colSums_mean_std = norm_cnt_colSums_mean_std)
]
Above, besides the standard DESeq2 commands, I computed the total reads (colSums) for each sample, and the mean and sd of these totals across samples, before and after normalization. I was surprised to see that even after normalization, the colSums varied a lot across samples:
$unnorm_cnt_colSums
E01T2_ind25_S17_001 E23T2_ind3 E39T2_ind25 E39T3_ind1 G13T2B
17957634 13496855 15947932 11386670 18747079
G21T2B
13435778
$unnorm_cnt_colSums_mean_std
[1] 15161991 2873733
$norm_cnt_colSums
E01T2_ind25_S17_001 E23T2_ind3 E39T2_ind25 E39T3_ind1 G13T2B
13237059 17014225 13498446 16324754 15379562
G21T2B
13529330
$norm_cnt_colSums_mean_std
[1] 14830563 1631526
Has anyone observed such variability after DESeq2 normalization, or might I just be making an error somewhere?
Thanks a lot.
Thanks, Michael.
I plotted all four diagnostic plots suggested in the beginner's guide. It turns out this subset had only one sample for one condition, so the MA plot didn't look nice, but I tried another set with several samples for each of the two conditions. That MA plot looks fine, but the total normalized read count still varies (although the variance in total normalized counts is reduced compared to the un-normalized totals), e.g.,
In other projects, we generally saw only 1-5% variation in total read count after DESeq2 normalization. Hence, we could compute the standard log2 fold change externally from the normalized read counts (without the shrinkage that can heavily affect low-count genes; I do exclude genes whose maximum raw count across all samples is < 5 before running DESeq2). Now that the total normalized read counts differ substantially across samples, I suppose I cannot do that anymore, and I should go by the log2FC provided in results(dds), which will be affected by shrinkage. Is there a way to avoid shrinkage, and is that reasonable, or might it mess up the statistics and possibly result in erroneous interpretation of the DESeq2 results (e.g., the number of DE genes)?
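For concreteness, the external calculation described above can be sketched in base R. The matrix, group columns, and pseudocount below are made-up illustrations, not the actual data:

```r
# Manual log2 fold change from a matrix of normalized counts (genes x samples).
# `group1_cols`, `group2_cols`, and `pseudo` are illustrative choices.
manual_log2fc <- function(cnt_norm, group1_cols, group2_cols, pseudo = 1) {
  m1 <- rowMeans(cnt_norm[, group1_cols, drop = FALSE])
  m2 <- rowMeans(cnt_norm[, group2_cols, drop = FALSE])
  log2(m2 + pseudo) - log2(m1 + pseudo)
}

# Tiny made-up example: geneA roughly doubles, geneB stays flat
cnt <- matrix(c(10, 12, 20, 22,
                50, 48, 52, 49), nrow = 2, byrow = TRUE,
              dimnames = list(c("geneA", "geneB"), paste0("s", 1:4)))
manual_log2fc(cnt, group1_cols = 1:2, group2_cols = 3:4)  # geneA ~0.87, geneB ~0.04
```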
Related to this, is it OK to apply DESeq2 to TPM values? I suppose that would interfere with the Poisson distribution assumption. Thanks a lot.
You probably have a subset of samples with some huge counts for a few genes, which affects the total count but not the median-ratio normalization (this is good: it is why median-ratio normalization is robust to large DE in a few genes, while total count is not).
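A toy illustration of this point, with made-up numbers: a single very large count inflates one column's total by roughly 2x, but leaves the median-of-ratios size factor untouched:

```r
# 1000 hypothetical genes at the same level in both samples, plus one gene
# with a huge count in sample 2 only.
cnt <- cbind(s1 = c(rep(100, 1000), 100),
             s2 = c(rep(100, 1000), 100000))

colSums(cnt)   # totals differ ~2x because of the single outlier gene

# Median-of-ratios size factors, computed by hand:
log_ref <- rowMeans(log(cnt))                  # log geometric mean per gene
sf <- apply(cnt, 2, function(col) exp(median(log(col) - log_ref)))
sf             # both are 1: the outlier does not move the median
```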
We have a tximport pipeline for using abundances and the number of mapped reads to generate count matrices for count-based tools. These are well-tested methods and software, so you could take that approach.
Dear Michael:
Thanks again for the insights. I think the mapping/counting program I used (OSA) is fine (of course, I can try another one), but as you suggested, from a hist() of the raw read counts I do see some genes with very high counts (e.g., in the output of hist).
I will dig deeper into it and see whether such genes are the same across all samples. Maybe just removing them, assuming they are not genes of interest to me, will resolve it. Thanks a lot.
Again, note that DESeq2's normalization is robust to these genes; they only affect the column-sum statistics you are computing. So as long as you don't worry about those, you don't need to modify the data.
Thanks. The only reason I was considering removing such outlier genes is so that I can compute the log2 fold change externally (from normalized read counts) without shrinkage. Is there a way to switch off shrinkage when computing log2FoldChange? I would like to see how much the L2FC with and without shrinkage differ, especially for genes with read counts > 10 or 50.
DESeq2 does not use shrinkage when you run DESeq() and then results(); see the vignette for details. Shrinkage was moved out into a separate function (lfcShrink) a while back. So no matter what you do with DESeq2, a few outlier genes shouldn't affect the others.
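To make the current behavior concrete: in recent DESeq2 versions (1.16 and later), results() reports the unshrunken MLE log2 fold changes, and shrinkage is applied only if you call lfcShrink() separately. A sketch on a simulated dataset (the coefficient name and shrinkage type below are one common choice, not the only one):

```r
library(DESeq2)

# Simulated dataset purely to illustrate the API
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)
dds <- DESeq(dds)

# Unshrunken (MLE) LFCs -- the default output of results()
res_mle <- results(dds, name = "condition_B_vs_A")

# Shrunken LFCs come from the separate lfcShrink() function
# (type = "apeglm" needs the apeglm package; type = "normal" also works)
res_shr <- lfcShrink(dds, coef = "condition_B_vs_A", type = "apeglm")

# Compare the two estimates, e.g. for genes with baseMean > 10
keep <- !is.na(res_mle$padj) & res_mle$baseMean > 10
summary(res_mle$log2FoldChange[keep] - res_shr$log2FoldChange[keep])
```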
Thanks, Michael. Great.
Dear Michael:
For my dataset, I checked the L2FC obtained from DESeq2 against values calculated manually from the DESeq2-normalized read counts [e.g., log2(rowMeans(group2 samples) + smallnum) - log2(rowMeans(group1 samples) + smallnum)]. For some comparisons there is good agreement between the two, but not for others. For some of these I get messages like "replacing outliers and refitting for 3689 genes" (out of about 23000 genes in total), so I tried passing minReplicatesForReplace=Inf, with no change. Does DESeq2's log2FoldChange use rlog-transformed values? I also looked at plotPCA and saw one or two potential outlier samples; rerunning after removing them changes things somewhat, but the difference between the two L2FCs is still considerable (e.g., -0.703214491 from DESeq2 vs. -0.609022789 manual). My size factors are also very different from 1 (e.g., 1.9, 0.3, or 0.12 for some samples). Are these taken into account when calculating log2FoldChange?
Further, I could not get a grasp of what quantity becomes approximately the same across samples after normalization. I see geom_mean(sizeFactor) ~ 1, but how does that normalize anything? In my case even the colMedians of the normalized read counts vary quite a bit (e.g., 19.55221 16.15844 12.83976 13.61203...; for this example the size factors are 1.4320629 1.4234050 1.1682461 1.4692885 1.1978780...). What might be going on?
Thanks.
MRM
"Does DESeq2 log2FoldChange use rlog transformed values"
Nope, these are different approaches; see the paper or the description of the transformations in the vignette.
"what quantity becomes about the same across the samples after normalizing"
See the original DESeq paper from 2010 for a description of the size-factor method. Briefly, it sets each sample's size factor to the median of the ratios of that sample's counts to a reference (the per-gene geometric mean across samples).
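A minimal base-R transcription of that method on toy Poisson counts (in DESeq2 itself this is what estimateSizeFactors() does). It also shows which quantity normalization equalizes: each sample's median log-ratio to the reference, not the column sums or medians:

```r
# Median-of-ratios size factors, following the 2010 DESeq paper:
# reference = per-gene geometric mean across samples; size factor = median
# over genes of the sample/reference ratio (genes with zeros excluded).
median_ratio_sf <- function(cnt) {
  pos <- rowSums(cnt == 0) == 0           # genes counted in every sample
  log_ref <- rowMeans(log(cnt[pos, ]))    # log geometric-mean reference
  apply(cnt[pos, ], 2, function(col) exp(median(log(col) - log_ref)))
}

# Toy data: 4 samples at different sequencing depths
set.seed(7)
cnt <- matrix(rpois(4000, rep(c(50, 100, 200, 400), each = 1000)), ncol = 4)
sf   <- median_ratio_sf(cnt)
norm <- sweep(cnt, 2, sf, "/")

# After dividing by the size factors, each sample's median log-ratio to the
# reference is zero (up to floating point) -- colSums/colMedians need not match.
pos <- rowSums(cnt == 0) == 0
log_ref <- rowMeans(log(cnt[pos, ]))
apply(norm[pos, ], 2, function(col) median(log(col) - log_ref))
```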
Thanks. I went over the paper; it seems I am computing the L2FC correctly, so the difference is surprising.
The log2FoldChange from the GLM does not in general equal what you are calculating; in particular, your estimator doesn't adjust for other covariates.
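A tiny made-up example of that point: with a batch covariate in the design (analogous to the ~ TimePoint + PhenotypeTag design above), the GLM coefficient for condition differs from the naive log2 ratio of group means whenever the design is unbalanced, even when the model fits perfectly:

```r
# Hypothetical counts: condition B is truly 4x condition A, batch 2 is 2x
# batch 1, and the design is unbalanced (B is seen more often in batch 2).
d <- data.frame(cond  = factor(c("A", "A", "B", "B", "B")),
                batch = factor(c("1", "2", "1", "2", "2")),
                y     = c(10, 20, 40, 80, 80))

# GLM with the batch covariate recovers the true 4x effect: log2FC ~2
fit <- glm(y ~ batch + cond, family = poisson, data = d)
unname(coef(fit)["condB"] / log(2))

# Naive ratio of group means ignores batch and is biased upward here: ~2.15
log2(mean(d$y[d$cond == "B"]) / mean(d$y[d$cond == "A"]))
```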
Thanks. It is just that in a few other projects I have seen great agreement between the two L2FCs, and the size factors have been close to 1. In this dataset they vary a lot, so, as suggested in the discussion of normalization, this could be due to the nature of the data itself.