DESeq2 normalization resulting in very different Total read counts for samples
1
0
Entering edit mode
MRM • 0
@mrm-22859
Last seen 17 months ago

I am using DESeq2 (R 3.5.2, DESeq2 1.22.2) on RNA-seq read counts to find DE genes between two conditions, each with a few samples. I used these commands:

dds <- DESeqDataSetFromMatrix(countData = rc_data, colData = colData,
                              design = ~ TimePoint + PhenotypeTag)

dds_tmp <- dds

unnorm_cnt_colSums <- colSums(counts(dds_tmp))
unnorm_cnt_colSums_mean_std <- c(mean(unnorm_cnt_colSums), sd(unnorm_cnt_colSums))

dds_tmp <- estimateSizeFactors(dds_tmp)
cnt_norm <- counts(dds_tmp, normalized = TRUE)

norm_cnt_colSums <- colSums(cnt_norm)
norm_cnt_colSums_mean_std <- c(mean(norm_cnt_colSums), sd(norm_cnt_colSums))

result <- list(unnorm_cnt_colSums = unnorm_cnt_colSums,
               unnorm_cnt_colSums_mean_std = unnorm_cnt_colSums_mean_std,
               dds_tmp = dds_tmp,
               norm_cnt_colSums = norm_cnt_colSums,
               norm_cnt_colSums_mean_std = norm_cnt_colSums_mean_std)

Besides the standard DESeq2 commands above, I computed the total reads (colSums) for each sample, and also the mean and sd of total reads across samples, before and after normalization. I was surprised to see that even after normalization, the colSums varied a lot across the samples.

$unnorm_cnt_colSums
E01T2_ind25_S17_001          E23T2_ind3         E39T2_ind25          E39T3_ind1              G13T2B              G21T2B
           17957634            13496855            15947932            11386670            18747079            13435778

$unnorm_cnt_colSums_mean_std
[1] 15161991  2873733

$norm_cnt_colSums
E01T2_ind25_S17_001          E23T2_ind3         E39T2_ind25          E39T3_ind1              G13T2B              G21T2B
           13237059            17014225            13498446            16324754            15379562            13529330

$norm_cnt_colSums_mean_std
[1] 14830563  1631526


Has anyone observed such variability after DESeq2 normalization, or am I making an error somewhere?

Thanks a lot.

deseq2 normalization • 339 views
0
Entering edit mode
@mikelove
Last seen 2 hours ago
United States

This isn’t a problem: while TPM would give zero variance in the column sums, median-ratio size factors minimize sequencing-depth variation over all genes. Rather than looking at the sums, check the MA plots.
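A minimal sketch of that check, assuming the `dds` object built in the question above (the DESeq2 pipeline and Bioconductor being installed):

```r
library(DESeq2)

# Fit the model and pull the default results table, then inspect the MA plot:
# if normalization is behaving, the cloud of non-DE genes centers on log ratio 0.
dds <- DESeq(dds)
res <- results(dds)
plotMA(res, ylim = c(-3, 3))
```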

0
Entering edit mode

Thanks, Michael.

I plotted all four diagnostic plots suggested in the beginner's guide. It turned out this subset had only one sample for one condition, so the MA plot didn't look good; I then tried another set with several samples for each of the two conditions. The MA plot looks fine, but the total normalized read count still varies (although the variance in total normalized counts is reduced compared to the un-normalized totals), e.g.:

$unnorm_cnt_colSums_mean_std
[1] 16213653  5467635

$dds_tmp
class: DESeqDataSet
dim: 33212 55

$norm_cnt_colSums_mean_std
[1] 15071131  2002666

In other projects, we generally got just 1-5% variation in total read count after DESeq2 normalization. Hence, we could compute the standard log2 fold change externally from the normalized read counts (without the shrinkage that may heavily affect low-count genes; I do exclude genes whose maximum raw read count across all samples is < 5 before running DESeq2). Now that total normalized read counts differ substantially across samples, I suppose I can no longer do that and should instead use the log2FC provided by results(dds), which will be affected by shrinkage. Is there a way to avoid shrinkage, and is that reasonable, or might it mess up the statistics and result in erroneous interpretation of the DESeq2 results (e.g., the number of DE genes)? Related to this, is it OK to apply DESeq2 to TPM? I suppose that would interfere with the count-distribution assumption. Thanks a lot.

0
Entering edit mode

You probably have a subset of samples with some huge counts for a subset of gene(s), which affects the total count but not median-ratio normalization (this is good, and is why median-ratio normalization is robust to large DE for a few genes, while total count is not robust). We have a tximport pipeline for using abundances and numbers of mapped reads to generate count matrices for count-based tools. These are well-tested methods and software, so you could take that approach.

0
Entering edit mode

Dear Michael:

Thanks again for the insights. I think the mapping/counting program I used (OSA) is fine (of course, I can try another one), but as you suggested, from a hist of the raw read counts I do see some genes with very high counts, e.g., output of hist:

$breaks
[1]      0  50000 100000 150000 200000 250000 300000 350000 400000 450000 500000 550000 600000 650000

$counts
[1] 58372     4     1     0     1     0     0     1     1     0     0     0     1


I will dig deeper into it and see whether such genes are the same across all samples. Maybe just getting rid of them, assuming they are not important genes for me, will resolve it. Thanks a lot.

0
Entering edit mode

Again, note that DESeq2’s normalization is robust to these genes; they only affect the column-sum statistics you are computing. So as long as you don’t worry about the column sums, you don’t need to modify the data.

0
Entering edit mode

Thanks. The only reason I was considering removing such outlier genes is so that I can compute the log2 fold change externally (from normalized read counts) without shrinkage. Is there a way to switch off shrinkage when computing log2FoldChange? I would like to find out how much the L2FC with and without shrinkage differ, especially for genes with read counts > 10 or 50.

0
Entering edit mode

DESeq2 does not use shrinkage when you run DESeq() and then results(); see the vignette for details. Shrinkage was moved out into a separate function (lfcShrink) a while back. So no matter what you do with DESeq2, a few outlier genes shouldn't affect the others.
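The two estimates can be put side by side: results() returns the unshrunken MLE fold changes, and lfcShrink() applies shrinkage as a separate, opt-in step. A sketch, assuming a fitted `dds`; the coefficient name "PhenotypeTag_B_vs_A" is hypothetical and should be replaced by an entry from resultsNames(dds):

```r
library(DESeq2)

resultsNames(dds)                        # list the fitted coefficients
res_mle <- results(dds)                  # unshrunken (MLE) log2 fold changes
res_shr <- lfcShrink(dds,
                     coef = "PhenotypeTag_B_vs_A",  # hypothetical name
                     type = "apeglm")    # shrunken log2 fold changes

# Compare the two, e.g. restricted to genes above a mean-count threshold
keep <- res_mle$baseMean > 10
plot(res_mle$log2FoldChange[keep], res_shr$log2FoldChange[keep],
     xlab = "MLE log2FC", ylab = "shrunken log2FC")
abline(0, 1, lty = 2)
```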

0
Entering edit mode

Thanks, Michael. Great.

0
Entering edit mode

Dear Michael:

For my dataset, I checked the L2FC obtained from DESeq2 against values calculated manually [e.g., log2(rowMeans(samples in group2) + smallnum) - log2(rowMeans(samples in group1) + smallnum)] from the DESeq2-normalized read counts. For some comparisons there is good agreement between the two, but not for others. For some of these, I get messages like "replacing outliers and refitting for 3689 genes" (out of about 23000 genes in total), so I tried passing minReplicatesForReplace = Inf; no change. Does DESeq2's log2FoldChange use rlog-transformed values? Then I looked at plotPCA and saw one or two potential outlier samples. I reran after removing them; there is some change, but the difference between the two L2FCs is still quite large (e.g., -0.703214491 (DESeq2) vs. -0.609022789 (manual)). My size factors are also very different from 1, e.g., 1.9, 0.3, or 0.12 for some samples. Are these taken into account when calculating log2FoldChange?
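For reference, the bracketed manual calculation can be sketched in base R on a toy normalized-count matrix (the matrix, the `group1`/`group2` column indices, and the pseudocount `smallnum = 0.5` are all illustrative assumptions, not values from this dataset):

```r
# Toy normalized-count matrix: 4 genes x 6 samples
norm_cnt <- matrix(c( 10,  12,  11,  40,  44,  42,
                     100, 110,  90, 100,  95, 105,
                       0,   1,   0,   8,   6,   7,
                      50,  55,  45,  50,  52,  48),
                   nrow = 4, byrow = TRUE)
group1 <- 1:3   # columns of condition 1
group2 <- 4:6   # columns of condition 2
smallnum <- 0.5 # pseudocount so log2() never sees zero

# Manual L2FC: log-ratio of (pseudocounted) group means, per gene
manual_l2fc <- log2(rowMeans(norm_cnt[, group2]) + smallnum) -
               log2(rowMeans(norm_cnt[, group1]) + smallnum)
round(manual_l2fc, 2)
```

Note that DESeq2's reported log2FoldChange comes from a GLM fit over all terms in the design, so it need not match this ratio of means, particularly when the design contains additional covariates.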

Further, I could not get a grasp of what quantity becomes about the same across samples after normalizing. I see that the geometric mean of the size factors is ~1, but how does that normalize? In my case even the column medians of the normalized read counts vary quite a bit (e.g., 19.55221 16.15844 12.83976 13.61203 ...; for this example the size factors are 1.4320629 1.4234050 1.1682461 1.4692885 1.1978780 ...). What might be going on?

Thanks.

MRM

0
Entering edit mode

"Does DESeq2 log2FoldChange use rlog transformed values"

Nope, these are different approaches; see the paper or the description of the transformations in the vignette.

"what quantity becomes about the same across the samples after normalizing"

See the original DESeq paper from 2010 for a description of the size-factor method. Briefly, it sets the size factor to the median of the gene-wise ratios of the sample's counts against a reference sample.
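The method can be sketched in a few lines of base R on a toy count matrix (this mirrors what estimateSizeFactors() computes, ignoring its handling of zero-count genes; the counts are invented for illustration):

```r
cnt <- matrix(c( 10,  20,  30,
                100, 200, 300,
                  5,  10,  15),
              nrow = 3, byrow = TRUE)   # 3 genes x 3 samples

# Pseudo-reference sample: per-gene geometric mean across samples (log space)
log_geo_means <- rowMeans(log(cnt))

# Size factor = median of each sample's gene-wise ratios to the reference
size_factors <- apply(cnt, 2, function(s) exp(median(log(s) - log_geo_means)))

# Analogue of counts(dds, normalized = TRUE): divide each column by its factor
norm_cnt <- sweep(cnt, 2, size_factors, "/")
```

In this toy example every gene has the same ratio within a sample, so the normalized columns come out identical; with real data the *median* ratio is equalized, while column sums and medians can still differ, which is exactly the behavior observed above.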

0
Entering edit mode

Thanks. I went over the paper; it seems like I am computing the L2FC correctly. The difference is surprising.

0
Entering edit mode

The log2FoldChange from the GLM does not in general equal what you are calculating; in particular, your estimator doesn't adjust for other covariates.

0
Entering edit mode

Thanks. It is just that in a few other projects I have seen great agreement between the two L2FCs, and the size factors have been close to 1. In this data set they vary a lot. So, as suggested in the discussion above, this could be due to the nature of the data itself.