I am working with an RNA-seq data set that contains a lot of 0s across the genes. When I do Deseq2 to look for differential gene expression, it finds several significantly dysregulated genes. However, when I look in the normalised count file, I am finding very little support for these genes across all my samples under one condition.
I have been investigating how the normalisation algorithm works, though I hit a bit of a stumbling block when it comes to 0s. In my understanding, the normalisation works as follows. Deseq2 normalisation divides counts by the sample-specific size factors. It generates this size-factor by the median ratio of gene counts relative to the mean per gene. Normalization works via the assumption that most of the genes are not DE and that the median ratio captures the scaling factor from the non-DE genes. This can also be summarised as the steps below.
- Step 1: creates a pseudo-reference sample (row-wise geometric mean)
- Step 2: calculates the ratio of each sample to the reference
- Step 3: calculate the normalization factor for each sample (size factor)
- Step 4: calculate normalised counts
I have 71 samples, across 6 conditions, and I do my process as following :
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~cond) dds <- collapseReplicates(dds, dds$ID2) ## have 3 wells per sample/patient so combining them dds <- estimateSizeFactors(dds) # filter genes where there are less than 5 samples with normalised counts greater or equal to 10 keep <- rowSums( counts(dds, normalized=TRUE) >= 10 ) >= 5 dds <- dds[keep,] dds <- DESeq(dds) res <- results(dds) c1_vs_c2 <- results(dds, contrast=c("cond", "c1", "c2"), independentFiltering=TRUE, alpha=0.05, pAdjustMethod="BH") c3_vs_c4 <- results(dds, contrast=c("cond", "c3", "c4"), independentFiltering=TRUE, alpha=0.05, pAdjustMethod="BH") c5_vs_c6 <- results(dds, contrast=c("cond", "c5", "c6"), independentFiltering=TRUE, alpha=0.05, pAdjustMethod="BH") ##lfc shrink c1_vs_c2_lfc <- lfcShrink(dds, contrast=c("cond", "c1", "c2"), res=c1_vs_c2) c3_vs_c4_lfc <- lfcShrink(dds, contrast=c("cond", "c3", "c4"), res=c3_vs_c4) c5_vs_c6_lfc <- lfcShrink(dds, contrast=c("cond", "c5", "c6"), res=c5_vs_c6)
When I view the counts, the normalised counts and on the sizeFactors, I have results and I get no errors. Even though every row contains at least 1 zero.
head(counts(dds)) c1_r1 c1_r2 c1_r3 c2_r1 c2_r2 c2_r3 A2M 2 0 4 5 25 5 0 1 0 AAAS 0 0 0 23 30 28 0 2 0 AAGAB 0 0 24 54 90 0 0 211 0 head(counts(dds, normalized=TRUE)) c1_r1 c1_r2 c1_r3 c2_r1 c2_r2 c2_r3 A2M 1.349375 0.00000 3.703930 1.2096526 9.425593 1.945083 AAAS 0.000000 0.00000 0.000000 5.5644018 11.310711 10.892464 AAGAB 0.000000 0.00000 22.223580 13.0642476 33.932133 0.000000 head(dds$sizeFactor) c1_r1 c1_r2 c1_r3 c2_r1 c2_r2 c2_r3 1.4821678 0.9565306 1.0799340 4.1334183 2.6523531 2.5705847
In a geometric mean typically one of three options occur:
- 1 is added to each value in the set and then 1 is subtracted from the result.
- Blank and 0 values are ignored in the calculation
- 0s are converted to 1 for the calculation
I am at a loss on how the normalisation factor is handling zeros. I have been crazy enough to do the maths on my raw data to see if I can reproduce the normalised counts using all three methods but so far I haven't.
How does Deseq2 handle 0s in the input counts? I would appreciate any help on this matter.