Deseq2's geometric mean calculation and 0s
1
0
Entering edit mode
lmrogers34 • 0
@lmrogers34-19639
Last seen 4 days ago
Germany

Hi,

I am working with an RNA-seq data set that contains a lot of 0s across the genes. When I do Deseq2 to look for differential gene expression, it finds several significantly dysregulated genes. However, when I look in the normalised count file, I am finding very little support for these genes across all my samples under one condition.

I have been investigating how the normalisation algorithm works, though I hit a bit of a stumbling block when it comes to 0s. In my understanding, the normalisation works as follows. Deseq2 normalisation divides counts by the sample-specific size factors. It generates this size-factor by the median ratio of gene counts relative to the mean per gene. Normalization works via the assumption that most of the genes are not DE and that the median ratio captures the scaling factor from the non-DE genes. This can also be summarised as the steps below.

Overview :

• Step 1: creates a pseudo-reference sample (row-wise geometric mean)
• Step 2: calculates the ratio of each sample to the reference
• Step 3: calculate the normalization factor for each sample (size factor)
• Step 4: calculate normalised counts

I have 71 samples, across 6 conditions, and I do my process as following :

    dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~cond)
dds <- collapseReplicates(dds, dds$ID2) ## have 3 wells per sample/patient so combining them dds <- estimateSizeFactors(dds) # filter genes where there are less than 5 samples with normalised counts greater or equal to 10 keep <- rowSums( counts(dds, normalized=TRUE) >= 10 ) >= 5 dds <- dds[keep,] dds <- DESeq(dds) res <- results(dds) c1_vs_c2 <- results(dds, contrast=c("cond", "c1", "c2"), independentFiltering=TRUE, alpha=0.05, pAdjustMethod="BH") c3_vs_c4 <- results(dds, contrast=c("cond", "c3", "c4"), independentFiltering=TRUE, alpha=0.05, pAdjustMethod="BH") c5_vs_c6 <- results(dds, contrast=c("cond", "c5", "c6"), independentFiltering=TRUE, alpha=0.05, pAdjustMethod="BH") ##lfc shrink c1_vs_c2_lfc <- lfcShrink(dds, contrast=c("cond", "c1", "c2"), res=c1_vs_c2) c3_vs_c4_lfc <- lfcShrink(dds, contrast=c("cond", "c3", "c4"), res=c3_vs_c4) c5_vs_c6_lfc <- lfcShrink(dds, contrast=c("cond", "c5", "c6"), res=c5_vs_c6)  When I view the counts, the normalised counts and on the sizeFactors, I have results and I get no errors. Even though every row contains at least 1 zero. head(counts(dds)) c1_r1 c1_r2 c1_r3 c2_r1 c2_r2 c2_r3 A2M 2 0 4 5 25 5 0 1 0 AAAS 0 0 0 23 30 28 0 2 0 AAGAB 0 0 24 54 90 0 0 211 0 head(counts(dds, normalized=TRUE)) c1_r1 c1_r2 c1_r3 c2_r1 c2_r2 c2_r3 A2M 1.349375 0.00000 3.703930 1.2096526 9.425593 1.945083 AAAS 0.000000 0.00000 0.000000 5.5644018 11.310711 10.892464 AAGAB 0.000000 0.00000 22.223580 13.0642476 33.932133 0.000000 head(dds$sizeFactor)
c1_r1    c1_r2    c1_r3     c2_r1     c2_r2     c2_r3
1.4821678 0.9565306 1.0799340 4.1334183 2.6523531 2.5705847


In a geometric mean typically one of three options occur:

• 1 is added to each value in the set and then 1 is subtracted from the result.
• Blank and 0 values are ignored in the calculation
• 0s are converted to 1 for the calculation

I am at a loss on how the normalisation factor is handling zeros. I have been crazy enough to do the maths on my raw data to see if I can reproduce the normalised counts using all three methods but so far I haven't.

How does Deseq2 handle 0s in the input counts? I would appreciate any help on this matter.

deseq2 normalization • 526 views
2
Entering edit mode
@mikelove
Last seen 3 days ago
United States

Genes with a zero are not included in the typical median ratio method, e.g. Anders and Huber 2010, which is used by default in DESeq().

This method may break down if you have very few genes without a zero. But this is not typical for the bulk RNA-seq datasets that the median ratio normalization method is designed for.

0
Entering edit mode

thanks for getting back to me so quickly.
I did read the paper but I clearly am too tired on a Friday evening (where I am) to have properly understood the parts about 0. So Deseq2 ignored the zeros and calculates the geometric mean of the row without them?

It is a bit of an unusual dataset. It is not typically bulk but not single cell. It was data that was initially sorted with flow cytometry to look at pathogen-specific cells. Which is why I am trying to understand what effect this non-standard data might be having on the algorithim.

1
Entering edit mode

You may need an alternative size factor estimation method if there are no genes that don't have a zero. But I don't know what would be appropriate for your data. Perhaps look into sum factors from scran.

0
Entering edit mode

Sorry Double post.

Just to clarify, I don't have a single gene without at least one zero.