I'm performing a DE analysis between 2 groups of 3 replicates. Samples are in vitro-derived, expecting low within-group variance and very (very) high between group differences. Two different cell types are compared and are expected to be quite different.
The samples are exosomal small RNAs, which have a very skewed read count distribution. Quite more than typical small RNA-Seq, which have a few miRNAs hogging all the reads. Now, the top players are fold changes larger than e.g. something 10 places lower in the list.
The problem: all libraries are sequenced with rather similar depth.
Running DESeq2 and extracting normalized counts, the resulting total counts per sample vastly differ.
Code and numbers:
colSums(DATA) 9894485 11149046 10372395 7993503 8691350 6914504
First 3 samples are Cell Type 1, next 3 are Cell Type 2. Cell Type 1 seems to have more initial read counts but the differences are not dramatic.
So we get: 9.9M 11.1M 10.4M vs 8M 8.7M 6.9M
Going the DESeq2 way now:
dds <- DESeq2::DESeqDataSetFromMatrix(countData = DATA, colData = setup, design = ~ Celltype) normalized.counts<-as.data.frame(DESeq2::counts(dds, normalized=TRUE)) colSums(normalized.counts) 5592529 5557342 5692559 4376354 33643295 18068891
5.6M 5.6M 5.7M vs 4.4M 33.6M 18M
and is not what I would've expected
I tried to manually select normalization method but the results are similar (or worse).
poscounts 5716077 5571737 5865363 4693899 32937169 18947475
iterate 5253843 5173743 5365296 4229576 42754599 20749122
Turning the original counts into reads per million and doing some simple comparisons, I see completely the opposite results vs the DESeq2 way. In rpm, group1 has many up-regulated transcripts, whereas in DESeq2 results the lion's share goes to group 2.
I believe this is an artifact of the normalization process.
** Edited **
Based on Mike's suggestion, I'm adding further information about the samples here:
Number of features > 5
5311 5316 5308 5098 4022 4603