DESeq2 counts distribution across both genes and samples
@igor

DESeq2 assumes a negative binomial distribution for the counts. That refers to the distribution of counts for a single gene across all samples.

What about the distribution of counts of all genes within one sample? Each sample is normalized based on the geometric mean of all counts (or size factor). That sounds like it does not take the shape of the distribution into account. For example, suppose two samples have the same number of reads, but one sample has many low and high counts while the other has only medium counts. The means would be the same, but many genes would differ between the two samples. Is that taken into account?

deseq2 rnaseq • 1.8k views
@mikelove

"That refers to the distribution of counts for a single gene across all samples."

Not exactly. The counts K_ij for gene i are not iid across samples j, because the mean mu_ij differs across samples. Even within the same group, the mu_ij are not equal because the size factors s_j are not equal. Take a look at the formula in the Materials and Methods section of the DESeq2 paper.
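
For reference, the model can be written roughly as follows. This is a sketch in the notation I recall from the DESeq2 paper, with i indexing genes and j indexing samples; alpha_i (gene-wise dispersion), q_ij (a quantity proportional to the true concentration of fragments), x_jr (design matrix entries) and beta_ir (coefficients) are symbols not mentioned above, so check them against the paper.

$$
K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i), \qquad
\mu_{ij} = s_j \, q_{ij}, \qquad
\log_2 q_{ij} = \sum_r x_{jr} \, \beta_{ir}
$$

Because s_j enters the mean directly, two samples from the same group share q_ij but generally have different mu_ij.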

"Each sample is normalized based on the geometric mean of all counts (or size factor)."

Again, not exactly. The size factor for a sample is the median, over genes, of the ratio of that sample's counts to a pseudo-reference sample. The pseudo-reference is created by taking, for each gene (row), the geometric mean of its counts across all samples. Take a look at Eq 5 in the original DESeq paper.
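
The median-of-ratios idea can be sketched in a few lines of plain Python. This is only an illustration, not the DESeq2 code: the function name `size_factors` and the toy matrix are made up here, and skipping genes with a zero count in any sample is an assumption of this sketch.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch).

    counts: list of rows, one row per gene, one column per sample.
    Genes with a zero count in any sample are skipped (assumption of
    this sketch), because their geometric mean would be zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(k <= 0 for k in row):
            continue
        # log of the pseudo-reference value for this gene:
        # the geometric mean of the gene's counts across samples
        log_geo_mean = sum(math.log(k) for k in row) / len(row)
        for j, k in enumerate(row):
            log_ratios[j].append(math.log(k) - log_geo_mean)
    # size factor = median ratio to the pseudo-reference, per sample
    return [math.exp(median(r)) for r in log_ratios]


# Toy data: sample 2 is sequenced exactly twice as deeply as sample 1.
toy = [[10, 20], [100, 200], [1000, 2000], [5, 10]]
sf = size_factors(toy)
print(sf, sf[1] / sf[0])
```

On the toy matrix the ratio of the two size factors recovers the twofold depth difference. Since the median is taken over gene-wise ratios, it is the typical ratio that matters, not the sample mean or total.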

I'm not sure what the concern is here. A way to think about the size factor estimation is that, if you plot two samples in a scatter plot, such that each point is a gene, we are looking for a size factor vector such that the ratio of the size factors for these two samples gives the slope of the line going through the points.
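
To make the scatter plot picture concrete, here is a small two-sample sketch in plain Python. The counts are invented for illustration. With only two samples, the ratio of median-of-ratios size factors reduces to the median of the gene-wise ratios, which acts as a robust slope through the origin.

```python
import math
from statistics import median

# Invented counts for five genes in two samples. The first four genes
# differ by a factor of 2 (sequencing depth); the last gene is strongly
# differentially expressed (factor of 20).
sample_a = [10, 50, 200, 1000, 30]
sample_b = [20, 100, 400, 2000, 600]

# Gene-wise log ratios; their median gives the robust slope.
log_ratios = [math.log(b / a) for a, b in zip(sample_a, sample_b)]
slope = math.exp(median(log_ratios))

# Naive alternative: ratio of total counts, pulled up by the outlier gene.
total_ratio = sum(sample_b) / sum(sample_a)

print(slope, total_ratio)
```

The median-based slope stays at the depth difference shared by most genes, while the ratio of totals is inflated by the single high-ratio gene.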