Question

DEseq2 - Use normalized DEseq2 counts to plot most expressed genes per group

0

Entering edit mode

Bstn132 • 0

@bstn132-9487

Last seen 8.3 years ago

Boston

I’m working on RNAseq data using the DEseq2 R package. Most of the analysis will be differential expression between 2 (or 3) groups of samples with at least 3 biological replicates.

Before displaying the differential expression of certain groups of genes, I would like to plot a heatmap of the, for example 50, most expressed genes in group 1 and showing the expression of those genes in the other group(s). There are multiple ways to normalize the gene counts per sample (like % of reads), however to account for differences in total mapped reads between samples I use the default method for normalization in the DEseq2 package (the estimateSizeFactors function with default arguments, where a size-factor per sample is calculated as the median of the ratios: gene-count/geometric mean of that gene, if I understand correctly).

This seems to give robust results and if I adjust the counts for a couple of samples by multiplying or dividing the gene counts for a few of my samples by a factor of 10, the normalized gene counts remain similar.

My question is, what would be the best way to describe these values. For example these are the normalized results for the top2 genes:

	Group1-sample1	Group1-sample2	Group1-sample3	Group2-sample1	Group2-sample2	Group2-sample3
Gene1	46774.604	51353.7355	64484.2654	617.716637	1074.286180	662.753292
Gene2	26206.402	32438.9899	43765.2866	37326.099729	23284.689905	23229.030093

Total mapped reads per sample (after removing low-count genes):

	Group1-sample1	Group1-sample2	Group1-sample3	Group2-sample1	Group2-sample2	Group2-sample3
Mapped reads	4061637	4445092	3618525	3159137	3143315	3884281

It is obvious that the difference between group 1 and 2 for Gene1 is much larger than for gene2. But when showing (or heatmap plotting) this, the exact interpretation of the values remains a challenge. I can describe it as “read-counts normalized for total mapped reads per sample”. However, I am afraid that people will make those exact values more important than they really are.. Alternatively, I could use the normalized counts and divide them by the total mapped reads (or make it normalized reads per 1 million mapped reads) but that may undo the normalization if large differences between total mapped read counts between samples are present. Alternatively would be to divide by the colSums value for the normalized counts. A last option would be the log2(n+1) transformation but it will still be difficult to explain the value of that number.

How would you use and plot these values?

deseq2 R rnaseq • 12k views

ADD COMMENT • link updated 8.3 years ago by Michael Love 41k • written 8.3 years ago by Bstn132 • 0

score 6 · Accepted Answer · 2016-01-08

In our vignette, we use a transformation to variance stabilize the normalized counts, which helps to minimize the influence (visual and in calculating distances) of the sampling noise associated with low counts. I would call these "rlog normalized counts" or "variance stabilized normalized counts", which are typically close to log2( count / size factor) for large counts. See the vignette section on transformations and heatmaps.

If you prefer to use log2( count / size factor + pseudocount) (you can obtain these with the normTransform function) or just count/size factor (obtained with counts(dds, normalized=TRUE)), I would call these "(log transformed) normalized counts". If you want to say more than this, you can say that the samples with the lowest sequencing depth are scaled up and the samples with the highest sequencing depth are scaled down, so the final matrix of normalized counts contain values on the scale of the samples with sequencing depth in the middle.