Search
Question: DEseq2 - Use normalized DEseq2 counts to plot most expressed genes per group
0
gravatar for Bstn132
2.8 years ago by
Bstn1320
Boston
Bstn1320 wrote:

I’m working on RNAseq data using the DEseq2 R package. Most of the analysis will be differential expression between 2 (or 3) groups of samples with at least 3 biological replicates.

Before displaying the differential expression of certain groups of genes, I would like to plot a heatmap of the, for example 50, most expressed genes in group 1 and showing the expression of those genes in the other group(s). There are multiple ways to normalize the gene counts per sample (like % of reads), however to account for differences in total mapped reads between samples I use the default method for normalization in the DEseq2 package (the estimateSizeFactors function with default arguments, where a size-factor per sample is calculated as the median of the ratios: gene-count/geometric mean of that gene, if I understand correctly). 

This seems to give robust results and if I adjust the counts for a couple of samples by multiplying or dividing the gene counts for a few of my samples by a factor of 10, the normalized gene counts remain similar. 

My question is, what would be the best way to describe these values. For example these are the normalized results for the top2 genes:

  Group1-sample1 Group1-sample2 Group1-sample3 Group2-sample1 Group2-sample2 Group2-sample3
Gene1 46774.604 51353.7355 64484.2654 617.716637 1074.286180 662.753292
Gene2 26206.402 32438.9899 43765.2866 37326.099729 23284.689905 23229.030093

Total mapped reads per sample (after removing low-count genes):

  Group1-sample1 Group1-sample2 Group1-sample3 Group2-sample1 Group2-sample2 Group2-sample3
Mapped reads 4061637 4445092 3618525 3159137 3143315 3884281

 

It is obvious that the difference between group 1 and 2 for Gene1 is much larger than for gene2. But when showing (or heatmap plotting) this, the exact interpretation of the values remains a challenge. I can describe it as “read-counts normalized for total mapped reads per sample”. However, I am afraid that people will make those exact values more important than they really are.. Alternatively, I could use the normalized counts and divide them by the total mapped reads (or make it normalized reads per 1 million mapped reads) but that may undo the normalization if large differences between total mapped read counts between samples are present. Alternatively would be to divide by the colSums value for the normalized counts. A last option would be the log2(n+1) transformation but it will still be difficult to explain the value of that number.

How would you use and plot these values?

ADD COMMENTlink modified 2.8 years ago by Michael Love20k • written 2.8 years ago by Bstn1320
3
gravatar for Michael Love
2.8 years ago by
Michael Love20k
United States
Michael Love20k wrote:

In our vignette, we use a transformation to variance stabilize the normalized counts, which helps to minimize the influence (visual and in calculating distances) of the sampling noise associated with low counts. I would call these "rlog normalized counts" or "variance stabilized normalized counts", which are typically close to log2( count / size factor) for large counts. See the vignette section on transformations and heatmaps.

If you prefer to use log2( count / size factor + pseudocount) (you can obtain these with the normTransform function) or just count/size factor (obtained with counts(dds, normalized=TRUE)), I would call these "(log transformed) normalized counts". If you want to say more than this, you can say that the samples with the lowest sequencing depth are scaled up and the samples with the highest sequencing depth are scaled down, so the final matrix of normalized counts contain values on the scale of the samples with sequencing depth in the middle.

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Michael Love20k

Thank you so much. I had read those sections, but after re-reading them and reading the DEseq2 paper it made more sense. For clustering of samples and displaying the highest expressed genes I will use the rlog transformed counts.

ADD REPLYlink written 2.8 years ago by Bstn1320
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 233 users visited in the last hour