Question

DESeq2 normalized data with heatmap

0

linouhao ▴ 20

@linouhao-15901

Last seen 13 months ago

United States

The DESeq2 vignette (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) uses three types of transformed data to plot heatmaps:

ntd <- normTransform(dds)
vsd <- vst(dds, blind=FALSE)
rld <- rlog(dds, blind=FALSE)

Which of these is the right one to use? There is also another function:

normalized_counts <- counts(dds, normalized=TRUE)

How do I choose between them?

I also found a problem. I used DESeq2 to find significant genes, selected the most differentially expressed genes, and used the vsd data to plot a heatmap, but the plot obviously does not show high colour contrast between the two groups.

Can you help me? Thanks a lot. By the way, DESeq2 automatically discards low-expression genes when it performs the differential analysis, is that right?

deseq2 • 16k views

ADD COMMENT • link updated 2.9 years ago by Ian ▴ 10 • written 3.7 years ago by linouhao ▴ 20

score 5 · Answer 1 · 2020-08-25

5

Kevin Blighe ★ 3.9k

@kevin

Last seen 13 days ago

Republic of Ireland

In DESeq2, you should use vsd or rld for clustering and heatmap analysis, and anything else that is 'downstream' of the differential expression analysis (e.g. PCA). Pay close attention to data distributions, in this regard.

The differential expression analysis itself, i.e., the test statistics, can be regarded as being derived from the normalised 'counts'.

It is perfectly fine to pre-filter your vsd or rld data for your statistically significantly differentially expressed genes prior to performing clustering, in which case you are performing a supervised clustering analysis.
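
As a rough sketch of that supervised subsetting, assuming the DESeq2 package is installed: the data below are simulated with makeExampleDESeqDataSet(), and the 0.05 cut-off and the names sig_genes and vsd_sig are only illustrative choices, not anything taken from the original analysis.

```r
# Sketch only: simulated data stand in for a real DESeqDataSet.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  library(DESeq2)
  set.seed(1)

  # 5000 simulated genes, 8 samples, two conditions (A and B) with real differences
  dds <- makeExampleDESeqDataSet(n = 5000, m = 8, betaSD = 1)
  dds <- DESeq(dds)
  res <- results(dds)

  # Names of genes passing an (arbitrary, illustrative) adjusted p-value cut-off;
  # which() drops the NA values left by independent filtering
  sig_genes <- rownames(res)[which(res$padj < 0.05)]

  # Variance-stabilised values, subset to the significant genes only
  vsd <- vst(dds, blind = FALSE)
  vsd_sig <- assay(vsd)[sig_genes, , drop = FALSE]
  dim(vsd_sig)
}
```

The resulting vsd_sig matrix is what would then go into the clustering or heatmap function.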

Regarding filtering, DESeq2 will not remove any genes from your data. You can pre-filter your raw counts for low-count genes prior to normalisation, if you wish (and this is recommended). When you derive test statistics with results(), some genes may fail the 'independent filtering' step, and their adjusted p-values are set to NA. Please see:

"I also found a problem. I used DESeq2 to find significant genes, selected the most differentially expressed genes, and used the vsd data to plot a heatmap, but the plot obviously does not show high colour contrast between the two groups."

You may need to specify custom breaks, or additionally scale your input data to standardised / Z scores. Take a look around the posts on Biostars.
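
A minimal base-R sketch of both ideas, assuming a genes x samples matrix of transformed values. The toy matrix mat, the cap of 2 standard deviations and the blue/white/red palette are all made-up choices for illustration.

```r
# Toy stand-in for a genes x samples matrix such as assay(vsd); values are simulated.
set.seed(1)
mat <- matrix(rnorm(6 * 8, mean = 8, sd = 1), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("sample", 1:8)))
mat[1:3, 5:8] <- mat[1:3, 5:8] + 2   # genes 1-3 are higher in the second group

# Row-wise Z-scores: scale() works on columns, so transpose before and after
z <- t(scale(t(mat)))

# Symmetric breaks centred on 0; cap extreme values so they do not wash out the colours
limit <- 2
z_capped <- pmax(pmin(z, limit), -limit)
breaks <- seq(-limit, limit, length.out = 101)
cols <- colorRampPalette(c("blue", "white", "red"))(length(breaks) - 1)

# Draw only in an interactive session; scale = "none" because rows are already Z-scores
if (interactive()) {
  heatmap(z_capped, scale = "none", col = cols, breaks = breaks)
}
```

There must be exactly one more break than colours, which is why the palette is built from length(breaks) - 1.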

Kevin

ADD COMMENT • link 3.7 years ago Kevin Blighe ★ 3.9k

0

Thanks a lot. Pre-filtering does not seem to be needed, according to https://support.bioconductor.org/p/65256/

ADD REPLY • link 3.7 years ago linouhao ▴ 20

2

It is no major issue to pre-filter your raw data for genes of low counts. What Michael Love is implying is that DESeq2 has some inherent 'quality control' measures that will nevertheless deal with these (genes of low counts) when performing the differential expression analysis.
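
For illustration, a pre-filter of this kind might look like the following base-R sketch. The count matrix cts is simulated, and the thresholds (at least 10 reads in at least 3 samples) are just an assumed rule of thumb rather than a recommendation from this thread.

```r
set.seed(42)
# Hypothetical raw count matrix: 6 genes x 6 samples
cts <- matrix(rnbinom(36, mu = 50, size = 2), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("sample", 1:6)))
cts["gene1", ] <- c(0, 1, 0, 0, 2, 0)   # force one low-count gene

# Keep genes with at least 10 reads in at least `smallest_group` samples
smallest_group <- 3
keep <- rowSums(cts >= 10) >= smallest_group
cts_filtered <- cts[keep, , drop = FALSE]

# On a DESeqDataSet the same idea would read:
#   keep <- rowSums(counts(dds) >= 10) >= smallest_group
#   dds  <- dds[keep, ]
```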

ADD REPLY • link 3.7 years ago Kevin Blighe ★ 3.9k

0

Yes, you are totally right, but I have run into a strange problem. A gene shows a 32-fold change, but mean(assay(vsd)["NELL1", ][1:13]) is 5.625 and mean(assay(vsd)["NELL1", ][14:26]) is 7.73. Why does such a highly differential gene show so little difference in mean expression between the tumor and normal groups?

thanks a lot

ADD REPLY • link 3.7 years ago linouhao ▴ 20

2

When using results(), have you additionally performed fold-change shrinkage via lfcShrink()?

Regarding the vsd object, the variance-stabilised data is measured on a scale that is quite different from the normalised or raw counts; so, a direct comparison of the fold-change from results() to that derived from the values in vsd is not possible.
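
To make the scale difference concrete, here is a small worked example using the two group means quoted above. It assumes the VST values behave roughly like log2 values, which is only approximately true and is least accurate for low counts.

```r
# Group means on the VST scale, as quoted in the thread for NELL1
vst_group1 <- 5.625
vst_group2 <- 7.73

# Difference on the (approximately log2) VST scale, and the fold change it would imply
diff_log2  <- vst_group2 - vst_group1   # about 2.1
implied_fc <- 2^diff_log2               # about 4.3-fold

# For comparison, a 32-fold change corresponds to a log2 fold change of 5
lfc_32 <- log2(32)
```

So a difference of about 2 units on the VST scale is already a roughly 4-fold difference, and because the transformation compresses differences among low counts, it can sit well below the log2 fold change estimated from the counts.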

ADD REPLY • link 3.7 years ago Kevin Blighe ★ 3.9k

0

I do not know what you mean by shrinkage via lfcShrink(). Can you show me the code? I just did the following:

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ condition)
dds <- DESeq(dds)
res <- results(dds)

You said that vst and results() cannot be compared directly. But I selected the most differential genes and plotted them in a heatmap to show the difference; if the heatmap cannot show much difference, what is the use of it?

ADD REPLY • link 3.7 years ago linouhao ▴ 20

4

Regarding shrinkage: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#log-fold-change-shrinkage-for-visualization-and-ranking

Regarding the heatmap, we typically scale the data prior to generating the heatmap. Please take a look at my tutorial here (see 4 (a)): https://github.com/kevinblighe/E-MTAB-6141
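
For the shrinkage step, a hedged sketch along the lines of the vignette, assuming both DESeq2 and apeglm are installed. The data are simulated, and the coefficient name "condition_B_vs_A" comes from the simulated design, so it will differ for real data (check resultsNames(dds)).

```r
# Sketch only: runs on simulated data if DESeq2 and apeglm are available.
if (requireNamespace("DESeq2", quietly = TRUE) &&
    requireNamespace("apeglm", quietly = TRUE)) {
  library(DESeq2)
  set.seed(1)

  dds <- makeExampleDESeqDataSet(n = 1000, m = 6, betaSD = 1)  # conditions A and B
  dds <- DESeq(dds)
  res <- results(dds)

  # List the coefficients that can be shrunk
  resultsNames(dds)

  # Shrink the log2 fold changes for the B vs A comparison
  counts_before <- counts(dds)
  res_shrunk <- lfcShrink(dds, coef = "condition_B_vs_A", type = "apeglm")

  # The counts in dds are untouched; only the fold-change estimates differ
  identical(counts_before, counts(dds))
}
```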

ADD REPLY • link 3.7 years ago Kevin Blighe ★ 3.9k

0

vst seems to have scaled the data already. Do you mean using lfcShrink to select the top genes? That will not change the counts, am I right?

The heatmap tutorial is really useful. Can you give me an email address? I would like to send my data to you, if that is OK. Thanks a lot.

ADD REPLY • link 3.7 years ago linouhao ▴ 20

0

Hi, I cannot receive your data by email and do your work for you - sorry.

The methodology behind lfcShrink() is described in the link that I gave earlier. To give an answer, though: it does not alter the expression data; it only alters the fold-change estimates in the results.

ADD REPLY • link 3.7 years ago Kevin Blighe ★ 3.9k

0

Thanks a lot. But just the expression values change the plot a lot.

ADD REPLY • link 3.7 years ago linouhao ▴ 20

0

Hey Kevin Blighe, thanks for the helpful posts!

I noticed in your link to this heatmap analysis, you used scale() and not vst(). However, in your earlier reply (and in some other posts I've seen), you said you should use vsd for heatmaps and clustering analyses.

I've been wondering which is the correct methodology, or if using either scale() or vst() is fine. I tried both on my data and got a nice heatmap with well-defined clusters using scaled_data, where:

normalized_data <- subset(counts(dds, normalized=TRUE), rownames(counts(dds, normalized=TRUE)) %in% significant_gene_names)

scaled_data <- t(scale(t(normalized_data)))

and I got a less nice looking heatmap using vst_sig, where:

vst <- vst(dds, blind=FALSE)

vst <- assay(vst)

vst <- as.data.frame(vst)

vst_sig <- vst[rownames(vst) %in% significant_gene_names,]

Is it poor practice not to use the vst method? Is it okay to just use scale() as you did in your link? Thank you!

ADD REPLY • link 2.9 years ago Ian ▴ 10

2

Hey Ian, specifically just for the heatmap and/or the clustering, we can additionally scale and center the vsd or rld data.

The scale() function in R, by default, centers each column of its input (mean = 0) and divides it by the column standard deviation, i.e., it transforms the data to Z-scores; transposing before and after, as in t(scale(t(x))), applies this to each row (gene) instead. This just makes it easier for the human brain to interpret the heatmap colour gradients, whereby 0 is then just the mean expression, whereas, e.g., blue or yellow represent different standard deviations below and above that mean, respectively, with higher absolute number relating to higher intensity.

In your above code, I wouldn't do t(scale(t(normalized_data))). I would instead run scale on vst_sig; so:

heat <- t(scale(t(vst_sig)))

You can still just use vst_sig on its own, with no scaling anywhere, but it will be more difficult to set colours and breaks.
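
A tiny base-R illustration of what the t(scale(t(...))) idiom does, using a made-up 2 x 3 matrix in place of vst_sig:

```r
# Two hypothetical genes with the same pattern at very different expression levels
m <- matrix(c( 1,  2,  3,
              10, 20, 30),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

col_scaled <- scale(m)         # standardises each column (sample)
row_scaled <- t(scale(t(m)))   # standardises each row (gene)

# Both genes become -1, 0, 1 across the three samples
row_scaled["geneA", ]
row_scaled["geneB", ]
```

After row-wise scaling the two genes are directly comparable, which is why a single colour scale and one set of breaks works for the whole heatmap.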

Some do prefer to also use the counts, i.e., normalized_data; however, these are non-negative integer values (or double values if normalised) that are skewed toward 0 and follow a negative binomial distribution. If you run hist() on normalized_data, vst, and t(scale(t(vst))), you'll see how wildly different the distributions are.
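
The comparison can be mimicked in base R with simulated data. Note the assumptions: the counts are drawn from a negative binomial with arbitrary parameters, and log2(x + 1) is used only as a crude stand-in for the variance-stabilising transformation, not as the real vst().

```r
set.seed(123)
# Simulated count matrix: 200 genes x 10 samples, negative binomial, strongly skewed
counts_sim <- matrix(rnbinom(200 * 10, mu = 100, size = 0.5), nrow = 200)

log_sim <- log2(counts_sim + 1)   # crude stand-in for vst(); NOT the actual VST
z_sim   <- t(scale(t(log_sim)))   # row-wise Z-scores of the transformed values

# Numerical summaries of the three distributions
summary(as.vector(counts_sim))
summary(as.vector(log_sim))
summary(as.vector(z_sim))

# Histograms, drawn only in an interactive session
if (interactive()) {
  op <- par(mfrow = c(1, 3))
  hist(counts_sim, main = "counts")
  hist(log_sim, main = "log2(counts + 1)")
  hist(z_sim, main = "row Z-scores")
  par(op)
}
```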

In the case where you use normalized_data, the colour scheme would usually be white for 0, then a gradual increasing gradient toward dark red, dark purple, dark yellow, etc.