Question

Closed: Unsupervised clustering heatmap of gene expression data

Biologist ▴ 110

@biologist-9801

Hello,

I have obtained Level 3 gene-level transcription estimates from the Xena browser, given as log2(x+1)-transformed RSEM normalized counts. I would like to plot a hierarchical clustering heatmap using the top 30% most variable genes. I have a few questions about this.

1) Do I need to normalize the data again?

2) How should I apply filtering steps to reduce the number of genes for clustering?

If I do need to normalize the data again, is the code below right? Assume a matrix "h" with genes as rows and samples as columns. The matrix has about 20,000 genes.

library(limma)
y <- normalizeQuantiles(h)  # quantile normalization across samples

# keep genes with more than 10 counts (log2(11) on the log2(x+1) scale) in at least 14 samples
keep <- rowSums(y > log2(11)) >= 14
table(keep)
# keep
# FALSE  TRUE
#  3624 16906

y2 <- y[keep,]

library("genefilter")
vars <- apply(y2, 1, IQR)
f2 <- y2[vars > quantile(vars, 0.7), ]  # select the top 30% most variable genes (by IQR)

I ended up with around 5,000 genes after these steps. Does this approach look right for unsupervised clustering?
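
For context, here is a minimal sketch of the clustering step that would follow the filtering. It runs on simulated data standing in for the real matrix "h" and uses base R only, without the quantile normalization step. The row z-scoring, the 1 - Pearson correlation distance, and the average linkage are assumptions chosen for illustration, not settings confirmed as appropriate for this dataset.

```r
# Simulated stand-in for the real matrix "h" (genes x samples, log2(x+1) scale).
set.seed(1)
n_genes <- 2000
n_samples <- 40
h <- matrix(rnorm(n_genes * n_samples, mean = 6, sd = 2),
            nrow = n_genes,
            dimnames = list(paste0("gene", seq_len(n_genes)),
                            paste0("sample", seq_len(n_samples))))
h[h < 0] <- 0  # log2(x + 1) values cannot be negative

# Expression filter: more than 10 counts (log2(11) on this scale) in >= 14 samples.
keep <- rowSums(h > log2(11)) >= 14
h2 <- h[keep, , drop = FALSE]

# Top 30% most variable genes by IQR.
vars <- apply(h2, 1, IQR)
top <- h2[vars > quantile(vars, 0.7), , drop = FALSE]

# Row-wise z-scores so genes are compared on pattern rather than absolute level.
z <- t(scale(t(top)))

# Hierarchical clustering (assumed settings): 1 - Pearson correlation, average linkage.
sample_hc <- hclust(as.dist(1 - cor(z)), method = "average")
gene_hc <- hclust(as.dist(1 - cor(t(z))), method = "average")

# Heatmap with the precomputed dendrograms; rows are already scaled, labels suppressed.
heatmap(z,
        Rowv = as.dendrogram(gene_hc),
        Colv = as.dendrogram(sample_hc),
        scale = "none",
        labRow = rep("", nrow(z)))
```

With the real data, "h" (or the normalized "y") would replace the simulated matrix, and the filtering thresholds would stay as in the code above.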

Thank you

hierarchical clustering heatmap rsem geneexpression • 28 views

6.8 years ago Biologist ▴ 110