Hello,
I have obtained Level 3 gene-level transcription estimates (log2(x+1)-transformed RSEM normalized counts) from the Xena browser. I would like to plot a hierarchical clustering heatmap using the top 30% most variable genes. I have a few questions about this.
1) Do I need to normalize the data again?
2) How should I apply filtering steps to reduce the number of genes for clustering?
If I do need to normalize the data again, is the code below correct? Assume a matrix "h" with genes as rows and samples as columns. The matrix has about 20,000 genes.
library(limma)
y <- normalizeQuantiles(h) # quantile normalization across samples
# keep genes with more than 10 counts, i.e. log2(x+1) > log2(11), in at least 14 samples
keep <- rowSums(y > log2(11)) >= 14
table(keep)
# keep
# FALSE  TRUE
#  3624 16906
y2 <- y[keep,]
library("genefilter")
vars <- apply(y2, 1, IQR) # per-gene interquartile range, used as the variability measure
f2 <- y2[vars > quantile(vars, 0.7), ] # select the top 30% most variable genes
With these steps I end up with around 5,000 genes. Is this approach appropriate for unsupervised clustering?
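To make the intended downstream step concrete, here is a minimal sketch of the clustering and heatmap on simulated data, not on the real matrix. The name "h_sim", the simulated values, the correlation-based distance, and the average linkage are all assumptions chosen for illustration; they are not taken from the Xena data or from any recommended pipeline. It uses only base R.

```r
# Minimal sketch on simulated data; "h_sim" is a hypothetical stand-in for "h".
set.seed(1)
n_genes <- 200
n_samples <- 20
h_sim <- matrix(rnorm(n_genes * n_samples, mean = 6, sd = 1),
                nrow = n_genes,
                dimnames = list(paste0("gene", seq_len(n_genes)),
                                paste0("sample", seq_len(n_samples))))
# Shift the first 40 genes in the first 10 samples so there is structure to find.
h_sim[1:40, 1:10] <- h_sim[1:40, 1:10] + 3

# Select the top 30% most variable genes by IQR, as in the code above.
iqr <- apply(h_sim, 1, IQR)
top <- h_sim[iqr > quantile(iqr, 0.7), ]

# Centre and scale each gene (row) so clustering reflects the expression
# pattern across samples rather than the absolute expression level.
z <- t(scale(t(top)))

# Assumed choices: 1 - Pearson correlation as distance, average linkage.
gene_tree <- hclust(as.dist(1 - cor(t(z))), method = "average")
sample_tree <- hclust(as.dist(1 - cor(z)), method = "average")

# Base-R heatmap with the precomputed dendrograms; scale = "none"
# because the rows are already z-scores.
heatmap(z,
        Rowv = as.dendrogram(gene_tree),
        Colv = as.dendrogram(sample_tree),
        scale = "none")
```

In my case "f2" would take the place of "top". I am not sure whether the distance and linkage choices here are the best ones, so please treat them as placeholders.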
Thank you