Hi,
I have performed an RNA-seq analysis (filtering, normalization, and the group comparisons of interest) using the edgeR package. I am also interested in producing an unsupervised hierarchical clustering heatmap (perhaps using ComplexHeatmap or coolmap). Here, unsupervised means using a list of genes that was not identified through group comparisons (i.e. not informed by the group labels). Typically, such lists are filtered on detection thresholds and on variance across samples (e.g. detected in at least 10% of samples, and the top 1000 most variable genes), or something along those lines. As a next step, I want to generate a heatmap.
I am not sure whether rowVars on the log2 CPM data would do this. Please let me know if there is a built-in method in edgeR, or an efficient way to identify the most variable genes.
Thank you,
Mohammed
Thank you very much, Gordon Smyth. Yes, I agree. I just want to include this as an additional step.
Maybe this should work for the following cases?
Case 1: Getting the top 1000 highly variable genes. These are the genes most variable across all samples, regardless of which samples they are.
Case 2: Detected in at least 10% of samples
Case 3: Filtered based on detection thresholds and variance across samples (e.g. detected in at least 10% of samples, and top 1000 most variable genes)
Do you know how I could write the code to include 1000 genes detected in at least 10% of samples here?
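As a rough illustration of what the three cases describe, here is a minimal base R sketch. It makes several assumptions that are not from the thread: the counts are simulated, "detected" is taken to mean CPM above 1, the log2 CPM values are computed by hand with a simple offset rather than by edgeR's cpm, and all object names are hypothetical.

```r
# Simulated raw counts stand in for real data (genes in rows, samples in columns).
set.seed(42)
counts <- matrix(rnbinom(3000 * 20, mu = 50, size = 2), nrow = 3000, ncol = 20,
                 dimnames = list(paste0("gene", 1:3000), paste0("S", 1:20)))

# Simple counts per million and log2 CPM. This is a simplification for the
# sketch; edgeR's cpm(y, log = TRUE) handles prior counts more carefully.
cpm_mat <- t(t(counts) / colSums(counts)) * 1e6
logCPM  <- log2(cpm_mat + 1)

# Case 2: genes detected in at least 10% of samples.
# Assumption: "detected" means CPM > 1.
detected <- rowMeans(cpm_mat > 1) >= 0.10

# Case 3: among detected genes, keep the 1000 with the largest variance.
logCPM_det <- logCPM[detected, , drop = FALSE]
v <- apply(logCPM_det, 1, var)                 # named vector of gene variances
n_top <- min(1000, nrow(logCPM_det))           # guard against small matrices
top_genes <- names(sort(v, decreasing = TRUE))[seq_len(n_top)]
case3 <- logCPM_det[top_genes, , drop = FALSE]
```

Case 1 alone corresponds to skipping the detection step and computing the variances on logCPM directly.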
The filtering based on detection thresholds is already done as part of the edgeR DE analysis, for example by filterByExpr. You simply run cpm on the filtered DGEList object to get the log2CPM values. There is no need to filter again.

Gordon Smyth, yes, I used filterByExpr > cpm. I understand that now I just have to run the code below, since I want to select highly variable genes. Am I right?

Your code doesn't look right to me. I think you're using order() where sort() would be more appropriate.
To order your cpm matrix by row variances:
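The code from the original post is not reproduced here, so the following is only a minimal sketch of one way to do it. It assumes a log2 CPM matrix with genes in rows; the matrix is simulated and the object names are hypothetical. In a real analysis the matrix would come from calling cpm with log = TRUE on the filtered DGEList. Note that sort() applied to the variances returns the sorted variances themselves, while order() returns the row indices needed to rearrange the matrix.

```r
# Simulated stand-in for a log2 CPM matrix (genes in rows, samples in columns).
set.seed(1)
logCPM <- matrix(rnorm(5000 * 12), nrow = 5000, ncol = 12,
                 dimnames = list(paste0("gene", 1:5000), paste0("S", 1:12)))

# Per-gene variance across all samples. matrixStats::rowVars(logCPM) computes
# the same values faster; apply() keeps this sketch free of dependencies.
v <- apply(logCPM, 1, var)

# Row indices from most to least variable, then reorder the matrix.
o <- order(v, decreasing = TRUE)
logCPM_sorted <- logCPM[o, , drop = FALSE]

# Keep the 1000 most variable genes (fewer if the matrix is smaller).
n_top <- min(1000, nrow(logCPM_sorted))
top <- logCPM_sorted[seq_len(n_top), , drop = FALSE]
```

The resulting top matrix could then be passed to a heatmap function such as coolmap or ComplexHeatmap's Heatmap.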