Dear Bio Communities,
Now I'm try using RSEM expected count data to perform GSVA analysis. But this kind of count data doesn't the integer count, which is usually calculated by HT-Seq et al. As the author of GSVA suggested (Why Negative value from GSVA), the row count data of RNA-Seq calculated by HT-Seq or STAR should be normalized by logCPM, which is then fed into GSVA function. Should I do the same procedures for the expected count data? Any suggestions would be greatly appreciation!
dge <- DGEList(expected_count)
dge.norm <- calcNormFactors(dge)
normLogCPMs <- cpm(dge.norm, log=TRUE, normalized.lib.sizes=TRUE)
gsva_es <- gsva(normLogCPMs, gset.idx.list=hallmark)
Thanks for your reply sir!
Thanks Gordon! Indeed, GSVA should be then also fine.
Which criteria for filtering the genes with very low count, edgeR::filterByExpr or just rowMeans(expected_count) > 1 [not rowMeans(CPM) > 1]. And IMO, only mRNA should be used to performed GSVA analysis instead of lncRNA or any other gene type, is that correct? Thanks very much sir!
The help page of
filterByExpr()
explains how to use that function to filter out lowly-expressed genes and particularly which of its parameters allow you to tune it towards a more stringent or lenient filtering strategy. The help page also points you out to the article by Chen et al. (2016), whose section entitled "Filtering to remove low counts" should allow you to fully understand howfilterByExpr()
operates under the hood and what the parameter values mean. I personally like to look at the distribution of logCPM units of expression averaged per gene, i.e.:which should show you two modes, one on the left corresponding to lowly-expressed genes and one on the right corresponding to the expressed ones. In general, you want to keep the genes that produce the mode on the right. Looking at the plot you can decide a
cutoff
and build a logical mask on those average expression values:or use
filterByExpr()
to fine tune the filtering strategy. Regarding what genes/transcripts (rows in your expression data matrix) should you use with GSVA, this is only relevant with respect to what gene sets are you going to use. If your gene sets do not have any lncRNA annotated to them, then they will be discarded from the GSVA calculations. What gene sets do you give as input is as important as what expression values you're giving. Its important that your gene sets relevant to the biology of what you are studying and that the most relevant ones include a minimum number of reliably expressed genes.Thanks so much sir! It's a great help to me!
I also wonder whether some kinds of gene type, e.g. lncRNA, are okay even the geneset doesn't have these kinds of type. IMO, the unrelevant gene types are not affect the results of GSVA. Is that correct sir?
The overall number of rows in your gene expression data matrix, i.e., the overall number of genes/transcripts, has an effect for methods such as GSVA or ssGSEA since they are what we may call competitive methods, because they calculate their score based on differences in expression ranks between genes inside and outside the gene set (as opposed to self contained methods that only use information of genes inside the gene set, such is in the z-score method). A few genes more or less outside gene sets will have a negligible impact, specially for small gene sets, but if you want assess the impact of lncRNAs in your calculations just do them twice, once with them included and other time excluding them, and check whether the results change. In any case, the filtering for lowly-expressed genes is the best way to have an educated guess at what genes should be kept in the analysis.
Got that sir! Thanks for your detailed reply!