Question

GSVA perfomed on RSEM Expected Count data

0

Entering edit mode

Yang Shi ▴ 10

@ea61ff7a

Last seen 13 months ago

Zheng Zhou

Dear Bio Communities,

Now I'm try using RSEM expected count data to perform GSVA analysis. But this kind of count data doesn't the integer count, which is usually calculated by HT-Seq et al. As the author of GSVA suggested (Why Negative value from GSVA), the row count data of RNA-Seq calculated by HT-Seq or STAR should be normalized by logCPM, which is then fed into GSVA function. Should I do the same procedures for the expected count data? Any suggestions would be greatly appreciation!

dge <- DGEList(expected_count)
dge.norm <- calcNormFactors(dge)
normLogCPMs <- cpm(dge.norm, log=TRUE, normalized.lib.sizes=TRUE)
gsva_es <- gsva(normLogCPMs, gset.idx.list=hallmark)

GSVA RNA-Seq RSEM Expected_Count • 3.2k views

ADD COMMENT • link 3.5 years ago Yang Shi ▴ 10

score 3 · Accepted Answer · 2022-07-26

3

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 37 minutes ago

WEHI, Melbourne, Australia

All edgeR functions including calcNormFactors and cpm work fine with non-integer genewise expected counts from RSEM. Hence any downstream analysis will also be fine.

ADD COMMENT • link 3.5 years ago Gordon Smyth 53k

0

Entering edit mode

Thanks for your reply sir!

ADD REPLY • link 3.5 years ago Yang Shi ▴ 10

0

Entering edit mode

Thanks Gordon! Indeed, GSVA should be then also fine.

ADD REPLY • link 3.5 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Which criteria for filtering the genes with very low count, edgeR::filterByExpr or just rowMeans(expected_count) > 1 [not rowMeans(CPM) > 1]. And IMO, only mRNA should be used to performed GSVA analysis instead of lncRNA or any other gene type, is that correct? Thanks very much sir!

ADD REPLY • link 3.5 years ago Yang Shi ▴ 10

1

Entering edit mode

The help page of filterByExpr() explains how to use that function to filter out lowly-expressed genes and particularly which of its parameters allow you to tune it towards a more stringent or lenient filtering strategy. The help page also points you out to the article by Chen et al. (2016), whose section entitled "Filtering to remove low counts" should allow you to fully understand how filterByExpr() operates under the hood and what the parameter values mean. I personally like to look at the distribution of logCPM units of expression averaged per gene, i.e.:

dge <- DGEList(expected_count)
dge.norm <- calcNormFactors(dge)
normLogCPMs <- cpm(dge.norm, log=TRUE, normalized.lib.sizes=TRUE)
hist(rowMeans(normLogCPMs))

which should show you two modes, one on the left corresponding to lowly-expressed genes and one on the right corresponding to the expressed ones. In general, you want to keep the genes that produce the mode on the right. Looking at the plot you can decide a cutoff and build a logical mask on those average expression values:

keep <- rowMeans(normLogCPMs) > cutoff
dge.norm.filt <- dge.norm[keep, ]

or use filterByExpr() to fine tune the filtering strategy. Regarding what genes/transcripts (rows in your expression data matrix) should you use with GSVA, this is only relevant with respect to what gene sets are you going to use. If your gene sets do not have any lncRNA annotated to them, then they will be discarded from the GSVA calculations. What gene sets do you give as input is as important as what expression values you're giving. Its important that your gene sets relevant to the biology of what you are studying and that the most relevant ones include a minimum number of reliably expressed genes.

ADD REPLY • link 3.5 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Thanks so much sir! It's a great help to me!

ADD REPLY • link 3.5 years ago Yang Shi ▴ 10

0

Entering edit mode

I also wonder whether some kinds of gene type, e.g. lncRNA, are okay even the geneset doesn't have these kinds of type. IMO, the unrelevant gene types are not affect the results of GSVA. Is that correct sir?

ADD REPLY • link 3.5 years ago Yang Shi ▴ 10

1

Entering edit mode

The overall number of rows in your gene expression data matrix, i.e., the overall number of genes/transcripts, has an effect for methods such as GSVA or ssGSEA since they are what we may call competitive methods, because they calculate their score based on differences in expression ranks between genes inside and outside the gene set (as opposed to self contained methods that only use information of genes inside the gene set, such is in the z-score method). A few genes more or less outside gene sets will have a negligible impact, specially for small gene sets, but if you want assess the impact of lncRNAs in your calculations just do them twice, once with them included and other time excluding them, and check whether the results change. In any case, the filtering for lowly-expressed genes is the best way to have an educated guess at what genes should be kept in the analysis.