How to run GSVA with partially NA genes
blakebowen • 0
@25cc0172
Last seen 6 days ago
Australia

Hello,

I've been trying to run GSVA and have come across the warning message below. None of my genes has constant expression across all samples, so the warning appears to be triggered by genes with partial NA values (missing data for a subset of samples), even though the row variance of the observed values is > 0. The NA values are a result of combining multiple microarray and RNA-seq datasets. Any gene with one or more NA values is removed from the analysis, and some of these genes belong to the gene sets I am testing, so I would like to avoid this if possible.

Is there any way to perform GSVA with these partially NA genes included in the analysis? Or are there any alternative suggestions for how best to perform this analysis?

Thanks in advance for the help!

> gsva_res <- gsva(expr = rna_mat, gset.idx.list = gene_set_list)

Estimating GSVA scores for 10 gene sets.
Estimating ECDFs with Gaussian kernels
|===============================================================================================| 100%

Warning messages:
1: In .filterFeatures(expr, method) :
11204 genes with constant expression values throuhgout the samples.
2: In .filterFeatures(expr, method) :
Since argument method!="ssgsea", genes with constant expression values are discarded.

> # how many genes have at least one NA value
> table(rowSums(is.na(rna_mat)) > 0)

FALSE  TRUE
12066 11204

GSVA • 108 views
Robert Castelo ★ 2.9k
@rcastelo
Last seen 2 days ago
Barcelona/Universitat Pompeu Fabra

hi,

a warning is not an error and this particular warning says that genes with constant expression values are discarded from calculations, which in general seems like a wise thing to do. are the results you are getting still not useful for you?

you say you have NA values, which is something quite unusual in transcriptomics data, so the warning about genes with constant gene expression profiles is likely to be caused by genes that either have all their values set as NA, or have a combination of NA values and constant non-NA values. the methods implemented in GSVA do not expect to find NA values so if you really want to keep those gene expression profiles, you should impute missing values in some way similar to what people do with proteomics data. however, keep in mind that imputing large amounts of NA values can severely bias your analysis.
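
To make the imputation idea concrete, here is a minimal base R sketch. It is only one simple possibility, not something GSVA does itself. The toy matrix, the 30% cutoff (`max_na_frac`) and the helper `impute_row_median` are assumptions for illustration; they are not part of GSVA or of the original data. Genes with too many missing values are dropped, and the remaining NA values are replaced by that gene's median over the observed samples.

```r
# Toy matrix standing in for rna_mat (genes in rows, samples in columns).
set.seed(1)
rna_mat <- matrix(rnorm(40), nrow = 5,
                  dimnames = list(paste0("gene", 1:5), paste0("s", 1:8)))
rna_mat["gene2", 1:2] <- NA   # partially missing (25% NA)
rna_mat["gene3", 1:6] <- NA   # mostly missing (75% NA)
rna_mat["gene4", ] <- NA      # entirely missing

# Fraction of missing values per gene.
na_frac <- rowMeans(is.na(rna_mat))

# Hypothetical cutoff: keep genes with at most 30% missing values.
max_na_frac <- 0.3
filtered <- rna_mat[na_frac <= max_na_frac, , drop = FALSE]

# Hypothetical helper: replace the NA values of one gene with the
# median of its observed values.
impute_row_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# apply() over rows returns samples x genes, so transpose back.
imputed <- t(apply(filtered, 1, impute_row_median))
dimnames(imputed) <- dimnames(filtered)

# The imputed matrix could then be passed to the call from the question:
# gsva_res <- gsva(expr = imputed, gset.idx.list = gene_set_list)
```

Row-median imputation is crude, and it ignores platform differences when microarray and RNA-seq data are combined. More principled approaches exist (for example k-nearest-neighbour imputation), and the caveat about bias applies to all of them, so a strict cutoff on the NA fraction is the safer choice.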

0

I was a little worried about discarding so many genes, although it sounds like this can't be helped. The results are still quite useful, so I think I will not go down the imputation route. Cheers!
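
For anyone in the same situation, a quick check of how much each gene set is affected by the discarded genes may help in deciding whether the results can be trusted. The sketch below uses toy stand-ins for `rna_mat` and `gene_set_list`. The table name `coverage` and its columns are made up for illustration.

```r
# Toy inputs standing in for rna_mat and gene_set_list from the question.
rna_mat <- matrix(c(1,  2,  3,
                    4,  NA, 6,
                    7,  8,  9,
                    NA, NA, 3),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("A", "B", "C", "D"), c("s1", "s2", "s3")))
gene_set_list <- list(set1 = c("A", "B", "C"), set2 = c("B", "D", "E"))

# Genes without any NA value; in the question, genes with at least one NA
# were the ones discarded.
complete_genes <- rownames(rna_mat)[rowSums(is.na(rna_mat)) == 0]

# Per gene set: total size, genes present in the matrix, genes without NA.
coverage <- data.frame(
  set_size  = lengths(gene_set_list),
  in_matrix = sapply(gene_set_list, function(g) sum(g %in% rownames(rna_mat))),
  complete  = sapply(gene_set_list, function(g) sum(g %in% complete_genes))
)
coverage$frac_complete <- coverage$complete / coverage$set_size
print(coverage)
```

Gene sets with a low `frac_complete` are scored from only a few of their genes, so their scores deserve more caution than those of well-covered sets.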