Question: Pre-processing of RNA sequencing data for GSVA
24 months ago by
t.kuilman140
Netherlands
t.kuilman140 wrote:

I would like to run gene set variation analysis (GSVA) on RNA sequencing data using the GSVA package, but I am unsure what kind of input I am supposed to use. In the vignette, an eset (pickrellCountsArgonneCQNcommon_eset) is used as input for the gsva() function:

> head(exprs(pickrellCountsArgonneCQNcommon_eset)[, 1:6])
      NA19099 NA18523 NA19144 NA19137 NA18861 NA19116
8567      326     209     318     343     331     340
23139     255     169     245     361     274     239
7580       72      69     124      76     104     146
55619     487     590     678     540     502     989
3008        7       9      24       4      14      24
5162      511     272     269     450     488     475

The data consist only of integers, which suggests that the input should be raw counts that have not been normalized or log-transformed. The vignette also describes no prior normalization of these data. On the other hand, several publications I have read mention using normalized count data for GSVA. What is the appropriate type of input for the gsva() function on RNA sequencing data? If it is raw read counts, what exact pre-processing is performed 'under the hood'? If it is normalized data, is a simple CPM enough, or should I use RPKM / FPKM / TPM, which do take gene length into account?
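
To make the CPM versus TPM distinction concrete, here is a minimal base R sketch of how I understand the two normalizations. It uses the first three genes and two samples from the output above. The gene lengths are made-up values for illustration only, and the library sizes are just the sums of the three rows shown, not the real sequencing depths. It is not meant to represent what gsva() does internally.

```r
# Raw counts for the first three genes and two samples shown above.
counts <- matrix(
  c(326, 209,
    255, 169,
     72,  69),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("8567", "23139", "7580"), c("NA19099", "NA18523"))
)

# Hypothetical gene lengths in base pairs (assumed, NOT from the vignette data).
gene_length_bp <- c("8567" = 2000, "23139" = 1000, "7580" = 500)

# CPM: divide each sample (column) by its library size, then scale to 1e6.
# Here the "library size" is only the sum of the three rows shown.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# TPM: divide by gene length in kb first, then scale each sample to 1e6.
rate <- counts / (gene_length_bp / 1000)
tpm <- sweep(rate, 2, colSums(rate), "/") * 1e6

# Log transform with a pseudocount of 1 to avoid log2(0).
log_cpm <- log2(cpm + 1)
log_tpm <- log2(tpm + 1)
```

Both cpm and tpm sum to one million per sample, but only tpm changes the relative ranking of genes within a sample according to their length.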