I would like to run gene set variation analysis (GSVA) on RNA sequencing data using the GSVA package, but I am confused about what kind of input I am supposed to use. In the vignette, an eset (pickrellCountsArgonneCQNcommon_eset) is used as input for the gsva() function:
> head(exprs(pickrellCountsArgonneCQNcommon_eset)[, 1:6])
      NA19099 NA18523 NA19144 NA19137 NA18861 NA19116
8567      326     209     318     343     331     340
23139     255     169     245     361     274     239
7580       72      69     124      76     104     146
55619     487     590     678     540     502     989
3008        7       9      24       4      14      24
5162      511     272     269     450     488     475
Because the data contain only integers, it looks as though the input should not be normalized or log-transformed in any way. The vignette also describes no prior normalization of these data. On the other hand, several publications I have read mention using normalized count data for GSVA. What is the appropriate type of input for the gsva() function on RNA sequencing data? If it is raw read counts, what exact pre-processing is performed 'under the hood'? If it is normalized data, is a simple CPM enough, or should I use RPKM / FPKM / TPM, which take gene length into account?
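
To make the distinction concrete, here is a minimal base R sketch of what I mean by CPM versus TPM. It uses the first three genes and two samples from the output above. The gene lengths are made up purely for illustration, and the "library size" is just the sum of these three genes rather than the real sequencing depth. This only shows the candidate transformations I am asking about; it is not a claim about what gsva() does internally.

```r
# Toy counts: first three genes and first two samples from the excerpt above.
counts <- matrix(c(326, 255, 72,
                   209, 169, 69),
                 nrow = 3,
                 dimnames = list(c("8567", "23139", "7580"),
                                 c("NA19099", "NA18523")))

# HYPOTHETICAL gene lengths in base pairs (invented for illustration only).
gene_length <- c("8567" = 2000, "23139" = 1000, "7580" = 500)

# CPM: scale each sample (column) by its library size only.
# Here the library size is the column sum of the toy matrix.
cpm <- t(t(counts) / colSums(counts)) * 1e6

# TPM: divide by gene length (in kb) first, then scale each sample
# so that its values sum to one million.
rate <- counts / (gene_length / 1000)   # reads per kilobase, row-wise
tpm <- t(t(rate) / colSums(rate)) * 1e6

# Log-transformed CPM with a pseudocount of 1, as a continuous alternative.
log_cpm <- log2(cpm + 1)
```

In this sketch, CPM preserves the ranking of genes by raw count within a sample, whereas TPM can reorder genes within a sample because each gene is divided by its own length.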
