I would like to run gene set variation analysis (GSVA) on RNA sequencing data using the GSVA package, but I am unsure what kind of input these analyses expect. In the vignette, an eset (pickrellCountsArgonneCQNcommon_eset) is used as input for the gsva() function:
    > head(exprs(pickrellCountsArgonneCQNcommon_eset)[, 1:6])
          NA19099 NA18523 NA19144 NA19137 NA18861 NA19116
    8567      326     209     318     343     331     340
    23139     255     169     245     361     274     239
    7580       72      69     124      76     104     146
    55619     487     590     678     540     502     989
    3008        7       9      24       4      14      24
    5162      511     272     269     450     488     475
The data contain only integers, which suggests that the input should not be normalized or log-transformed in any way. The vignette also describes no prior normalization of these data. On the other hand, several publications I have read mention using normalized count data for GSVA.

What is the appropriate type of input for the gsva() function on RNA sequencing data? If it is raw read counts, what exact pre-processing is performed 'under the hood'? If it is normalized data, is a simple CPM enough, or should I use RPKM / FPKM / TPM, which take gene length into account?
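
To make the alternatives concrete, here is a minimal base R sketch of what I mean by CPM versus a length-aware measure such as TPM. It uses the first three genes and two samples from the output above. The gene lengths are made up for illustration, and the library sizes are taken over this toy subset only, so the numbers themselves are not meaningful. It is not meant to show what gsva() does internally.

```r
# Toy counts taken from the vignette output above (genes in rows, samples in columns)
counts <- matrix(
  c(326, 255, 72,
    209, 169, 69),
  nrow = 3,
  dimnames = list(c("8567", "23139", "7580"), c("NA19099", "NA18523"))
)

# Hypothetical gene lengths in kilobases (assumed values, not real annotations)
gene_length_kb <- c(2.0, 4.0, 1.0)

# CPM: scale each sample by its library size; gene length is ignored
cpm <- t(t(counts) / colSums(counts)) * 1e6

# TPM: divide each gene by its length first, then scale each sample to one million
rate <- counts / gene_length_kb   # the vector recycles down columns, so row i uses length i
tpm <- t(t(rate) / colSums(rate)) * 1e6

# Log-transformed CPM, as often reported in publications (a pseudocount of 1 is one common choice)
log_cpm <- log2(cpm + 1)
```

Both matrices sum to one million per sample, but only TPM changes the relative ranking of genes within a sample when gene lengths differ.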