I have been using the DESeq2 package for RNA-seq data analysis and really like the VST data in log2 units, but I am unsure whether VST data is appropriate for certain analyses.
Specifically, can the VST data be used to calculate a gene signature score (the average across all the genes in a given signature), with the aim of comparing signature scores between samples with and without a given condition? I have generated batch-effect-corrected VST data using DESeq2 and limma.
I understand that VST data, unlike TPM, doesn’t take gene length into account, and so may not be suitable for comparing expression across genes.
Thanks for any feedback!
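For concreteness, the signature score described in the question above (the per-sample mean of VST values over the genes in a signature, then compared between conditions) could be sketched roughly as follows. This is only an illustration: the gene names, the VST values, the condition labels, and the helper `signature_score` are all made up for the example and are not part of DESeq2 or limma.

```python
from statistics import mean


def signature_score(vst, signature):
    """Return one score per sample: the mean VST value over the signature genes.

    vst       -- dict mapping gene name -> list of per-sample VST values
    signature -- iterable of gene names (genes missing from vst are skipped)
    """
    genes = [g for g in signature if g in vst]
    if not genes:
        raise ValueError("none of the signature genes are present in the data")
    n_samples = len(vst[genes[0]])
    return [mean(vst[g][j] for g in genes) for j in range(n_samples)]


# Hypothetical batch-corrected VST values (rows = genes, columns = samples).
vst = {
    "GENE_A": [8.0, 8.5, 10.0, 10.5],
    "GENE_B": [6.0, 6.5, 7.0, 7.5],
    "GENE_C": [9.0, 9.0, 9.0, 9.0],  # not in the signature, so ignored
}
condition = ["ctrl", "ctrl", "treated", "treated"]

scores = signature_score(vst, ["GENE_A", "GENE_B"])

# Average the per-sample scores within each condition for a simple comparison.
group_means = {
    label: mean(s for s, c in zip(scores, condition) if c == label)
    for label in sorted(set(condition))
}

print(scores)
print(group_means)
```

In a real analysis the per-sample scores would go into a proper statistical test rather than a bare comparison of group means.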
Thanks for your reply!
My concern about gene length was for the following reason:
Could calculating a gene signature from expression values that were not normalized for gene length bias the signature scores toward longer genes, which would have had more counts (akin to comparing expression between different genes within a sample)?
Your precision in RNA-seq is inherently influenced by the count. The count is proportional to the length of the transcript and to the expression level. But you can't undo this difference in precision by dividing out the length. The best you can do is stabilize the variance, which ensures that a poorly chosen transformation does not make the imprecise features contribute excessively to the distance metric.
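A small numerical sketch of the point about precision, under a deliberately simplified assumption: counts are treated as Poisson, so a count with mean mu has standard deviation sqrt(mu). Real RNA-seq counts are overdispersed (DESeq2 models them as negative binomial), so this only illustrates the direction of the effect. The gene lengths and expected counts are invented, and the function names are hypothetical.

```python
import math


def poisson_cv(mu):
    """Coefficient of variation (sd / mean) of a Poisson count with mean mu."""
    return math.sqrt(mu) / mu


def cv_after_length_scaling(mu, length_kb):
    """CV of the same count after dividing by gene length.

    Dividing by a constant rescales the mean and the sd by the same factor,
    so the relative precision is unchanged.
    """
    scaled_mean = mu / length_kb
    scaled_sd = math.sqrt(mu) / length_kb
    return scaled_sd / scaled_mean


# Two hypothetical genes with the same expression level per kb but different
# lengths: the expected count is proportional to length times expression.
expression_per_kb = 20.0
genes = {"short_gene": 1.0, "long_gene": 10.0}  # lengths in kb (made up)

for name, length_kb in genes.items():
    mu = expression_per_kb * length_kb
    print(name, poisson_cv(mu), cv_after_length_scaling(mu, length_kb))
```

After length scaling both genes have the same mean (20 per kb), but the short gene is still measured less precisely than the long one, which is why dividing out the length does not remove the precision difference.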
Thanks for your insight!