i would like to ask a more general question about data transformation methodologies, regarding raw RNA-Seq data for exploratory data analysis, and not for DE expression. In detail, i have identified a small 38 gene signature in a specific type of cancer/TCGA dataset, which shows some interesting results about survival, etc. I want next to test this signature to different types of TCGA datasets, to inspect the pattern of the expression of the genes, and also from clustering to compare the survival estimates of any resulted clusters. For my current downloaded TCGA dataset that i would like to test, i have raw HTSEQ counts for 371 cancer samples. Thus:
1) Which type of transformation or transformations should i follow ? For instance the simple log2(counts +1), which might have a negative impact on very low counts ? Or a more "robust" approach coulbe be implemented ? For example, variance stabilizing transformation from DESeq2 would be "enough ? Or i could alternatively try the cpm transformation from edgeR, although they present some distinct characteristics ?
2) If i would like to use the cpm transformation, i should apply it on the raw counts ? For example:
logCPM.counts <- cpm(dt, prior.count=2, log=TRUE)
dt must be a matrix of raw counts, or should be a DGElist object after TMM normalization in order to account also for sequencing depth ?
Thank you in advance,