Hey guys. Good afternoon.
I would like to split the TCGA-STAD samples into two groups, one with high expression of gene x and one with low expression of gene x. I want to define these groups by quartiles, keeping only the samples in the upper and lower quartiles. I would then run a DESeq2 differential expression analysis between the two groups, since I hypothesize that they have different biological characteristics.
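To make the grouping concrete, here is a minimal sketch of the quartile split I have in mind, in plain Python. The sample names and expression values are made up for illustration (they are not TCGA data), and `split_by_quartiles` is just a hypothetical helper name. Samples at or below the first quartile go to the "low" group, samples at or above the third quartile go to the "high" group, and the middle half is dropped.

```python
from statistics import quantiles

# Hypothetical expression values for gene x, one per sample (toy data, not TCGA).
expr_gene_x = {
    "S1": 2.0, "S2": 5.0, "S3": 9.0, "S4": 14.0,
    "S5": 20.0, "S6": 31.0, "S7": 45.0, "S8": 60.0,
}

def split_by_quartiles(expr_by_sample):
    """Return (low, high) sample lists based on the 1st and 3rd quartiles."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = quantiles(expr_by_sample.values(), n=4, method="inclusive")
    low = sorted(s for s, v in expr_by_sample.items() if v <= q1)
    high = sorted(s for s, v in expr_by_sample.items() if v >= q3)
    return low, high

low, high = split_by_quartiles(expr_gene_x)
print("low group: ", low)   # ['S1', 'S2']
print("high group:", high)  # ['S7', 'S8']
```

The resulting low/high labels are what I would then use as the group variable in the design.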
However, I am unsure which data I should use for the quartile split: the raw counts, the TPM-normalized data, or the counts normalized by DESeq2.
Here is what I had in mind. Since DESeq2 requires a design as input, I would create a "fake" variable, randomly assigning 1 or 2 to each sample, and use it as the group variable just to obtain the normalized counts, since the design does not affect normalization. After obtaining the normalized data, I would compute the quartiles of gene x and identify the patients that fall into each one. I would then use that information to subset the raw counts and to build the real design for the DESeq2 run. However, this feels wrong to me.
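To illustrate why I think the design should not matter for this step, here is a rough sketch of a median-of-ratios style normalization, which is my understanding of how DESeq2 derives its size factors. This is my own toy re-implementation in plain Python, not DESeq2's code; the counts are invented and `size_factors` is a hypothetical helper. Note that the function only ever sees the count matrix, never a group label.

```python
from math import exp, log
from statistics import median

# Toy raw counts (hypothetical): gene -> counts per sample, same sample order.
samples = ["S1", "S2", "S3"]
raw_counts = {
    "geneA":  [10, 20, 10],
    "geneB":  [100, 200, 100],
    "geneC":  [50, 100, 50],
    "gene_x": [5, 10, 40],
    "geneD":  [0, 3, 1],   # contains a zero, so it is skipped below
}

def size_factors(counts):
    """Median-of-ratios size factors; uses only the counts, no design."""
    n_samples = len(next(iter(counts.values())))
    # Log geometric mean per gene, using only genes with no zero counts.
    log_geo_mean = {
        gene: sum(log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the gene's geometric mean.
        log_ratios = [log(counts[g][j]) - lgm for g, lgm in log_geo_mean.items()]
        factors.append(exp(median(log_ratios)))
    return factors

sf = size_factors(raw_counts)
# Normalized gene x values: raw count divided by the sample's size factor.
norm_gene_x = [c / f for c, f in zip(raw_counts["gene_x"], sf)]
for name, value in zip(samples, norm_gene_x):
    print(name, round(value, 2))
```

In this toy example S2 is simply S1 sequenced twice as deep, so the two end up with the same normalized gene x value, while S3 stays clearly higher. The quartile split would then be done on these normalized values, and the resulting sample lists would be used to subset the raw counts.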
Could anyone offer a suggestion? I couldn't find an existing thread about this.
Thanks in advance.