I have a read count matrix from TCGA LIHC along with its metadata. The project consists of around 400 samples (around 50 of them normal).
I ran DESeq2 on all samples to calculate the log fold change and log expression, and I have those results.
Then I did some clustering to reduce the number of samples. From this clustering, I chose around 150 samples (31 normal).
I then ran DESeq2 again on this subset.
With both results in hand, I compared the log fold change from all samples against the log fold change from the subset only, using a scatter plot of subset vs. all samples. Surprisingly, the two are poorly correlated. I attached the plot.
Next, using the log expression that I got from DESeq2 for both the subset and all samples, I calculated the mean per gene for normal and for tumor samples. For both the normal and the tumor category, the correlation between the all-sample means and the subset means is really high. I attached the plot.
My question is: if the row means for both categories are so similar between all data and the subset, why is the log fold change calculated from them totally different?
Checking the log fold changes manually, I found several genes that are not differentially expressed (log fold change around 0) when I use all samples, but strongly downregulated (log fold change around -3) when I use the subset.
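To make the comparison concrete, here is a toy sketch in Python of the arithmetic involved. It is not my DESeq2 code and uses no real data: the gene count, the expression range, and the noise level are all made-up assumptions, and the fold change is taken as a plain difference of log-scale group means, which is only an approximation of what DESeq2 estimates. It only illustrates that group means can correlate highly across genes while the fold changes derived from them do not.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_genes = 2000          # hypothetical number of genes

# Hypothetical per-gene baseline on a log2 scale; genes differ a lot from
# each other, which is what drives the correlation of the group means.
base = [rng.uniform(0, 15) for _ in range(n_genes)]


def noisy(values, sd=0.3):
    """Add independent Gaussian noise (assumed sd) to each gene's value."""
    return [v + rng.gauss(0, sd) for v in values]


# Group means per gene for the two analyses (all samples vs. subset).
# Assumption: no true differential expression, only estimation noise.
normal_all, tumor_all = noisy(base), noisy(base)
normal_sub, tumor_sub = noisy(base), noisy(base)

# Log fold change approximated as the difference of log-scale group means.
lfc_all = [tum - nor for tum, nor in zip(tumor_all, normal_all)]
lfc_sub = [tum - nor for tum, nor in zip(tumor_sub, normal_sub)]

r_normal = pearson(normal_all, normal_sub)
r_tumor = pearson(tumor_all, tumor_sub)
r_lfc = pearson(lfc_all, lfc_sub)

print(f"normal means, all vs subset: r = {r_normal:.3f}")
print(f"tumor means,  all vs subset: r = {r_tumor:.3f}")
print(f"log fold change, all vs subset: r = {r_lfc:.3f}")
```

In this toy setup the between-gene range dominates the correlation of the means but cancels out in the tumor-minus-normal difference, so only the noise is left in the fold changes. I do not think this alone explains genes that go from around 0 to around -3, which is the part I am most confused about.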
This is the link for the data I use: https://www.dropbox.com/s/l4eszfvatr8627r/question.tar.gz?dl=0
If anyone knows why this happens, please help me.
UPDATE: my code and session info are posted here: