Hello,
So, I have read count matrix from TCGA LIHC and its metadata. The project consist of around 400 samples (around 50 normal).
I have run DESeq2 to calculate logfoldchange and log expression. I have the result.
Then, I decided to do some clustering to get smaller size of the samples. From this clustering, I choose around 150 samples (31 normal).
Then, I run this subset data with DESeq2.
After I got both of the result, I tried to compare the logfoldchange if I use all sample and if I use only subest of it. I create scatter plot for logfoldchange subset vs all samples. Surprisingly, the result is not really correlated. I attached the plot.
Then, from the log expression that I got from DESeq2 for both subset and all samples, I tried to calculate the mean for normal and tumor. The correlation for normal category and tumor category if I use all sample vs subset is really high. I attached the plot.
My question is, if the rowmeans for both category from all data vs subset has a really high similarity, why is the logfoldchange calculated from it totally different?
From manually checking the logfoldchange, I found several genes that are not differentially express (around 0 in the logfoldchange) if I use all data samples, but really highly downregulated (around -3 in the logfoldchange) if I use subset data samples.
This is the link for the data I use: https://www.dropbox.com/s/l4eszfvatr8627r/question.tar.gz?dl=0
If anyone knows why this happen, please help me.
UPDATE: my code and session info are posted here:
Hello, I've updated the post with the code that I used and my session info
There’s a lot going on inside the function above. Are you sure that you are performing the same contrast when you call results() both times? Can you print out the two results objects? At the top of the DESeqResults table it will print out the two levels being compared.
I don't understand which object that you mean. In that code, it is really simple I think.
Line 5 and 6 is reading from inputted file and line 7 actually setting the category. In the TCGA data, there are 3 types of sample type, Solid Tumor Tissue, Solid Tissue Normal, and Recurrent Tumor. In that line, I just change the name Solid Tissue Normal to "normal" and the remaining to "tumor". So, there are only 2 types of category, normal and tumor.
Line 10-13 just remove rows (representing gene) that has rowMeans<=10, so the remaining read count matrix consists of genes that has on average >10 read count. The remaining is DESeq2 workflow that I copy directly from DESeq2 tutorial.
Why are you subsetting the samples? And how exactly do you do this?
The reason I subset my sample is the DESeq2 log fold change is really weird. There are not many genes that are differentially expressed and only around ~600 genes with p-value adjusted less than 0.05 (~500 genes have logfoldchange >0.65 or <-0.65, |0.65| is my cut for up/down regulated). So, I noticed that the calculation result produced many not significant result. That's why I was thinking maybe in the normal and tumor samples, there are many data variation that makes the calculation can not produce significant result.
It was proven after I use hierarchical clustering. The normal sample can be divided into 2 subgroup and the tumor samples can be divided into several groups with quite big height difference. From these clustering result, I choose group that has the most member for both normal and tumor category. That's why I've chosen only 30 from 50 normal sample and 100 from more than 300 tumor sample.
I subset the samples using hierarchical clustering. I calculated the distance matrix using euclidean distance. I use DESeq2 assay result for the input to calculate distance matrix.
It is certainly possible to get totally different LFC after subsetting both condition groups using a data driven procedure. I can easily imagine how this could happen. So while I'd be careful to make sure you've tracked the metadata and counts across the subsampling, the loss of correlation from full data to subset doesn't indicate a problem with software, but something about the heterogeneity of your data.
This is also a procedure which can inflate the false positive rate. By peeking at the data, splitting by the observed distances, and then performing testing of clusters in each group against each other, you've necessarily reduced the within-group variance (because you've selected clusters of data within each group).
Additional info, I have processed both all sample and subset sample read count matrix using Limma/Voom and the result is more weird. If I compare log fold change result between DESeq2 and Limma/Voom for subset sample, the logfoldchange data is really similar. If I calculate Pearson correlation of logfoldchange from DEseq2 and Limma/Voom, the correlation is 0.83.
But, if I do the same for the result using all sample, I got almost 0 correlation between Limma/voom result and DESeq2 result.