Question

DESeq2 result really different when only use some sample vs all sample

0

Entering edit mode

bharata1803 ▴ 60

@bharata1803-7698

Last seen 5.1 years ago

Japan

Hello,

So, I have read count matrix from TCGA LIHC and its metadata. The project consist of around 400 samples (around 50 normal).

I have run DESeq2 to calculate logfoldchange and log expression. I have the result.

Then, I decided to do some clustering to get smaller size of the samples. From this clustering, I choose around 150 samples (31 normal).

Then, I run this subset data with DESeq2.

After I got both of the result, I tried to compare the logfoldchange if I use all sample and if I use only subest of it. I create scatter plot for logfoldchange subset vs all samples. Surprisingly, the result is not really correlated. I attached the plot.

Then, from the log expression that I got from DESeq2 for both subset and all samples, I tried to calculate the mean for normal and tumor. The correlation for normal category and tumor category if I use all sample vs subset is really high. I attached the plot.

My question is, if the rowmeans for both category from all data vs subset has a really high similarity, why is the logfoldchange calculated from it totally different?

From manually checking the logfoldchange, I found several genes that are not differentially express (around 0 in the logfoldchange) if I use all data samples, but really highly downregulated (around -3 in the logfoldchange) if I use subset data samples.

This is the link for the data I use: https://www.dropbox.com/s/l4eszfvatr8627r/question.tar.gz?dl=0

If anyone knows why this happen, please help me.

UPDATE: my code and session info are posted here:

deseq2 rnaseq • 2.5k views

ADD COMMENT • link updated 5 months ago by blowfish • 0 • written 6.5 years ago by bharata1803 ▴ 60

score 0 · Answer 1 · 2017-10-30

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 19 hours ago

United States

Can you post the code you used and the output of sessionInfo()?

ADD COMMENT • link 6.5 years ago Michael Love 41k

0

Entering edit mode

Hello, I've updated the post with the code that I used and my session info

ADD REPLY • link 6.5 years ago bharata1803 ▴ 60

0

Entering edit mode

There’s a lot going on inside the function above. Are you sure that you are performing the same contrast when you call results() both times? Can you print out the two results objects? At the top of the DESeqResults table it will print out the two levels being compared.

ADD REPLY • link 6.5 years ago Michael Love 41k

0

Entering edit mode

I don't understand which object that you mean. In that code, it is really simple I think.

Line 5 and 6 is reading from inputted file and line 7 actually setting the category. In the TCGA data, there are 3 types of sample type, Solid Tumor Tissue, Solid Tissue Normal, and Recurrent Tumor. In that line, I just change the name Solid Tissue Normal to "normal" and the remaining to "tumor". So, there are only 2 types of category, normal and tumor.

Line 10-13 just remove rows (representing gene) that has rowMeans<=10, so the remaining read count matrix consists of genes that has on average >10 read count. The remaining is DESeq2 workflow that I copy directly from DESeq2 tutorial.

ADD REPLY • link 6.5 years ago bharata1803 ▴ 60

0

Entering edit mode

Why are you subsetting the samples? And how exactly do you do this?

ADD REPLY • link 6.5 years ago Michael Love 41k

0

Entering edit mode

The reason I subset my sample is the DESeq2 log fold change is really weird. There are not many genes that are differentially expressed and only around ~600 genes with p-value adjusted less than 0.05 (~500 genes have logfoldchange >0.65 or <-0.65, |0.65| is my cut for up/down regulated). So, I noticed that the calculation result produced many not significant result. That's why I was thinking maybe in the normal and tumor samples, there are many data variation that makes the calculation can not produce significant result.

It was proven after I use hierarchical clustering. The normal sample can be divided into 2 subgroup and the tumor samples can be divided into several groups with quite big height difference. From these clustering result, I choose group that has the most member for both normal and tumor category. That's why I've chosen only 30 from 50 normal sample and 100 from more than 300 tumor sample.

I subset the samples using hierarchical clustering. I calculated the distance matrix using euclidean distance. I use DESeq2 assay result for the input to calculate distance matrix.

ADD REPLY • link 6.5 years ago bharata1803 ▴ 60

0

Entering edit mode

It is certainly possible to get totally different LFC after subsetting both condition groups using a data driven procedure. I can easily imagine how this could happen. So while I'd be careful to make sure you've tracked the metadata and counts across the subsampling, the loss of correlation from full data to subset doesn't indicate a problem with software, but something about the heterogeneity of your data.

This is also a procedure which can inflate the false positive rate. By peeking at the data, splitting by the observed distances, and then performing testing of clusters in each group against each other, you've necessarily reduced the within-group variance (because you've selected clusters of data within each group).

ADD REPLY • link 6.5 years ago Michael Love 41k

0

Entering edit mode

Additional info, I have processed both all sample and subset sample read count matrix using Limma/Voom and the result is more weird. If I compare log fold change result between DESeq2 and Limma/Voom for subset sample, the logfoldchange data is really similar. If I calculate Pearson correlation of logfoldchange from DEseq2 and Limma/Voom, the correlation is 0.83.

But, if I do the same for the result using all sample, I got almost 0 correlation between Limma/voom result and DESeq2 result.

ADD REPLY • link 6.5 years ago bharata1803 ▴ 60

score 0 · Answer 2 · 2023-11-07

Hi, following your post today.

From a biological view this behavior of the foldchanges seems understandable in my eyes. When you have 400 samples from your TCGA LIHC dataset containing normal samples and different disease entities like adenoma and adenocarcinoma, adding different tissue subtypes like bile ducts and pure liver tissue. Isn´t it imaginable that when you cut out 250 samples, that you get a different behavior of the genes over all this choosen samples compared to the full dataset?

Why do we assume, that the wohle dataset behaves the same way as the subset?