DESeq2 - groups with biological replicates from the same patient
S ▴ 10

Hello, I have the following dataset.

Sample   Patient   Condition
Sample1  Patient1  Condition1
Sample2  Patient1  Condition1
Sample3  Patient2  Condition1
Sample4  Patient2  Condition1
Sample5  Patient3  Condition2
Sample6  Patient3  Condition2
Sample7  Patient4  Condition2
Sample8  Patient4  Condition2

Samples from the same patient are biological replicates: they were taken from the same culture on different days and processed separately. Should I add Patient to the design formula (~Patient + Condition), or would it be fine to leave it as ~Condition?

Thanks in advance.

Best regards,

S.


I would be pretty worried about using ~condition alone. You have clustered data here, and failing to account for clustering can often lead to a high type I error rate, e.g. as described here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6634702/


Hi Oscar, there is no mention of clustering.


Hi, thanks for the quick response. I'm not a statistician, so forgive me if I am using the wrong term. By clustered data I meant that the data from Condition 1 are likely to fall into two distinct clusters (one from Patient 1 and one from Patient 2). The linked paper describes the issues with this (though admittedly in a non-RNA-seq context).


Precisely why I said:

"please check the PCA bi-plots to assess sample grouping."


Yes, I see that. So, to expand: if strong grouping IS seen, then the ~condition approach is not appropriate, as it will likely lead to poorly controlled type I error.

Given that such grouping/clustering is very likely in such a scenario, what would you suggest in this situation?

Kevin Blighe ★ 3.9k

I see no major issue using just ~ condition. Would we ever expect a situation whereby, e.g., Patient1 had both Condition1 and Condition2 (?) - rhetorical question.

If you proceed with just ~ condition, please check the PCA bi-plots to assess sample grouping.

"they were taken from the same culture in different days"

Keep in mind, therefore, that time may have an effect.

Kevin


Hello. Thanks very much for your reply. The same patient won't be in both groups (the experiment does not have paired samples). I just have biological replicates from the same patient within the same group. These biological replicates should be almost identical, but they were collected on different days and, as you say, time might have an effect. Thanks.

ADD REPLY
0
Entering edit mode

I find this response rather surprising. Sample1 and Sample2 (for example) are not independent because they are derived from the same patient. By solely using ~condition are you not telling DESeq2 that Sample1 and Sample2 are independent replicates from Condition1? This will surely greatly increase the chance of false positive detection of differentially expressed genes?

I did some simulations to test this under conditions with 0 differentially expressed genes. As predicted, in my tests taking two samples from the same 'patient' results in huge numbers of false positive differentially expressed genes, whereas taking only a single sample from each (or combining counts) leads to almost zero false positives.
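For anyone who wants to see the general effect for themselves, here is a minimal sketch in Python. It is not the simulation described above, and it makes strong simplifying assumptions: Gaussian values per gene instead of negative binomial counts, a plain two-sample t-test instead of DESeq2, and made-up variance components (`patient_sd`, `sample_sd`). There is no true condition effect, so every rejection is a false positive.

```python
import random
import statistics

T_CRIT_DF6 = 2.447  # two-sided 5% critical value of the t distribution, 6 df
T_CRIT_DF2 = 4.303  # two-sided 5% critical value of the t distribution, 2 df


def t_statistic(a, b):
    """Equal-variance two-sample t statistic for two equally sized groups."""
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    return (statistics.mean(a) - statistics.mean(b)) / (2 * pooled_var / n) ** 0.5


def simulate_group(rng, patient_sd, sample_sd, n_patients=2, n_reps=2):
    """One list of replicate values per patient; no condition effect is added."""
    group = []
    for _ in range(n_patients):
        patient_level = rng.gauss(0.0, patient_sd)  # shared by a patient's replicates
        group.append([patient_level + rng.gauss(0.0, sample_sd) for _ in range(n_reps)])
    return group


def false_positive_rates(n_genes=2000, patient_sd=1.0, sample_sd=0.3, seed=1):
    """Fraction of null genes called significant at the 5% level, per approach."""
    rng = random.Random(seed)
    naive_hits = collapsed_hits = 0
    for _ in range(n_genes):
        g1 = simulate_group(rng, patient_sd, sample_sd)
        g2 = simulate_group(rng, patient_sd, sample_sd)
        # Naive: all 4 samples per condition are treated as independent.
        flat1 = [x for patient in g1 for x in patient]
        flat2 = [x for patient in g2 for x in patient]
        if abs(t_statistic(flat1, flat2)) > T_CRIT_DF6:
            naive_hits += 1
        # Collapsed: one mean per patient, so only 2 values per condition.
        means1 = [statistics.mean(patient) for patient in g1]
        means2 = [statistics.mean(patient) for patient in g2]
        if abs(t_statistic(means1, means2)) > T_CRIT_DF2:
            collapsed_hits += 1
    return naive_hits / n_genes, collapsed_hits / n_genes


naive_rate, collapsed_rate = false_positive_rates()
print(f"naive: {naive_rate:.3f}  collapsed: {collapsed_rate:.3f}")
```

With between-patient variation much larger than replicate-to-replicate variation, the naive rate should come out well above the nominal 5%, while the collapsed rate should stay near it. How large the gap is depends entirely on the assumed variance components.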


Hi Oscar, one cannot have the formula ~Patient + Condition here. It makes no sense. Note my comment: "please check the PCA bi-plots to assess sample grouping"
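One way to see why: each patient belongs to exactly one condition, so the Condition column of the design matrix is a sum of Patient columns. The sketch below, in plain Python rather than R, builds a treatment-coded design matrix for the eight samples in the question and computes its rank. The column coding is an assumption about how the factors would be expanded; the labels P1..P4 and C1..C2 are shorthand for the patients and conditions in the question.

```python
from fractions import Fraction

patients = ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"]
conditions = ["C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"]


def design_matrix(patients, conditions):
    """Treatment-coded columns: intercept, P2, P3, P4, C2."""
    rows = []
    for p, c in zip(patients, conditions):
        rows.append([1, int(p == "P2"), int(p == "P3"), int(p == "P4"), int(c == "C2")])
    return rows


def matrix_rank(rows):
    """Matrix rank by exact Gaussian elimination on Fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # this column is a linear combination of earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


X = design_matrix(patients, conditions)
n_cols = len(X[0])
rank = matrix_rank(X)
print(f"columns: {n_cols}, rank: {rank}")  # columns: 5, rank: 4
```

Five coefficients but rank 4 means the Condition effect cannot be separated from the Patient effects (here C2 = P3 + P4), which is the situation DESeq2 reports as a model matrix that is not full rank.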


Understood, but I would be very worried that claiming we have 4 independent replicates per condition will lead to a large type I error rate here.


I see. So, regarding the PCA plots, would your advice be to proceed with ~condition alone only if there is no clear grouping/clustering by patient and the variance is instead dominated by condition?

What would your advice instead be if strong clustering/grouping is observed? Perhaps collapsing the replicates?


My advice would be to not use ChatGPT.


I'm not using ChatGPT. I've been discussing this issue of multiple samples derived from the same patient at length with many colleagues, to try to understand the best way to process such data. I found this forum post through a Google search and was surprised by your answer, so I was seeking clarification. I think we agree that, when there is strong grouping/clustering, using ~condition alone could lead to a very high type I error rate?


If by clustering you mean there might be a within-subject correlation, then yes it's a possibility. But you cannot control for that using a fixed effect. You would need to use limma-voom, blocking on subject, to estimate the within-subject correlation and then fit a generalized least squares model.
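To give a feel for how much a within-subject correlation matters, here is a rough Python sketch using the standard design-effect approximation for equal-sized clusters, 1 + (m - 1) * rho, where m is the number of samples per subject and rho is the within-subject correlation. This is only a back-of-the-envelope illustration, not what limma computes; the function names and the rho values are made up for the example.

```python
def design_effect(cluster_size, rho):
    """Variance inflation of a group mean when samples come in equal-sized clusters."""
    return 1 + (cluster_size - 1) * rho


def effective_sample_size(n_samples, cluster_size, rho):
    """Number of independent samples that would carry the same information."""
    return n_samples / design_effect(cluster_size, rho)


# 4 samples per condition, 2 samples per patient, as in the question.
for rho in (0.0, 0.5, 0.9):
    n_eff = effective_sample_size(n_samples=4, cluster_size=2, rho=rho)
    print(f"rho={rho}: effective samples per condition = {n_eff:.2f}")
```

With rho = 0 the four samples count as four; as rho approaches 1 they count as little more than the two patients they came from.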


Thanks, James, for your helpful response. This particular experimental setup is very common in iPSC studies, where multiple differentiations are performed per cell line, leading to a high potential for within-subject (within-cell-line) correlation. The 'convention' appears to be to simply ignore the fact that there are multiple replicates from the same subject/cell line and treat them all as independent replicates, which will surely lead to increased type I error in many cases.

I guess we should switch to limma-voom for RNA-seq with this kind of experimental design, then? Are you aware of any specific guides for performing the blocking and then fitting a generalized least squares model? Thanks!
