Repeated Measures, Within-Subject, Between-Groups RNAseq Analysis
1
0
Entering edit mode
wjh180 • 0
@wjh180-15427
Last seen 2.6 years ago

Hello All,

I'm currently working through an analysis of 96 RNAseq samples generated from a rather complicated experimental design and need some help constructing my design matrices. Briefly, the experiment involves repeated measures of 12 individuals across two experimental conditions one week apart (No Stress and Stress). Furthermore, individuals are split into two groups, "Risk" or "Control. So factors are as follows:

Participant: 12 levels (1-12); Status: 2 levels [Risk (N=6) or Control (N=6)]; Session: 2 levels (No_Stress or Stress)

Time: 4 levels (1-4, sample times are the same in both sessions, e.g. No_Stress.1 = Stress.1)

This is summarized in the table below.

Participant Status Session Time Group
1 Control No_Stress 1 No_Stress.1
1 Control No_Stress 2 No_Stress.2
1 Control No_Stress 3 No_Stress.3
1 Control No_Stress 4 No_Stress.4
1 Control Stress 1 Stress.1
1 Control Stress 2 Stress.2
1 Control Stress 3 Stress.3
1 Control Stress 4 Stress.4
...
12 Risk No_Stress 1 No_Stress.1
12 Risk No_Stress 2 No_Stress.2
12 Risk No_Stress 3 No_Stress.3
12 Risk No_Stress 4 No_Stress.
12 Risk Stress 1 Stress.1
12 Risk Stress 2 Stress.2
12 Risk Stress 3 Stress.3
12 Risk Stress 4 Stress.4

I'm interested in the following hypotheses:

i) Which genes are responsive to time (i.e. circadian)?

ii) Which genes are responsive to stress?

iii) Which genes are deferentially responsive to stress as a function of status (Risk versus Control)?

FILTERING AND NORMALIZATION STEPS:

X <- DGEList(counts=counts, genes=rownames(counts))
keep <- rowSums(cpm(X)>=1) >=2
X_filtered <- X[keep,keep.lib.sizes=FALSE]
X_filter_norm <- calcNormFactors(X_filtered)
X_filter_norm <- estimateGLMCommonDisp(X_filter_norm, method="deviance",robust=TRUE,subset=NULL)

I have no true replicates so I used "estimateGLMCommonDisp" with the settings above as was recommended in Section 2.11 of the edgeR User Guide.

QUESTION 1: Is there any preference to using the combined factor "Group" versus nesting Time within Session.

In other words is

time_matrix1 <- model.matrix(~0+Group)

preferred over

time_matrix2 <- model.matrix(~0+Session+Session:Time)

For Questions 2 & 3 I'm assuming using Group is best, but if not I would replace it with "Session+Session:Time" accordingly below.

QUESTION 2: I don't necessarily care about differences at the participant level, but I would like to control for them. Does my inclusion of Participant as a factor in my design matrix take care of this? For example is the model below sufficient to tests hypotheses i and ii?

time_matrix3 <- model.matrix (~0+Group+Participant)
fit <- glmFit(X_filter_norm, time_matrix3)

Or do I need to nest the combined factor within Participant?

time_matrix4 <- model.matrix(~0+Participant+Participant:Group)

QUESTION 3: Is the fully nested model below appropriate for hypothesis iii?

full_matrix <- model.matrix(~Status + Status:Participant + Status:Participant:Group)

Or can I removed the Participant factor and have the matrix below instead?

full_matrix <- model.matrix(~Status + Status:Group)

ANY help or guidance would be much appreciated! I've gotten this far by piecing together different strategies I've read on other posts and in user manuals, but I still haven't found a solid answer or example of a design this complicated.

Thanks!

edgeR rnaseq repeated measures • 761 views
3
Entering edit mode
@ryan-c-thompson-5618
Last seen 12 months ago
Scripps Research, La Jolla, CA

For Question 1, the two designs are equivalent, and you should use whichever one you find more convenient.

For Question 2, I would go with your first design, ~0+Group+Participant. The second design you suggest would fit a separate group effect for every participant, which you have stated is not your goal (and furthermore this design would have as many coefficients as samples and therefore be impossible to fit).

For Question 3, you cannot include both Participant and Status as fixed effects in the same design, because the first is nested within the second. However, omitting Participant entirely is not ideal either. The recommended way to handle this design is to use limma's duplicateCorrelation to fit a mixed model with Participant as a random effect. You'll have to switch from edgeR to voom and limma to do this. (If you want to do this, I recommend you switch the entire analysis to limma so that you can compare results consistently between different models.)

Lastly, there's no reason for you to use estimateGLMCommonDisp. Just use estimateDisp. You do have replicates: you have multiple participants in each status group. Generally, "no replicates" means that your design matrix has as many columns as you have samples, so your model has zero residual degrees of freedom. That's not the case for you unless you want to use time_matrix4, which I recommended against.

0
Entering edit mode

Just to add to Ryan's answer; you can also look for Status-specific Group effects. For example:

design <- model.matrix(~ factor(Participant) + Status:Group, stuff)
design <- design[,!grepl("GroupNo_Stress.1", colnames(design))] # Get to full rank.

The first 12 terms represent patient-specific effects and can be ignored. The remaining terms represent the Status-specific log-fold change of each Group over the no-stress group at time point 1. These can be compared to each other to see if there is any interaction between Status and Group.

However, as mentioned, a direct comparison between Status is not possible while blocking on Participant. In such cases, you have to switch to limma and duplicateCorrelation(). If you want to continue to use edgeR, you need to subset your dataset so that only one sample from each participant is present, to avoid problems from correlations between participants.

Needless to say, make sure Participant is indeed a factor.