edgeR: multi-subject time series analysis
1
0
Entering edit mode
@bharathananth-10049
Last seen 3.0 years ago

Hi

I have a time series RNA-seq data from multiple subjects -- 14 time points each for 11 subjects. I am interested in finding inter-individual differences in expression and genes with similar expression across subjects. I looked at the example 3.5 in the EdgeR User Manual. I fit using the QL framework two models:

model1 <- model.matrix( ~ pattern + batch_effects)

model2 <- model.matrix(~pattern:subjects + batch_effects)

Weirdly, for the same fdr cutoff, I get more genes in model1 compared to model2. I find this rather odd. I would have expected that there are more possibilities for fitting individual patterns for each subject than one common profile across subjects. Has this something to do with estimating dispersion in a model dependent manner?

Is it better for achieve my goal, to use model2 with an "analysis of deviance" approach and appropriate constraints to check pattern:subjects equal across subjects ?

Or should I use a multi-level solution more like edgeR vs voom

Your input would be very much appreciates.

Thanks

BA

timecourse multiple groups edgeR • 753 views
0
Entering edit mode

What is pattern? What is subject? What is batch_effect? What contrasts are you performing, and what code are you using to do them? Give us a little example like this:

pattern <- rep(1:14, 11) # time points
subjects <- rep(1:11, each=14) # subjects
batch_effects <- ??? # not described in your post

0
Entering edit mode
t <- seq(0,42, by=3)

c <- cos(2*pi*t/T)  # a time course basis function

s <- sin(2*pi*t/T)

batch_effect <- (e.g., subjects (1,4,10), (2,5,11), (3,6,9), (7,8) were sequenced together ) # sequencing libraries

subjects <- rep(1:11, each = 14)

Initially I only, used

design1 <- model.matrix( ~ c + s  + batch_effects)

design2 <- model.matrix(~ c:subjects + s:subjects + batch_effects)
# After appropriate estimation of dispersion for each model

fit_1 <- glmQLFit(y, design1, robust = TRUE)
qlf_1 <- glmQLFTest(fit_1, coef = c(2,3))
fit_1 <- glmQLFit(y, design2, robust = TRUE)
qlf_2 <- glmQLFTest(fit_2, coef = 2:23)
fdr.cutoff <- 0.05

The number of significant genes under the two models are

N1 <- sum(p.adjust(qlf_1$table$PValue, method = "BH") < fdr.cutoff)
N2 <- sum(p.adjust(qlf_2$table$PValue, method = "BH") < fdr.cutoff)

I get N1 > N2, although I would have expected N2 > N1. Or maybe, I constructing these tests incorrectly?

I would like to know inter-subject variability of these oscillating genes and also which genes are oscillating in common across subjects.

0
Entering edit mode

Your example doesn't work. s and c are not the same length as subjects, and batch_effects is not defined. I'd like something that actually works so I can see the output of model.matrix.

0
Entering edit mode

Sorry, I didnt realize you want a piece of code you could just run. Here goes ....

T <- 24
t <- rep(seq(3,42, by=3), 11)
c <- cos(2*pi*t/T)  # a time course basis function
s <- sin(2*pi*t/T)
batch_effect <- factor(ceiling(runif(14*11)*11))

subjects <- sprintf("P%02d", rep(1:11, each = 14))
design1 <- model.matrix( ~ c + s  + batch_effects)

design2 <- model.matrix(~ c:subjects + s:subjects + batch_effects)
0
Entering edit mode

This is still not quite correct, batch_effects should be batch_effect.

0
Entering edit mode
Aaron Lun ★ 26k
@alun
Last seen 9 hours ago
The city by the bay

set.seed(0)
T <- 24
t <- rep(seq(3,42, by=3), 11)
c <- cos(2*pi*t/T)  # a time course basis function
s <- sin(2*pi*t/T)
batch_effects <- factor(ceiling(runif(14*11)*11))
subjects <- sprintf("P%02d", rep(1:11, each = 14))
design2 <- model.matrix(~ c:subjects + s:subjects + batch_effects)


... gives me the following for colnames(design2):

 [1] "(Intercept)"     "batch_effects2"  "batch_effects3"  "batch_effects4"
[5] "batch_effects5"  "batch_effects6"  "batch_effects7"  "batch_effects8"
[9] "batch_effects9"  "batch_effects10" "batch_effects11" "c:subjectsP01"
[13] "c:subjectsP02"   "c:subjectsP03"   "c:subjectsP04"   "c:subjectsP05"
[17] "c:subjectsP06"   "c:subjectsP07"   "c:subjectsP08"   "c:subjectsP09"
[21] "c:subjectsP10"   "c:subjectsP11"   "subjectsP01:s"   "subjectsP02:s"
[25] "subjectsP03:s"   "subjectsP04:s"   "subjectsP05:s"   "subjectsP06:s"
[29] "subjectsP07:s"   "subjectsP08:s"   "subjectsP09:s"   "subjectsP10:s"
[33] "subjectsP11:s"


... so setting coef=2:23 will be mostly testing for the batch effect, rather than the effect of time.

I will assume that you actually intended to drop all of the subject-time interaction terms in design2. In this model, you are allowing each subject to have its own behaviour with respect to time, as represented by the interaction terms for each subject. When you drop the interaction terms, your null hypothesis is that there is no time effect in any of your subjects.

In contrast, with design1, you are assuming that all subjects have the same behaviour with respect to time. This means that when you set coef=2:3, you are testing whether the average effect of time across all patients is zero. This is clearly a different null hypothesis than that for design2, so it's not surprising that you get different numbers of DE genes.

There is no guarantee that the (corrected) design2 contrast will yield more DE genes than your contrast with design1. Yes, there is greater scope for variability in the true time effect between subjects - but there is also greater scope for random variability, which needs to be properly handled by the test in order to avoid detecting spurious differences between subjects. You also lose many residual d.f. due to the complexity of the model, reducing the precision of the dispersion estimates and potentially reducing power.

It seems like design2 is what you actually want to answer your scientific question. This model actually allows for differences in the time effect between subjects, which is what you're interested in - whereas design1 does not.

0
Entering edit mode

A further question. In design1, are we testing that the average time behavior is the same or that time behavior is the same for all subjects? I guess that is a subtle difference. The former would permit genes that have the same behavior in say 6 out of 11 subjects.

0
Entering edit mode

With design1 and coef=2:3, you are testing whether the average time effect across all subjects is zero. The concept of subject-specific behaviour does not even exist in this parametrisation. All you can say, for the genes that reject the null, is that there is a non-zero average effect in those genes. You can't make any statements about a time effect being present in a majority of subjects, because it's entirely possible to get an average effect that's driven by a subset of subjects.