Question

edgeR: multi-subject time series analysis

0

Entering edit mode

BharathAnanth ▴ 80

@bharathananth-10049

Last seen 6.1 years ago

Hi

I have a time series RNA-seq data from multiple subjects -- 14 time points each for 11 subjects. I am interested in finding inter-individual differences in expression and genes with similar expression across subjects. I looked at the example 3.5 in the EdgeR User Manual. I fit using the QL framework two models:

model1 <- model.matrix( ~ pattern + batch_effects)

model2 <- model.matrix(~pattern:subjects + batch_effects)

Weirdly, for the same fdr cutoff, I get more genes in model1 compared to model2. I find this rather odd. I would have expected that there are more possibilities for fitting individual patterns for each subject than one common profile across subjects. Has this something to do with estimating dispersion in a model dependent manner?

Is it better for achieve my goal, to use model2 with an "analysis of deviance" approach and appropriate constraints to check pattern:subjects equal across subjects ?

Or should I use a multi-level solution more like edgeR vs voom

Your input would be very much appreciates.

Thanks

BA

timecourse multiple groups edgeR • 1.7k views

ADD COMMENT • link updated 6.7 years ago by Aaron Lun ★ 28k • written 6.7 years ago by BharathAnanth ▴ 80

0

Entering edit mode

What is pattern? What is subject? What is batch_effect? What contrasts are you performing, and what code are you using to do them? Give us a little example like this:

pattern <- rep(1:14, 11) # time points
subjects <- rep(1:11, each=14) # subjects
batch_effects <- ??? # not described in your post

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

t <- seq(0,42, by=3)

c <- cos(2*pi*t/T)  # a time course basis function

s <- sin(2*pi*t/T)

batch_effect <- (e.g., subjects (1,4,10), (2,5,11), (3,6,9), (7,8) were sequenced together ) # sequencing libraries

subjects <- rep(1:11, each = 14)

Initially I only, used

design1 <- model.matrix( ~ c + s  + batch_effects)

design2 <- model.matrix(~ c:subjects + s:subjects + batch_effects)

# After appropriate estimation of dispersion for each model

fit_1 <- glmQLFit(y, design1, robust = TRUE)
qlf_1 <- glmQLFTest(fit_1, coef = c(2,3))

fit_1 <- glmQLFit(y, design2, robust = TRUE)
qlf_2 <- glmQLFTest(fit_2, coef = 2:23)

fdr.cutoff <- 0.05

The number of significant genes under the two models are

N1 <- sum(p.adjust(qlf_1$table$PValue, method = "BH") < fdr.cutoff)
N2 <- sum(p.adjust(qlf_2$table$PValue, method = "BH") < fdr.cutoff)

I get N1 > N2, although I would have expected N2 > N1. Or maybe, I constructing these tests incorrectly?

I would like to know inter-subject variability of these oscillating genes and also which genes are oscillating in common across subjects.

ADD REPLY • link 6.7 years ago BharathAnanth ▴ 80

0

Entering edit mode

Your example doesn't work. s and c are not the same length as subjects, and batch_effects is not defined. I'd like something that actually works so I can see the output of model.matrix.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Sorry, I didnt realize you want a piece of code you could just run. Here goes ....

T <- 24

t <- rep(seq(3,42, by=3), 11)

c <- cos(2*pi*t/T)  # a time course basis function

s <- sin(2*pi*t/T)

batch_effect <- factor(ceiling(runif(14*11)*11))

subjects <- sprintf("P%02d", rep(1:11, each = 14))

design1 <- model.matrix( ~ c + s  + batch_effects)

design2 <- model.matrix(~ c:subjects + s:subjects + batch_effects)

ADD REPLY • link 6.7 years ago BharathAnanth ▴ 80

0

Entering edit mode

This is still not quite correct, batch_effects should be batch_effect.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

score 0 · Answer 1 · 2017-11-30

Taking your last comment, correcting your error and running your code:

set.seed(0)
T <- 24
t <- rep(seq(3,42, by=3), 11)
c <- cos(2*pi*t/T)  # a time course basis function
s <- sin(2*pi*t/T)
batch_effects <- factor(ceiling(runif(14*11)*11))
subjects <- sprintf("P%02d", rep(1:11, each = 14))
design2 <- model.matrix(~ c:subjects + s:subjects + batch_effects)

... gives me the following for colnames(design2):

 [1] "(Intercept)"     "batch_effects2"  "batch_effects3"  "batch_effects4" 
 [5] "batch_effects5"  "batch_effects6"  "batch_effects7"  "batch_effects8" 
 [9] "batch_effects9"  "batch_effects10" "batch_effects11" "c:subjectsP01"  
[13] "c:subjectsP02"   "c:subjectsP03"   "c:subjectsP04"   "c:subjectsP05"  
[17] "c:subjectsP06"   "c:subjectsP07"   "c:subjectsP08"   "c:subjectsP09"  
[21] "c:subjectsP10"   "c:subjectsP11"   "subjectsP01:s"   "subjectsP02:s"  
[25] "subjectsP03:s"   "subjectsP04:s"   "subjectsP05:s"   "subjectsP06:s"  
[29] "subjectsP07:s"   "subjectsP08:s"   "subjectsP09:s"   "subjectsP10:s"  
[33] "subjectsP11:s"

... so setting coef=2:23 will be mostly testing for the batch effect, rather than the effect of time.

I will assume that you actually intended to drop all of the subject-time interaction terms in design2. In this model, you are allowing each subject to have its own behaviour with respect to time, as represented by the interaction terms for each subject. When you drop the interaction terms, your null hypothesis is that there is no time effect in any of your subjects.

In contrast, with design1, you are assuming that all subjects have the same behaviour with respect to time. This means that when you set coef=2:3, you are testing whether the average effect of time across all patients is zero. This is clearly a different null hypothesis than that for design2, so it's not surprising that you get different numbers of DE genes.

There is no guarantee that the (corrected) design2 contrast will yield more DE genes than your contrast with design1. Yes, there is greater scope for variability in the true time effect between subjects - but there is also greater scope for random variability, which needs to be properly handled by the test in order to avoid detecting spurious differences between subjects. You also lose many residual d.f. due to the complexity of the model, reducing the precision of the dispersion estimates and potentially reducing power.

It seems like design2 is what you actually want to answer your scientific question. This model actually allows for differences in the time effect between subjects, which is what you're interested in - whereas design1 does not.