Deseq2: How to adjust for baseline read counts when testing for groups differences at post-treatment
2
0
Entering edit mode
Nick • 0
@nick-7728
Last seen 9.6 years ago
United States

Hello,

I am new to the use of DESeq2 and cannot seem to get my head around how to properly set up a baseline adjusted model for an experimental design.

We have conducted an experiment in which 10 subjects were randomized to either treatment or control conditions.  Given the small sample size, randomization was incomplete and minor differences remain in read counts between conditions at baseline.  I would therefore like to fit an ANCOVA “type” model to adjust for baseline read counts when testing for the effect of treatment assignment at the end of the experiment.  Something like:   y = b0 + b1(baseline count) + b2(treatment); where y= read count for gene 1 post-treatment, baseline count = read count for gene 1 pre-treatment, and treatment is an indicator variable for treatment assignment. I would like to do this for ~200 genes.  Basically, how would I set up this model to get the baseline adjusted treatment effect for each of the 200 genes?

I assume this has been described elsewhere, but I have been unable to find a detailed example (only examples for paired designs where each subject experiences both conditions). I have provided a bit of code below to give an idea of what I would like to.

 Thanks in advance for any thoughts or assistance in this matter.

 

library(DESeq2)

dds.pre <- makeExampleDESeqDataSet(betaSD=1, n=10, m=10)

dds.post <- makeExampleDESeqDataSet(betaSD=2, n=10, m=10)

 

colData(dds.pre)

colData(dds.post)

rowData(dds.pre)

rowData(dds.post)

 

#log2fold change for differences in treatment group at end post treatment

dds <- DESeq(dds.post)

res <- results(dds)

res

 

#What I would like to do (something along the lines of...but unsure how to format up the data to properly join the two matrices and get post treatment gene 1 adjusted for baseline gene 1)

dds.post.adj <- formula (~ baseline_count + condition)    

#where baseline_count allows each post treatment gene expression count to be adjusted for its baseline expression count

 

dds.2 <- DESeq(dds.post.adj)

res.2 <- results(dds.2)

res.2

deseq2 deseq • 5.0k views
ADD COMMENT
0
Entering edit mode

Can I rephrase your experiment to see if I got this right?

You essentially have 10 samples, each of which comes from a different person/patient/animal. About half of these samples were "treated" and the others were not.

Now you just want to ask what the relative difference of expression is between the treated and untreated group for each gene?

ADD REPLY
0
Entering edit mode

So, you have two RNA-Seq libraries from each subject, one from before and one from after treatment?

 

ADD REPLY
0
Entering edit mode

Just to echo these other two comments, the best way to convey would just be to give an example of the column data:

subject  condition
1  control
1  treat
2  control
2  treat
...

ADD REPLY
2
Entering edit mode
@mikelove
Last seen 4 days ago
United States

hi Nick,

You wouldn't include the counts in the column data or in the design formula.

The test for differences across treatment, controlling for baseline is accomplished with a standard interaction design (although read below for a modification of this): ~ condition + time + condition:time, and then you would test for the conditiontreatment.timeposttreatment effect, using the 'name' argument of results(). Check the vignette on how to ensure that control and baseline are the reference levels of the factors using relevel().

To control for the differences across subject, if I am reading your description correctly, you can just do this by adding a new column to the colData:

dds$subject.nested <- factor(c(1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5))

In this column, subject 6 will be coded the same as subject 1, with a 1. You can do this by adding a new term to the design:

design(dds) <- ~ condition + time + condition:subject.nested + condition:time

This controls for subject differences within each condition. 

dds <- DESeq(dds, betaPrior=FALSE) # for designs with interaction terms
ADD COMMENT
0
Entering edit mode

Michael,

Thank you for the detailed answer to this query.  This is perfectly answers my question.

ADD REPLY
0
Entering edit mode

Hello Michael,

Thanks a lot for your clear answer. I am just wondering if the two approaches y.post~y.baseline+condition.post and y~condition+time+condition:subject.nested+condition:time are equivalent ? Indeed, if I simply run the two linear models and I compare the pvalues for condtion at the post treatment timr I do not obtain the same result (pvalue=0.57 for the fisrt approach and pvalue=0.55 for the second one) :

sujet<-factor(rep(1:10,each=2))
condition<-factor(rep(c("ctl","tt"),each=10))
time<-factor(rep(c("pre","post"),times=10))
set.seed(1)
y<-rnorm(20)
mod1<-lm(y[time=="post"]~y[time=="pre"]+condition[time=="post"])
anova(mod1)
subject.nested<-factor(rep(rep(1:5,each=2),times=2))
mod2<-lm(y~condition+time+condition:subject.nested+condition:time)
library(emmeans)
emmeans(mod2,pairwise~condition:time,adjust="none")

Why don't we get the same results ? Is it reasonable to replace the first approach by the second one? Thanks in advance,

sessionInfo() R version 3.5.2 (2018-12-20) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 15063)

Matrix products: default

locale: [1] LCCOLLATE=FrenchFrance.1252 LCCTYPE=FrenchFrance.1252
[3] LCMONETARY=FrenchFrance.1252 LCNUMERIC=C
[5] LC
TIME=French_France.1252

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] emmeans1.4.3.01 DESeq21.22.2 SummarizedExperiment1.12.0 [4] DelayedArray0.8.0 BiocParallel1.16.6 matrixStats0.54.0
[7] Biobase2.42.0 GenomicRanges1.34.0 GenomeInfoDb1.18.2
[10] IRanges
2.16.0 S4Vectors0.20.1 BiocGenerics0.28.0

loaded via a namespace (and not attached): [1] bit640.9-7 splines3.5.2 Formula1.2-3
[4] assertthat
0.2.1 latticeExtra0.6-28 blob1.1.1
[7] GenomeInfoDbData1.2.0 pillar1.4.1 RSQLite2.1.1
[10] backports
1.1.4 lattice0.20-38 glue1.3.1
[13] digest0.6.19 RColorBrewer1.1-2 XVector0.22.0
[16] checkmate
1.9.3 colorspace1.4-1 plyr1.8.4
[19] htmltools0.3.6 Matrix1.2-15 XML3.98-1.20
[22] pkgconfig
2.0.2 genefilter1.64.0 zlibbioc1.28.0
[25] mvtnorm1.0-10 purrr0.3.2 xtable1.8-4
[28] scales
1.0.0 htmlTable1.13.1 tibble2.1.3
[31] annotate1.60.1 ggplot23.2.0 nnet7.3-12
[34] lazyeval
0.2.2 survival2.43-3 magrittr1.5
[37] crayon1.3.4 estimability1.3 memoise1.1.0
[40] foreign
0.8-71 tools3.5.2 data.table1.12.2
[43] stringr1.4.0 munsell0.5.0 locfit1.5-9.1
[46] cluster
2.0.7-1 AnnotationDbi1.44.0 compiler3.5.2
[49] rlang0.3.4 grid3.5.2 RCurl1.95-4.12
[52] rstudioapi
0.10 htmlwidgets1.3 bitops1.0-6
[55] base64enc0.1-3 gtable0.3.0 DBI1.0.0
[58] R6
2.4.0 gridExtra2.3 knitr1.25
[61] dplyr0.8.1 bit1.1-14 Hmisc4.2-0
[64] stringi
1.4.3 Rcpp1.0.1 geneplotter1.60.0
[67] rpart4.1-13 acepack1.4.1 coda0.19-2
[70] tidyselect
0.2.5 xfun_0.7

ADD REPLY
0
Entering edit mode

Those don’t look the same to me. I’d recommend discussing your analysis plan with a statistician.

Note in your first model you include (random) data as a fixed (nonrandom) covariate, and only perform the LM on half the data. This is very different from the second model.

ADD REPLY
0
Entering edit mode

Hi Michael,

When you say:

The test for differences across treatment, controlling for baseline is accomplished with a standard interaction design (although read below for a modification of this): ~ condition + time + condition:time, and then you would test for the conditiontreatment.timeposttreatment effect, using the 'name' argument of results().

I was wondering if the following contrast design would be equivalent (to test for differences across treatment, controlling for baseline)? contrast=list(c("conditiontreatment_vs_conditioncontrol, "conditiontreatment.timeposttreatment"))

Many thanks, Vincent

ADD REPLY
0
Entering edit mode

The first term is the difference at time=0 and usually we don't want to add that in.

ADD REPLY
0
Entering edit mode

Thanks Michael!

But I'm a bit confused, "controlling for baseline" and "take into account difference at time=0" are not the same?

ADD REPLY
0
Entering edit mode

But you specifically do not want to include that term, to “control for baseline”.

ADD REPLY
0
Entering edit mode
Nick • 0
@nick-7728
Last seen 9.6 years ago
United States

Steve, Simon and Michael,

Thank you for the quick responses.  My apologies for not being clear with respect to what exactly I was asking. 

Yes, I have 2 libraries for each of the 10 subjects. Libraries were generated for each subject (n=10) at baseline and then again at the end of the experiment (n=10) (I am calling this post-treatment).  For a total of 20 libraries. A total of 5 subjects were assigned to treatment and the other 5 subjects served as controls. I want to assess differential gene expression at post-treatment controlling for pre-treatment expression levels. An example of the column data I might envision would be something like the following, but for each of 200 genes.

subject condition baseline_expression_gene1

1  treatment 100 (numeric value reflecting baseline read count)

2 control 50

3 treatment 20

 

Where I could set up a model that would  "adjust" or account for differences in baseline expression (t the extent that I can) when estimating the treatment effect (i.e include baseline expression as a model covariate). Something like: formula (~ baseline_count + condition)

 

The long form data set up as shown by Michael above would look like

subject condition time

1 treatment baseline

1 treatment post-treatment

 2 control baseline

2 control post-treatment

 

Thus, I do not think I can estimate the treatment effect if setting this up as a paired design?

Thanks again and in advance for your assistance in this matter

 

     

ADD COMMENT

Login before adding your answer.

Traffic: 695 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6