Question: Controlling numerical covariates: differential gene expression with limma voom
0
7 months ago by
Onyi Ukay0 wrote:

Here is a snapshot of a data.frame, df

Subject | Cell_type | Condition | Viral_load
--------|-----------|-----------|------------
1       | A         | normal    | 10
1       | B         | normal    | 180
2       | A         | diseased  | 500
2       | B         | diseased  | 90
3       | A         | normal    | 720
3       | B         | normal    | 63

I would like to find differentially expressed genes in the following comparisons:
i. A.normal vs A.diseased
ii. B.normal vs B.diseased
iii. A.normal vs B.normal
iv. A.diseased vs B.diseased

What is the right way to incorporate into my analysis an interactive design formula for Cell_type and Condition, while blocking for the Subject and controlling for Viral_load covariate?

My idea is to add a supplementary column to df called Group,

Subject | Cell_type | Condition | Viral_load | Group
--------|-----------|-----------|------------|-------
1       | A         | normal    | 10         | A.normal
1       | B         | normal    | 180        | B.normal
2       | A         | diseased  | 500        | A.diseased
2       | B         | diseased  | 90         | B.diseased
3       | A         | normal    | 720        | A.normal
3       | B         | normal    | 63         | B.normal

Then run the following analysis:

design <- model.matrix(~ 0 + Group + Viral_load, data = df)
colnames(design) <- c(levels(df$Group), "Viral_load") v <- voom(TMM, design = design) dupcor <- duplicateCorrelation(v, design = v$design, block = df$Subject) fit <- lmFit(v, block = df$Subject, correlation = dupcor$consensus.correlation) cm <- makeContrasts(A.normal_vs_diseased = A.normal - A.diseased, B.normal_vs_diseased = B.normal - B.diseased, normal.A_vs_B = A.normal - B.normal, diseased.A_vs_B = A.diseased - B.diseased, levels = v$design)
fit <- contrasts.fit(fit, cm)
fit <- eBayes(fit)

Have I accurately controlled for Viral_load as a covariate, blocked by Subject and extracted the appropriate contrasts?

modified 7 months ago by mikhael.manurung90 • written 7 months ago by Onyi Ukay0

Hi, I would really appreciate it if you upvote my answer if you find it helpful :)

Answer: Controlling numerical covariates: differential gene expression with limma voom
2
7 months ago by
mikhael.manurung90 wrote:

Hi Onyi,

Your design and contrast matrices are correct.

I am not sure about your variable naming convention, but please make sure the TMM input variable to your voom command is the expression set.

Last but not least, it is now recommended to do the duplicateCorrelation calculation twice each. Also keep in mind to include the block to your voom call! Here is an example:

v <- voom(TMM, design = design, block=df$Subject) dupcor <- duplicateCorrelation(v, design = v$design, block = df$Subject) v <- voom(TMM, design = design, correlation=dupcor$consensus.correlation, block=df$Subject) dupcor <- duplicateCorrelation(v, design = v$design, block = df$Subject) v <- voom(TMM, design = design, correlation=dupcor$consensus.correlation, block=df\$Subject)


Further discussion on this can be seen at using duplicateCorrelation with limma+voom for RNA-seq data

Best, Mikhael

I'm not technically a limma developer, but I hope my reputation is high enough for you.

As Mikhael said, your design matrices and contrasts are correct. I will just note that edgeR uses log-link GLMs, which means that your model assumes that log-expression scales linearly with viral load (i.e., raw expression increases exponentially with viral load). Perhaps a more reasonable relationship for most genes would be that log-expression scales linearly with log-viral load. Or you could be even more general and use a spline on the (log-)viral load, allowing you to block out any smooth relationship with viral load. All of this assumes, of course, that the viral load is not confounded with your conditions of interest, which will cause problems of varying levels to the different blocking strategies mentioned above.

The other thing I should point out is that the TMM should typically be a DGEList that has been run through calcNormFactors. If you give voom a raw matrix of counts, it won't do any normalization by default (aside from library size normalization when the log-CPMs are computed). You can set normalize.method but the available choices tend to be unnecessarily aggressive, being designed to deal with the horrors of microarrays.

Thanks for affirming Mikhael's answer Aaron.

TMM is indeed a DGEList, defined as:

dge  <- DGEList(counts=counts)
keep <- filterByExpr(dge, design, min.count=5)
dge  <- dge[keep,,keep.lib.sizes=FALSE]
TMM  <- calcNormFactors(dge)

Thank you for also pointing out a different problem that I could have. On the suggested more general case, what exactly do you mean by "use a spline on the log-viral_load?" Does using a spline take care of the unknown relationship between log-cpm values and log-viral_load, or does it simply "block out any smooth relationship with viral load" [I assume you mean linear relationship here]?