adjusting for confounding effects in edgeR
Victoria

I have a set of RNA-seq data from human tissue and want to compare gene expression between cases and controls using edgeR. If I want to adjust for the confounding effects of RIN, age and PMI, what is the correct way to do this? Here is my code:

library(edgeR)
x <- read.delim("rawCountMatrix.txt", row.names = "Gene")      # raw count matrix, genes as rows
group <- factor(c(1,2,2,2,2,2,1,1,1,1))                        # case/control status
RIN <- c(5.6,7.6,6.7,5.6,7.6,6.4,6.8,7.7,5.7,8.3)              # RNA integrity number
Age <- c(106,89,76,82,87,87,73,77,78,88)                       # age of each donor
PMI <- c(20.58,15.16,20.17,19.16,21.66,22,22,13.5,12.83,14.5)  # post-mortem interval

y <- DGEList(counts = x, group = group)
y <- calcNormFactors(y)                             # TMM normalization factors
design <- model.matrix(~group + RIN + Age + PMI)    # additive model: group effect plus covariates
y <- estimateDisp(y, design)

fit <- glmFit(y, design)
lrt <- glmLRT(fit)    # by default this tests the last coefficient in the design (here PMI)

Do I need to specify the coefficient in the glmLRT() function? I am not sure what the coefficient means, or what happens if I change it from 2 to 3 or to 2:4, e.g. lrt <- glmLRT(fit, coef = 2).

I appreciate any comments.

Aaron Lun

An additive model is appropriate. The various factors/covariates don't seem to have strong correlations with each other, so the interpretation of the coefficients in the design is fairly straightforward.

  • The intercept is the average expression in group 1. (Technically, it is the expected expression of a hypothetical sample with zero RIN, zero age and zero PMI. While that probably doesn't make any biological sense, we don't directly interpret this coefficient anyway, so we can just ignore this absurdity.)
  • group2 is the log-fold change in the average expression of group 2 over group 1.
  • RIN is the log-fold change per unit of increase in the RIN. Same for age and PMI.

It seems pretty obvious which coefficient is of most interest here, and so you should specify it with coef=.
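
For example, something along these lines (a sketch only; the name of the group coefficient depends on your factor levels, so check colnames(design) rather than assuming it is "group2"):

colnames(design)                       # should show something like "(Intercept)" "group2" "RIN" "Age" "PMI"
lrt <- glmLRT(fit, coef = "group2")    # equivalently coef = 2; case vs control, adjusted for RIN, age and PMI
topTags(lrt)                           # top genes for that comparison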

P.S. Use glmQLFit and glmQLFTest instead.
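
In code, that is a drop-in replacement for the glmFit/glmLRT calls (again a sketch, reusing the design and coefficient above):

fit <- glmQLFit(y, design)                 # quasi-likelihood dispersion estimation and fit
qlf <- glmQLFTest(fit, coef = "group2")    # QL F-test for the group coefficient
topTags(qlf)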

Thanks a lot for your reply, Aaron.

  • Can I also add the batches in my additive model like this? (The batches are different dates of RNA extraction from tissue.)

    Batch <- factor(c("a","a","b","a","a","c","a","c","a","c"))
    design <- model.matrix(~group + Age + RIN + PMI + Batch)

If yes, is it the best way to consider batch effects for a well designed study?

  • I used glmLRT() mainly because the edgeR user's guide recommends the LRT when there are no replicates, and my data come from human tissue without replicates. The results also make more sense with regard to the function of the resulting DEGs.

Why do you think I need to use the QL F-test instead?

thanks again,

Can I also add the batches in my additive model like this?

Yes.

If yes, is it the best way to consider batch effects for a well designed study?

Yes, but the design of this study has some room for improvement. The sole sample in batch B is effectively useless.

Why do you think I need to use the QL F-test instead?

Your data has replicates. Why do you think you don't have replicates? If you didn't, estimateDisp wouldn't work.

Thanks again for your response.

I understand that batch B is not useful; however, we had to remove some other samples from this analysis (the original study was designed for a different purpose than this analysis). With this design, do you think an additive model is enough for batch-effect correction? How can I check my data to be fairly confident that there are no batch effects? (In my MDS and PCA plots, the two groups do not cluster; should I be concerned about that in brain transcriptome data?)

By "no replicates", I meant that these samples come from different individuals, as opposed to cell line experiments where we actually have replicates for every condition. Can I ask why you think I should use the QL F-test? I tried it, but it gives me zero DEGs.

Best,

With this design, do you think an additive model is enough for batch-effect correction?

You don't have much of a choice. The only alternative would be to combine batch and group into a single factor by pasting them together, and then compare coefficients 1a to 2a and 1c to 2c. This would handle non-additive batch effects but sacrifices a residual d.f., and you don't have many of those left.
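
A rough sketch of how that could look with your objects (the coefficient names combined1a, combined2a and so on are assumptions about how model.matrix will name the columns of a no-intercept design; check colnames(design2) before writing the contrasts):

combined <- factor(paste0(group, Batch))                   # "1a", "2a", "2b", "2c", ...
design2 <- model.matrix(~0 + combined + RIN + Age + PMI)   # one coefficient per group:batch combination
y2 <- estimateDisp(y, design2)
fit2 <- glmQLFit(y2, design2)
con <- makeContrasts(inA = combined2a - combined1a,        # case vs control within batch a
                     inC = combined2c - combined1c,        # case vs control within batch c
                     levels = design2)
resA <- glmQLFTest(fit2, contrast = con[, "inA"])
resC <- glmQLFTest(fit2, contrast = con[, "inC"])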

How can I check my data to be fairly confident that there are no batch effects?

See whether including the batch term in the design matrix changes the results. If it does, then there is probably a relevant batch effect that needs to be modeled properly.
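
For example (a sketch only, with "group2" assumed to be the name of the group coefficient):

design.nobatch <- model.matrix(~group + RIN + Age + PMI)
design.batch <- model.matrix(~group + RIN + Age + PMI + Batch)
fit.nb <- glmQLFit(estimateDisp(y, design.nobatch), design.nobatch)
fit.b <- glmQLFit(estimateDisp(y, design.batch), design.batch)
res.nb <- topTags(glmQLFTest(fit.nb, coef = "group2"), n = Inf)$table
res.b <- topTags(glmQLFTest(fit.b, coef = "group2"), n = Inf)$table
sum(res.nb$FDR < 0.05)    # DE genes without the batch term
sum(res.b$FDR < 0.05)     # DE genes with the batch term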

In my MDS and PCA plots, the two groups do not cluster; should I be concerned about that in brain transcriptome data?

It just means that you don't have strong differential expression between groups.

By "no replicates", I meant that these samples come from different individuals, as opposed to cell line experiments where we actually have replicates for every condition.

Individuals are replicates in your study. Think about it. If you (or someone else) were to repeat the experiment, how would you collect new samples? You probably wouldn't collect RNA from the same set of humans, you would go round up a new set of humans to try to independently reproduce the results of your first experiment. Thus, each individual is a replicate, and this is correct because you want your analysis to yield conclusions about the population as a whole, not just the actual subset of humans that you happened to sample.

By comparison, if you are using cell lines, an appropriate replicate would be a new passage/culture of the same cell line. This is because, if you were to repeat the experiment, you would just get the cells out of the fridge, grow them up again and extract the RNA. This level of replication allows you to make conclusions about the specific cell line, which is kind of useful if the cell line is meant to be a model system for something interesting. You cannot, however, make statements about other cell lines. See also my comments here about the problems with considering different cell lines as "replicates" of each other.

Can I ask why you think I should use the QL F-test?

Because it's more accurate. LRT yields more false positives. If you're seeing lots of DE genes with the LRT and few genes with the QL F-test, the latter is probably correct.

I tried but it gives me zero DEGs.

You seem to dislike the prospect of getting no significant DE genes. Unless you have a solid reason for this (e.g., you're comparing WT to KO and you don't see the KO'd gene in the DE set), you should be more open-minded about the outcome of the DE analysis.

Let me put it this way. You can either listen to what the analysis is trying to tell you right now, and accept that there is no strong DE in your data; or you can throw money down the drain by trying to follow up on weak targets, and then accept that there is no strong DE in your data.

Thanks a lot for your response and detailed explanation!
