I have a set of RNA-Seq data from human tissue and want to compare the gene expression in cases and controls using EdgeR. If I want to adjust for the confounding effect of RIN, age and PMI; what is the correct way to do this? Here is my code:
library(edgeR)
x <- read.delim("rawCountMatrix.txt", row.names = "Gene")
group <- factor(c(1,2,2,2,2,2,1,1,1,1))
RIN <- c(5.6,7.6,6.7,5.6,7.6,6.4,6.8,7.7,5.7,8.3)
Age <- c(106,89,76,82,87,87,73,77,78,88)
PMI <- c(20.58,15.16,20.17,19.16,21.66,22,22,13.5,12.83,14.5)
y <- DGEList(counts=x,group=group)
y <- calcNormFactors(y)
design <- model.matrix(~group + RIN + Age + PMI)
y <- estimateDisp(y,design)
fit <- glmFit(y,design)
lrt <- glmLRT(fit)
Do I need to specify coefficient in the glmLRT() function? I am not sure what coefficient mean, if I change it from 2 to 3 or 2:4? lrt <- glmLRT(fit, coef = 2)
I appriciate any comments
Thanks a lot for your reply Aaron,
If yes, is it the best way to consider batch effects for a well designed study?
Why do you think I need to use QFL instead?
thanks again,
Yes.
Yes, but the design of this study has some room for improvement. The sole sample in batch B is effectively useless.
Your data has replicates. Why do you think you don't have replicates? If you didn't,
estimateDisp
wouldn't work.Thanks again for your response.
I understand that the batch B is not useful, however we had to remove some other samples in this analysis (as the original study was different from this analysis). With this design, do you think additive model is enough for batch effect correction? How can I check my data to be fairly confident that there are no batch effects? (in my MDS and PCA plots, the the two groups do not cluster, should I be concerned about that in a brain Transcriptome data?)
By no replicates, I meant these samples come from different individuals, compared to cell line experiments where we actually have replicates for every condition. Can I ask why you think I should use QFL? I tried but it gives me zero DEGs.
Best,
You don't have much of a choice. The only alternative would be to combine
batch
andgroup
into a single factor bypaste
ing them together, and then compare coefficients1a
to2a
and1c
to2c
. This would handle non-additive batch effects but sacrifices a residual d.f., and you don't have many of these left.See whether including the batch term in the design matrix changes the results. If it does, then there is probably a relevant batch effect that needs to be modeled properly.
It just means that you don't have strong differential expression between groups.
Individuals are replicates in your study. Think about it. If you (or someone else) were to repeat the experiment, how would you collect new samples? You probably wouldn't collect RNA from the same set of humans, you would go round up a new set of humans to try to independently reproduce the results of your first experiment. Thus, each individual is a replicate, and this is correct because you want your analysis to yield conclusions about the population as a whole, not just the actual subset of humans that you happened to sample.
By comparison, if you are using cell lines, an appropriate replicate would be a new passage/culture of the same cell line. This is because, if you were to repeat the experiment, you would just get the cells out of the fridge, grow them up again and extract the RNA. This level of replication allows you to make conclusions about the specific cell line, which is kind of useful if the cell line is meant to be a model system for something interesting. You cannot, however, make statements about other cell lines. See also my comments here about the problems with considering different cell lines as "replicates" of each other.
Because it's more accurate. LRT yields more false positives. If you're seeing lots of DE genes with the LRT and few genes with the QL F-test, the latter is probably correct.
You seem to dislike the prospect of getting no significant DE genes. Unless you have a solid reason for this (e.g., you're comparing WT to KO and you don't see the KO'd gene in the DE set), you should be more open-minded about the outcome of the DE analysis.
Let me put it this way. You can either listen to what the analysis is trying to tell you right now, and accept that there is no strong DE in your data; or you can throw money down the drain by trying to follow up on weak targets, and then accept that there is no strong DE in your data.
Thanks a lot for your response and detailed explanation!