Hello everyone
I am Yogender gowtham and I am a graduate student from Clemson University (USA). I am working on understanding the transcriptome of Chinese hamster ovary (CHO) cells using RNA-seq. To perform differential gene expression analysis between my treatments, I am using edgeR software package.
Since I have multiple factors in my treatments, I am using GLM to perform differential gene expression analysis. However I have few questions related to the likelihood ratio test and the likelihood ratio in the output.
1) From what I understand, the likelihood ratio tests compares two models - a full model and a reduced model to see which one fits better. Could someone please tell me what would be the null and alternate hypothesis for this test?
2) Below is a typical output from my pairwise comparison.
logFC | logCPM | LR | PValue | FDR | |
gene23551 | 3.680215 | 7.995512 | 4263.779 | 0 | 0 |
It looks like greater the likelihood ratio, lower is the p-value. Is the LR value a estimate from likelihood ratio test? What exactly is it and how does it relate to p-value?
Thank you
To answer your first question, we need to know what design matrix you've used, and what parameters you used to run
glmLRT
(namely, what you specified in thecoef
orcontrast
arguments).Thank you for your reply Aaron. I have five different conditions with two factors: Cell lines (DP12 and DH) and serum level (0, 5 and 10).
The design matrix I used is:
design <- model.matrix(~0+group)
rownames(design) <- samples$file
design
colnames(design)
#[1] "groupDH_10" "groupDH_5" "groupDP12_0" "groupDP12_10" "groupDP12_5"
This was used to call the contrasts:
designcontrasts<- makeContrasts(
DH.10v5 = groupDH_5 - groupDH_10,
DP.10v5 = groupDP12_5- groupDP12_10,
DH.5vDP5 = groupDP12_5 - groupDH_5,
DH.10vDP10 = groupDP12_10 - groupDH_10,
DP.10v0 = groupDP12_0 - groupDP12_10,
DP.5v0 = groupDP12_0 - groupDP12_5,
levels=design)
Please let me know if you need more details.
Thanks
Two things: first. you need to show how you called
glmLRT
in addition to the above code. Second, Based on the names of your contrasts, I'm guessing you probably don't want to test all of them at once, but rather you want to test them individually one at a time, which would require six separate calls toglmLRT
That is correct. I did six separate calls to glmLRT. For examples:
# glmLRT
#DHFR 5% vs 10%
lrt<- glmLRT(glmfit, contrast = designcontrasts[, "DH.10v5"])
# toptags
result <- topTags(lrt, n = Inf, sort.by="PValue")
head(result)
de <- rownames(result[result$table$FDR<0.01,])
head(result$table)
sum(result$table$FDR<=0.01)#4792
write.table(result$table, file = "DHFR 5vs10_new.txt", sep="\t")
Could you please explain in detail what the coefficients actually are, with respect to the above example of contrast between DHFR_5 - DHFR_10. Please let me know if you need more details with regards to code.
Thanks
Each coefficient of your
design
matrix represents the log-average count for one of your groups. You can figure which ones are which fairly easily from the column names of the matrix.Now, for the comparison above, the contrast vector of
designcontrasts[, "DH.10v5"]
probably looks likec(-1, 1, 0, 0, 0)
. Each entry of the vector corresponds to a coefficient of the design matrix. The specified contrast is effectively saying:That's your null hypothesis, i.e., that DH_10 is equal to DH_5. Then,
glmLRT
internally refactorsdesign
into an alternative formulation so that, upon dropping the last coefficient of this alternative model, the reduced model is formed in which your null hypothesis is true. The dropped coefficient represents the log-fold change between DH_10 and DH_5.Hello Ryan and Aaron
I was reviewing the comments and I noticed your comment on calling the contrast at once. Could you please tell how I could do it all at once, using the above example as reference? And would having six separate calls to glmLRT affect the statistical power?
The way I have been calling them is below:
To drop them all at once:
This does an ANOVA-like comparison, where the "global" null hypothesis is that there are no differences between any of the groups. While this will generally detect more DE genes than running them as separate pairwise contrasts - subtle changes across multiple contrasts provide more evidence against the global null, and there's less damage from the multiple testing correction when you just do a single ANOVA compared to 6 separate contrasts - it's not really a fair comparison, because the null hypotheses are different between the two approaches.
Long story short; if you want to find DE between any of your groups, use the ANOVA. If you want to find DE between two specific groups, use the appropriate pairwise comparison. The ANOVA isn't the be-all and end-all of DE detection, as it won't tell you which of the groups are involved in the DE for each gene, just that there is DE "somewhere" in the data. (Technically, you can dig out the differences with post hoc tests, but that's not supported in edgeR. Besides, if you're going to do that, you might as well do the pairwise comparisons to start with.)