design matrix in GLM
1
0
Entering edit mode
@myprogramming2016-9741
Last seen 4.2 years ago

Hi,

I am looking for differentially expressed genes between three different groups.

What is the preferred method of design matrix in GLM from the following?

design<-model.matrix(~0+group,data=y$samples) or design<-model.matrix(~group,data=y$samples)

Thanks

edger • 1.1k views
ADD COMMENT
3
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 5 hours ago
The city by the bay

The models are the effectively the same. The only difference lies in how you set up the contrasts. With the first model, the coefficients represent the average expression in each group. Thus, you'll have to use makeContrasts to set up comparisons between groups, and supply that as contrast in glmLRT. With the second model, the coefficients represent log-fold changes of particular groups over the group chosen as the intercept. As such, you can drop them directly with coef if you want to compare to the intercept group. (Of course, if you want to compare two non-intercept groups, then you'll still have to use makeContrasts.)

Provided you perform the same DE comparison, both parametrizations should give you identical results. I find the first model (i.e., the intercept-free approach) a bit easier to interpret in general, but that's my personal taste.

ADD COMMENT
0
Entering edit mode

Thanks Aaron. I checked both the designs. They are yielding similar output.

Secondly, I have seen in the example case studies in the edgeR manual that they have used glmQFit().

I am just wondering whether I should use glmFit() or glmQFit()?  How do I decide?

I have data in three biological replicates with the different groups.

I am using following codes to identify DEG in three different groups. Please comment on the codes.

1. x<-read.delim("test.txt",header=T,sep="\t",row.names=1)
2. y<-DGEList(counts=x,group=group)
3. keep<-rowSums(cpm(y)>3)>=3
4. y<-y[keep,,keep.lib.sizes=FALSE]
5. y <- calcNormFactors(y,method ="TMM" )
6. design<-model.matrix(~0+group,data=y$samples) 7. colnames(design)<-levels(y$samples\$group)
8. y<-estimateGLMCommonDisp(y,design)
9. y<-estimateGLMTrendedDisp(y,design)
10. y<-estimateGLMTagwiseDisp(y,design)
11. fit<-glmFit(y,design)
12. BvsA<-makeContrasts(B-A,levels=design)
13. lrt_BvsA <-glmLRT(fit,contrast=BvsA)
14. topTags(lrt_LvsE)

Thanks

ADD REPLY
1
Entering edit mode

The code looks fine. You could replace steps 8-10 with a single estimateDisp call, as that's the newer function. As for the difference between glmFit and glmQLFit - the latter estimates a quasi-likelihood dispersion, which allows downstream tests to better account for uncertainty in dispersion estimation. The standard glmFit + glmLRT pipeline treats the estimated dispersions as true values, which isn't totally accurate.

ADD REPLY
0
Entering edit mode

Thanks for your comments on the code.

You mean I should use glmQLFit() instead of glmFit + glmLRT. I am not quite sure. Please guide me

1. fit <- glmQLFit(y, design, robust=TRUE)
2. qlf <- glmQLFTest(fit)
3. topTags(qlf)

In addition, I would like to subset DE data using 5% FDR and log-fold change cut-off. Could you please suggest some code.

ADD REPLY
1
Entering edit mode

You can use either pipeline. I prefer to use glmQLFit + glmQLFTest, as it provides more accurate type I error control. As to your other question, I would suggest using glmTreat rather than filtering on the log-fold change.

ADD REPLY
0
Entering edit mode

Thanks for your help.

ADD REPLY

Login before adding your answer.

Traffic: 701 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6