Question

design matrix in GLM


myprogramming2016 • 0


Hi,

I am looking for differentially expressed genes between three different groups.

Which of the following is the preferred way to set up the design matrix for the GLM?

design<-model.matrix(~0+group,data=y$samples)

or

design<-model.matrix(~group,data=y$samples)

Thanks

edger • 1.8k views


score 3 · Answer 1 · 2016-04-15


Aaron Lun ★ 28k


The two models are effectively the same. The only difference lies in how you set up the contrasts. With the first model, the coefficients represent the average expression in each group. Thus, you'll have to use makeContrasts to set up comparisons between groups, and supply the result as the contrast argument in glmLRT. With the second model, the first coefficient is the intercept, which represents the average expression of the reference group (the first level of the factor), and the remaining coefficients represent log-fold changes of the other groups over that reference group. As such, you can drop them directly with coef if you want to compare to the intercept group. (Of course, if you want to compare two non-intercept groups, then you'll still have to use makeContrasts.)

Provided you perform the same DE comparison, both parametrizations should give you identical results. I find the first model (i.e., the intercept-free approach) a bit easier to interpret in general, but that's my personal taste.
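To make the equivalence concrete, here is a minimal sketch in base R. It uses an ordinary linear model (lm.fit) on made-up values rather than edgeR's negative binomial GLM, so the numbers are purely illustrative; the point is only that both design matrices describe the same fit, and that the B-vs-A contrast in the intercept-free model equals the groupB coefficient in the intercept model. The group labels A, B, C and the response values are assumptions for the example.

```r
# Toy data: three groups with three replicates each (values are made up).
group <- factor(rep(c("A", "B", "C"), each = 3))
y <- c(5.0, 5.2, 4.9, 7.1, 6.8, 7.0, 3.0, 3.2, 2.9)

# Parametrization 1: no intercept, one coefficient per group average.
design0 <- model.matrix(~0 + group)
colnames(design0) <- levels(group)

# Parametrization 2: intercept = reference group A; the other
# coefficients are differences relative to A.
design1 <- model.matrix(~group)

fit0 <- lm.fit(design0, y)
fit1 <- lm.fit(design1, y)

# B vs A as a contrast on the group averages (intercept-free model) ...
BvsA0 <- sum(c(A = -1, B = 1, C = 0) * fit0$coefficients)
# ... and as a single coefficient (intercept model).
BvsA1 <- unname(fit1$coefficients["groupB"])

# C vs B needs a contrast in both parametrizations.
CvsB0 <- sum(c(0, -1, 1) * fit0$coefficients)
CvsB1 <- sum(c(0, -1, 1) * fit1$coefficients)

print(all.equal(BvsA0, BvsA1))
print(all.equal(CvsB0, CvsB1))
print(all.equal(fit0$fitted.values, fit1$fitted.values))
```

The same contrast vectors are what makeContrasts would build from the column names of the design matrix.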

ADD COMMENT • link 8.0 years ago Aaron Lun ★ 28k


Thanks Aaron. I checked both designs, and they yield similar output.

Secondly, I have seen that the example case studies in the edgeR manual use glmQLFit().

I am just wondering whether I should use glmFit() or glmQLFit()? How do I decide?

I have three biological replicates for the different groups.

I am using the following code to identify DE genes in the three groups. Please comment on the code.

x<-read.delim("test.txt",header=T,sep="\t",row.names=1)
y<-DGEList(counts=x,group=group)
keep<-rowSums(cpm(y)>3)>=3
y<-y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y,method ="TMM" )
design<-model.matrix(~0+group,data=y$samples)
colnames(design)<-levels(y$samples$group)
y<-estimateGLMCommonDisp(y,design)
y<-estimateGLMTrendedDisp(y,design)
y<-estimateGLMTagwiseDisp(y,design)
fit<-glmFit(y,design)
BvsA<-makeContrasts(B-A,levels=design)
lrt_BvsA <-glmLRT(fit,contrast=BvsA)
topTags(lrt_BvsA)

Thanks

ADD REPLY • link 8.0 years ago myprogramming2016 • 0


The code looks fine. You could replace steps 8-10 (the estimateGLMCommonDisp, estimateGLMTrendedDisp and estimateGLMTagwiseDisp calls) with a single estimateDisp call, as that's the newer function. As for the difference between glmFit and glmQLFit: the latter estimates a quasi-likelihood dispersion, which allows downstream tests to better account for uncertainty in dispersion estimation. The standard glmFit + glmLRT pipeline treats the estimated dispersions as true values, which isn't totally accurate.
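As a rough sketch of what that looks like, the following runs the estimateDisp + glmQLFit + glmQLFTest pipeline on simulated counts. The count matrix, the gene and sample names, and the group labels A, B, C are all invented for illustration; with real data you would start from your own count table and grouping factor.

```r
library(edgeR)  # also attaches limma, which provides makeContrasts()

# Simulated counts for illustration only: 1000 genes, 3 groups x 3 replicates.
set.seed(42)
group <- factor(rep(c("A", "B", "C"), each = 3))
counts <- matrix(rnbinom(1000 * 9, mu = 50, size = 10), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0(group, 1:3)))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y, method = "TMM")

design <- model.matrix(~0 + group, data = y$samples)
colnames(design) <- levels(y$samples$group)

# A single call estimates the common, trended and tagwise dispersions.
y <- estimateDisp(y, design)

# Quasi-likelihood fit, then an F-test for the B vs A comparison.
fit <- glmQLFit(y, design, robust = TRUE)
BvsA <- makeContrasts(B - A, levels = design)
qlf_BvsA <- glmQLFTest(fit, contrast = BvsA)
topTags(qlf_BvsA)
```

Note that glmQLFTest (like glmLRT) tests the last coefficient of the design if neither coef nor contrast is given. With an intercept-free design that is not a comparison between groups, so the contrast should be passed explicitly.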

ADD REPLY • link 8.0 years ago Aaron Lun ★ 28k


Thanks for your comments on the code.

Do you mean that I should use glmQLFit() + glmQLFTest() instead of glmFit() + glmLRT()? I am not quite sure, so please guide me.

fit <- glmQLFit(y, design, robust=TRUE)
qlf <- glmQLFTest(fit, contrast=BvsA)
topTags(qlf)

In addition, I would like to subset the DE results using a 5% FDR and a log-fold change cut-off. Could you please suggest some code?

ADD REPLY • link 8.0 years ago myprogramming2016 • 0


You can use either pipeline. I prefer to use glmQLFit + glmQLFTest, as it provides more accurate type I error control. As to your other question, I would suggest using glmTreat rather than filtering on the log-fold change.
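For reference, a minimal sketch of the glmTreat approach is below. Instead of filtering on the estimated log-fold change after testing, glmTreat tests whether the log-fold change is significantly beyond a threshold, and the 5% FDR cut-off is then applied to its p-values. The simulated counts, the 50 genes spiked as up-regulated in group B, and the threshold of log2(1.5) are all assumptions chosen for the example, not recommendations.

```r
library(edgeR)  # also attaches limma, which provides makeContrasts()

# Simulated counts for illustration only; the first 50 genes are
# made 8-fold higher in group B so that something is detectable.
set.seed(42)
group <- factor(rep(c("A", "B", "C"), each = 3))
mu <- matrix(50, nrow = 1000, ncol = 9)
mu[1:50, group == "B"] <- 400
counts <- matrix(rnbinom(length(mu), mu = mu, size = 10), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0(group, 1:3)))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~0 + group, data = y$samples)
colnames(design) <- levels(y$samples$group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust = TRUE)

# With an intercept-free design the comparison is given as a contrast.
BvsA <- makeContrasts(B - A, levels = design)

# Test against a log-fold-change threshold (1.5-fold here, an arbitrary choice).
tr <- glmTreat(fit, contrast = BvsA, lfc = log2(1.5))

# Full results table, then keep genes at 5% FDR.
res <- topTags(tr, n = Inf)$table
sig <- res[res$FDR < 0.05, ]
head(sig)
```

The threshold passed to lfc is usually kept modest, since glmTreat requires the fold change to be significantly above it rather than merely above it on average.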