Question

EdgeR returns small logFC values when using GLM approach

0

Entering edit mode

gergely.zsuzsa.n • 0

@gergelyzsuzsan-10076

Last seen 8.0 years ago

Dear edgeR user community,

I have RNA-seq data downloaded from TCGA from 7 patients, both lung tumor and normal samples. At first I used edgeR with the GLM approach to account for patient differences, but the logFC values seemed too small. Min and max are -1.03 and 0.84. Then I tried the classic approach (with the same data, same filtering, etc.), and I had logFC values larger by a factor of ~10. Min and max are -10.55 and 9.6. DE genes are pretty consistent between the two approaches, common genes are 91% and 82% of all DE genes, respectively.

Also, with the GLM approach it was calculated that BCV = 0.4879. Is that too high? I think it is close enough to 40%, but I'm curious about other opinions.

Classic approach:

library(edgeR)
rawdata <- read.delim("matrix.txt", stringsAsFactors=FALSE, check.names=FALSE)
targets <- readTargets("targets.csv")
rownames(targets) <- targets$ID
targets <- targets[c("LUAD_09_11", "LUAD_16_11", "LUAD_19_11", "LUAD_28_11", "LUAD_36_11", "LUAD_48_11", "LUAD_57_11", "LUAD_09_01", "LUAD_16_01", "LUAD_19_01", "LUAD_28_01", "LUAD_36_01", "LUAD_48_01", "LUAD_57_01"),]
y <- DGEList(counts=round(rawdata[,c("LUAD_09_11", "LUAD_16_11", "LUAD_19_11", "LUAD_28_11", "LUAD_36_11", "LUAD_48_11", "LUAD_57_11", "LUAD_09_01", "LUAD_16_01", "LUAD_19_01", "LUAD_28_01", "LUAD_36_01", "LUAD_48_01", "LUAD_57_01")]), group=targets$Group, genes=rawdata[,1])
keep <- rowSums(cpm(y)>0.5) >= 7
nrow(y)
y <- y[keep,]
nrow(y)
y <- calcNormFactors(y)
y <- estimateCommonDisp(y)
y <- estimateTagwiseDisp(y)
et <- exactTest(y)
de <- decideTestsDGE(et)
summary(de)

GLM approach:

library(edgeR)
rawdata <- read.delim("matrix.txt", stringsAsFactors=FALSE, check.names=FALSE)
targets <- readTargets("targets.csv")
rownames(targets) <- targets$ID
targets <- targets[c("LUAD_09_11", "LUAD_16_11", "LUAD_19_11", "LUAD_28_11", "LUAD_36_11", "LUAD_48_11", "LUAD_57_11", "LUAD_09_01", "LUAD_16_01", "LUAD_19_01", "LUAD_28_01", "LUAD_36_01", "LUAD_48_01", "LUAD_57_01"),]
y <- DGEList(counts=round(rawdata[,c("LUAD_09_11", "LUAD_16_11", "LUAD_19_11", "LUAD_28_11", "LUAD_36_11", "LUAD_48_11", "LUAD_57_11", "LUAD_09_01", "LUAD_16_01", "LUAD_19_01", "LUAD_28_01", "LUAD_36_01", "LUAD_48_01", "LUAD_57_01")]), group=targets$Group, genes=rawdata[,1])
keep <- rowSums(cpm(y)>0.5) >= 7
nrow(y)
y <- y[keep,]
nrow(y)
y$samples$lib.size <- colSums(y$counts)
rownames(y$counts) <- rownames(y$genes)
y <- calcNormFactors(y)
design <- model.matrix(~targets$Patient+targets$Group)
rownames(design) <- colnames(y)
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
o <- order(lrt$table$PValue)
summary(de <- decideTestsDGE(lrt))

Session info:

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] edgeR_3.4.2   limma_3.18.13

loaded via a namespace (and not attached):
[1] tools_3.1.3

Do you have any explanation why is this possible? Which approach should I use? Any thoughts?

Thank you,

Zsuzsa

edger glm logfc • 1.7k views

ADD COMMENT • link 8.0 years ago gergely.zsuzsa.n • 0

2

Entering edit mode

It's hard to say anything about your analysis, given that you haven't shown the code that you've used. Please edit your answer to include the relevant lines of code you used to get to the figures above.

ADD REPLY • link 8.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks for the notice, I have included codes in my question.

ADD REPLY • link 8.0 years ago gergely.zsuzsa.n • 0

score 5 · Accepted Answer · 2016-04-11

5

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 4 hours ago

WEHI, Melbourne, Australia

The classic and glm edgeR pipelines produce identical logFC values, except as influenced by slight differences in the dispersion values. The differences you report between the pipelines are so unlikely as to be essentially impossible, so I suspect that you have not run the two pipelines in the same way.

Edit 12 hours later:

Now that you have added codes, we can confirm that the two analyses are different. The glm analysis adjusts for patient differences whereas the classic analysis doesn't.

Does 'group' take on only two values? If there are more than two groups, then the classic pipeline compares the first two groups by default whereas the glm pipeline compares the last group to the first by default.

Later again:

As Aaron has hinted, you need use factors in a model formula:

patient <- factor(targets$patient)
group <- factor(targets$group)
design <- model.matrix(~patient+group)

ADD COMMENT • link 8.0 years ago Gordon Smyth 50k

0

Entering edit mode

Thank you for your answer. I have edited my question to include the codes. I used identical matrix.txt and targets.csv files in both cases.

ADD REPLY • link 8.0 years ago gergely.zsuzsa.n • 0

0

Entering edit mode

In this sense, you can't produce the same analysis with these two different approaches, can you? Or if you construct a design matrix like model.matrix(~targets$Group), then it is the same analysis as the classic?

Intuitively, I would have thought that adjusting for patient differences might have meant larger logFCs, not smaller. Based on the codes, are the differences right?

Yes, group takes on two values.

Thanks in advance,

Zsuzsa

ADD REPLY • link 8.0 years ago gergely.zsuzsa.n • 0

2

Entering edit mode

No, it's not possible to do a paired-sample analysis with the classic approach, you can only use group-based designs. If you use a group-based design in the GLM approach, it will be conceptually similar to the classic approach; however, the specifics will vary quite a bit, due to the differences in the underlying statistics (e.g., LRT vs exact tests, qCML vs CR-APL estimation). All in all, I would go with the GLM approach if you're seeing strong patient effects.

As to the cause of the differences in the log-fold changes; I would have expected that the log-fold change estimates be roughly similar (or at least, not differ by an order of magnitude) between the GLM and classic approaches, with or without the patient term. The only thing I can think of is that targets$Group should be a factor or a character vector. If it's some sort of numeric, then it'll be treated as a real-valued covariate in the GLM fit. This will reduce the log-fold change for large covariate values.

ADD REPLY • link 8.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you, using factors for the design matrix solved my problem (obviously I'm not an R expert...)! The logFCs now are pretty similar.

ADD REPLY • link 8.0 years ago gergely.zsuzsa.n • 0