Question

EdgeR DGElist - equal counts between treatment and control incorrectly give significant PValue using estimateGLMCommonDisp through DGElist

1

Entering edit mode

khyu ▴ 10

@khyu-22213

Last seen 5.1 years ago

Dear all,

Here is the code I am using to calculate PValue which gives significant PValue for the identical counts between treatment and control. My dataframe is called "df", and my PValue list is called "top".

Please let me know if there is anything wrong in a code or reason for getting incorrect significant PValue. Thanks!

# Load library
library(edgeR) 
library(RUVSeq) 
library(reshape2)
library(DESeq2) 
library(dplyr)
library(ggplot2)
library(pheatmap) 

control_1 <- c(rep(10, 50), 1000, 2000, 3000, 4000, 5000)
control_2 <- c(rep(10, 50), 1000, 2000, 3000, 4000, 5000)
patient_1 <- c(rep(20, 20),rep(10,10), rep(0,20), 1000, 2000, 3000, 4000, 5000)
patient_2 <- c(rep(20, 20),rep(10,10), rep(0,20), 1000, 2000, 3000, 4000, 5000)

df <- data.frame(Ctl=control_1,
                 Ctl=control_2,
                 Trt=patient_1,
                 Trt=patient_2)

# Store data
x <- as.factor(rep(c("Ctl", "Trt"), each=2))
set <- newSeqExpressionSet(as.matrix(df), phenoData = data.frame(x, row.names =colnames(df)))

## Upper-quartile normalization
set <- betweenLaneNormalization(set, which="upper")

## PValue Calculation using DGEList
design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method = "upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

fit <- glmFit(y, design)
lrt <- glmLRT(fit, contrast=c(1,-1))

top <- topTags(lrt, n=nrow(set))$table

edger • 926 views

ADD COMMENT • link updated 5.1 years ago by Gordon Smyth 51k • written 5.1 years ago by khyu ▴ 10

score 2 · Answer 1 · 2019-10-25

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 3 hours ago

The city by the bay

Putting aside the extraneous code to create set, your contrast doesn't make sense. I would suggest that you review Section 3.2.3 of the edgeR user's guide, if you have not done so already. The hint is: what does each column of your design matrix represent? And with contrast=c(1,-1), what is your null hypothesis?

ADD COMMENT • link 5.1 years ago Aaron Lun ★ 28k

score 1 · Answer 2 · 2019-10-25

The reason for getting significant P-values is your upper-quartile normalization. Your count data is artificially constructed in the way that the upper-quartile of the Trt group is twice as much as that of the Ctl group. As a result, the normalization factors for the Trt samples are twice as much as those for the Ctl samples, which is equivalent to "down-scaling" (by a factor of half) the expression level of all the genes in the Trt group.

score 1 · Answer 3 · 2019-10-25

Aaron has pointed out that you've made a programming mistake with your glmLRT call so that you are not actually testing for differential expression.

Yunshun has pointed out that "upperquartile" normalization does not cope well with a small, unbalanced dataset like this. The default edgeR method would do much better.

But there's a bigger issue with the premise of your question. You have created artificial data in such a way that 36% (20 out of 55) of the genes are strongly up-regulated in the treatment group while another 36% are strongly down-regulated. Why would you not expect edgeR to detection differential expression? Your data shows differential expression, so it is entirely correct that edgeR will find it.

You have made the library sizes the same for treatment and control groups but the counts themselves are not equal.