EdgeR DGElist - equal counts between treatment and control incorrectly give significant PValue using estimateGLMCommonDisp through DGElist
3
1
Entering edit mode
khyu ▴ 10
@khyu-22213
Last seen 5.2 years ago

Dear all,

Here is the code I am using to calculate PValue which gives significant PValue for the identical counts between treatment and control. My dataframe is called "df", and my PValue list is called "top".

Please let me know if there is anything wrong in a code or reason for getting incorrect significant PValue. Thanks!

# Load library
library(edgeR) 
library(RUVSeq) 
library(reshape2)
library(DESeq2) 
library(dplyr)
library(ggplot2)
library(pheatmap) 

control_1 <- c(rep(10, 50), 1000, 2000, 3000, 4000, 5000)
control_2 <- c(rep(10, 50), 1000, 2000, 3000, 4000, 5000)
patient_1 <- c(rep(20, 20),rep(10,10), rep(0,20), 1000, 2000, 3000, 4000, 5000)
patient_2 <- c(rep(20, 20),rep(10,10), rep(0,20), 1000, 2000, 3000, 4000, 5000)

df <- data.frame(Ctl=control_1,
                 Ctl=control_2,
                 Trt=patient_1,
                 Trt=patient_2)

# Store data
x <- as.factor(rep(c("Ctl", "Trt"), each=2))
set <- newSeqExpressionSet(as.matrix(df), phenoData = data.frame(x, row.names =colnames(df)))

## Upper-quartile normalization
set <- betweenLaneNormalization(set, which="upper")

## PValue Calculation using DGEList
design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method = "upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

fit <- glmFit(y, design)
lrt <- glmLRT(fit, contrast=c(1,-1))

top <- topTags(lrt, n=nrow(set))$table
edger • 946 views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 8 hours ago
The city by the bay

Putting aside the extraneous code to create set, your contrast doesn't make sense. I would suggest that you review Section 3.2.3 of the edgeR user's guide, if you have not done so already. The hint is: what does each column of your design matrix represent? And with contrast=c(1,-1), what is your null hypothesis?

ADD COMMENT
1
Entering edit mode
Yunshun Chen ▴ 900
@yunshun-chen-5451
Last seen 5 weeks ago
Australia

The reason for getting significant P-values is your upper-quartile normalization. Your count data is artificially constructed in the way that the upper-quartile of the Trt group is twice as much as that of the Ctl group. As a result, the normalization factors for the Trt samples are twice as much as those for the Ctl samples, which is equivalent to "down-scaling" (by a factor of half) the expression level of all the genes in the Trt group.

ADD COMMENT
1
Entering edit mode
@gordon-smyth
Last seen 8 hours ago
WEHI, Melbourne, Australia

Aaron has pointed out that you've made a programming mistake with your glmLRT call so that you are not actually testing for differential expression.

Yunshun has pointed out that "upperquartile" normalization does not cope well with a small, unbalanced dataset like this. The default edgeR method would do much better.

But there's a bigger issue with the premise of your question. You have created artificial data in such a way that 36% (20 out of 55) of the genes are strongly up-regulated in the treatment group while another 36% are strongly down-regulated. Why would you not expect edgeR to detection differential expression? Your data shows differential expression, so it is entirely correct that edgeR will find it.

You have made the library sizes the same for treatment and control groups but the counts themselves are not equal.

ADD COMMENT

Login before adding your answer.

Traffic: 757 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6