Question

EdgeR design problem (genotype+treatment+time)

0

Entering edit mode

melissawong • 0

@melissawong-7631

Last seen 9.4 years ago

Canada

Hi, I want to study the genotype effect (Susceptible/Intermediate/Resistant) on gene expression 1 day after treatment. I'm aware of the discussion at edgeR: Specifying the "coef"-argument in glmLRT in a two factor study for this kind of experiment. What makes my experiment a bit more complicated is I'm also interested in the gene expression of R genotype 19 days after treatment compared to 1 day after treatment. This is how my experiment looks like.

   genotype   treatment   Time(Days after treatments)
S_C_R1   S   C   1
S_C_R2   S   C   1
S_C_R3   S   C   1
S_T_R1   S   T   1
S_T_R2   S   T   1
S_T_R3   S   T   1
I_C_R1   I   C   1
I_C_R2   I   C   1
I_C_R3   I   C   1
I_T_R1   I   T   1
I_T_R2   I   T   1
I_T_R3   I   T   1
R_C_R1   R   C   1
R_C_R2   R   C   1
R_C_R3   R   C   1
R_T_R1   R   T   1
R_T_R2   R   T   1
R_T_R3   R   T   1
RC_C_R1   R   C   19
RC_C_R2   R   C   19
RC_C_R3   R   C   19
RC_T_R1   R   T   19
RC_T_R2   R   T   19

1) What is the best design for this kind of experiment? I'm currently splitting the data into two datasets (I+S+R and R+RC) and analyzing them using design <- model.matrix(~0+genotype+genotype:treatment) and design <- model.matrix(~0+time+time:treatment). I noticed the number of DEG for R genotype differs greatly in both datasets, 40 vs 260. Is this normal?

2) Also, there is inconsistency between replicates. BCV distance 1 can't separate the control and treated samples very well. Does my design remove this batch effect?

3) Lastly, how can I check which DEG is significantly higher or lower in R compared to S genotype? Say, DGE for gene A in R and S genotype is -5 and -2 respectively. Is there a test to prove that gene A is significantly down-regulated in R compared to S genotype?

This is the code I use for I+S+R dataset:

raw.data <- read.table("allrna_filSIR.count" , header = TRUE,sep="\t" )
counts <- raw.data[,2:ncol(raw.data)] #select column 2-n
rownames(counts) <- raw.data[ , 1 ] # select column 1 as gene names
d <- DGEList(counts=counts)
keep <- rowSums(cpm(d)>=2) >=6 #filter min 2 cpm in at least 3 samples
d <-d[keep,]
d$samples$lib.size <- colSums(d$counts)
d <- calcNormFactors(d) #Normalization
genotype <- factor(paste(c(rep("S",6),rep("I",6),rep("R",6))))
treatment <- factor(c("C", "C", "C", "T", "T","T","C", "C","C","T", "T","T","C", "C", "C", "T", "T","T"))
design <- model.matrix(~0+genotype+genotype:treatment)
rownames(design) <- colnames(d)
d <- estimateGLMCommonDisp(d, design, verbose=TRUE)
#Disp = 0.05148 , BCV = 0.2269
d <- estimateGLMTrendedDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)
colnames(design)
[1] "genotypeI" "genotypeR" "genotypeS"
[4] "genotypeI:treatmentT" "genotypeR:treatmentT" "genotypeS:treatmentT"
I <- glmLRT(fit,coef=4)
R <- glmLRT(fit,coef=5)
S <- glmLRT(fit,coef=6)

Thanks for your help.

edger • 2.5k views

ADD COMMENT • link 9.9 years ago melissawong • 0

score 2 · Answer 1 · 2015-04-20

The best design for your experiment is the one-way layout. Combine all factors into a grouping vector like so:

genotype <- rep(c("S", "I", "R"), c(6, 6, 11))
treatment <- c(rep(rep(c("C", "T"), each=3), 3), rep(c("C", "T"), c(3, 2)))
time <- rep(c(1, 19), c(18, 5))
grouping <- factor(paste(genotype, treatment, time, sep="."))
design <- model.matrix(~0 + grouping)
colnames(design) <- levels(grouping)

This approach is simple and makes no assumptions about how the three factors interact with one another. Each combination of factors is represented by a separate group, and has its own coefficient in the design matrix.

For the two approaches you've already done, they use different libraries, so I'm not surprised that you end up with differing numbers of DE genes. Regardless, both are suboptimal compared to using all of the data at once.

Lack of separation on a MDS plot doesn't mean that your replicates are inconsistent. It may just mean that your treatment effect is small - or, at least, smaller than the variability between replicates. In any case, the design matrix will not remove any batch effect, because you haven't included a term that describes the batch effect. You'll need some blocking factor based on pre-existing knowledge (e.g., time of library preparation, sequencing) or you might infer it with methods like RUVSeq. Though, the NB dispersion estimates already look quite sensible, so I'll be reluctant to do too much extra work.

If you want to compare the R to S genotype, you can do this by comparing the R-control group to the S-control group, or by comparing the R-treatment group to S-treatment group. This can be done by running:

RvScontrol <- makeContrasts(R.C.1 - S.C.1, levels=design)
l.control <- glmLRT(fit,contrast=RvScontrol)
RvStreatment <- makeContrasts(R.T.1 - S.T.1, levels=design)
l.treatment <- glmLRT(fit,contrast=RvStreatment)

If you want to get genes that are changing in the same manner in both contrasts, then you can intersect the set of DE genes between the two contrasts. It makes little sense to ask whether a gene is generally "up" in one genotype compared to the other, without considering the treatment effect. For example, how do you describe the behaviour of the gene if it is upregulated in R-control against S-control, and down-regulated in R-treatment against S-treatment?

To finish off, if you want to get the DE genes between time points for R-treated samples, you can do:

Rvtime <- makeContrasts(R.T.19 - R.T.1, levels=design)
l.time <- glmLRT(fit,contrast=Rvtime)

score 0 · Answer 2 · 2015-05-04

Hi Aaron, thank you very much for the design. In response to your question:
"How do you describe the behaviour of the gene if it is upregulated in R-control against S-control, and down-regulated in R-treatment against S-treatment?"
>>>Interestingly, this doesn't happen in my data. 99.99% of my genes don't show different direction across multiple conditions.

I have a more general question for gene expression analysis. For example, gene A is down-regulated in both conditions but the gene is significantly down-regulated (FDR<0.05) only in condition 1. I think it's quite misleading to plot a heatmap showing that gene A is not differentially expressed in condition 2. It seems like data presentation focus mainly on significantly differentially expressed genes but ignore the fact that some genes are regulated in the same way in all conditions (but just not significantly and they differs in fold change). What is your thought on this?

cpm values for Condition 1:

gene C C C T T T logFC FDR
A 14.9 15.2 7.8 5.3 5.7 2.5 -1.49 0.038

cpm values for Condition 2:
gene C C C T T T logFC FDR
A 17.5 14.1 15.1 4.9 10.1 7.4 -1.06 0.35