EdgeR design problem (genotype+treatment+time)
2
0
Entering edit mode
@melissawong-7631
Last seen 9.4 years ago
Canada

Hi, I want to study the genotype effect (Susceptible/Intermediate/Resistant) on gene expression 1 day after treatment. I'm aware of the discussion at edgeR: Specifying the "coef"-argument in glmLRT in a two factor study for this kind of experiment. What makes my experiment a bit more complicated is I'm also interested in the gene expression of R genotype 19 days after treatment compared to 1 day after treatment. This is how my experiment looks like.

    genotype    treatment    Time(Days after treatments)
S_C_R1    S    C    1
S_C_R2    S    C    1
S_C_R3    S    C    1
S_T_R1    S    T    1
S_T_R2    S    T    1
S_T_R3    S    T    1
I_C_R1    I    C    1
I_C_R2    I    C    1
I_C_R3    I    C    1
I_T_R1    I    T    1
I_T_R2    I    T    1
I_T_R3    I    T    1
R_C_R1    R    C    1
R_C_R2    R    C    1
R_C_R3    R    C    1
R_T_R1    R    T    1
R_T_R2    R    T    1
R_T_R3    R    T    1
RC_C_R1    R    C    19
RC_C_R2    R    C    19
RC_C_R3    R    C    19
RC_T_R1    R    T    19
RC_T_R2    R    T    19

1) What is the best design for this kind of experiment? I'm currently splitting the data into two datasets (I+S+R and R+RC) and analyzing them using design <- model.matrix(~0+genotype+genotype:treatment) and design <- model.matrix(~0+time+time:treatment). I noticed the number of DEG for R genotype differs greatly in both datasets, 40 vs 260. Is this normal?

2) Also, there is inconsistency between replicates. BCV distance 1 can't separate the control and treated samples very well. Does my design remove this batch effect?

3) Lastly, how can I check which DEG is significantly higher or lower in R compared to S genotype? Say, DGE for gene A in R and S genotype is -5 and -2 respectively. Is there a test to prove that gene A is significantly down-regulated in R compared to S genotype?

This is the code I use for I+S+R dataset:

raw.data <- read.table("allrna_filSIR.count" , header = TRUE,sep="\t" )
counts <- raw.data[,2:ncol(raw.data)] #select column 2-n
rownames(counts) <- raw.data[ , 1 ] # select column 1 as gene names
d <- DGEList(counts=counts) 
keep <- rowSums(cpm(d)>=2) >=6 #filter min 2 cpm in at least 3 samples 
d <-d[keep,]
d$samples$lib.size <- colSums(d$counts)
d <- calcNormFactors(d) #Normalization
genotype <- factor(paste(c(rep("S",6),rep("I",6),rep("R",6))))
treatment <- factor(c("C", "C", "C", "T", "T","T","C", "C","C","T", "T","T","C", "C", "C", "T", "T","T"))
design <- model.matrix(~0+genotype+genotype:treatment)
rownames(design) <- colnames(d)
d <- estimateGLMCommonDisp(d, design, verbose=TRUE)
#Disp = 0.05148 , BCV = 0.2269 
d <- estimateGLMTrendedDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)
colnames(design)
[1] "genotypeI"            "genotypeR"            "genotypeS"           
[4] "genotypeI:treatmentT" "genotypeR:treatmentT" "genotypeS:treatmentT"
I <- glmLRT(fit,coef=4)
R <- glmLRT(fit,coef=5) 
S <- glmLRT(fit,coef=6)

Thanks for your help.

edger • 2.5k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 8 hours ago
The city by the bay

The best design for your experiment is the one-way layout. Combine all factors into a grouping vector like so:

genotype <- rep(c("S", "I", "R"), c(6, 6, 11))
treatment <- c(rep(rep(c("C", "T"), each=3), 3), rep(c("C", "T"), c(3, 2)))
time <- rep(c(1, 19), c(18, 5))
grouping <- factor(paste(genotype, treatment, time, sep="."))
design <- model.matrix(~0 + grouping)
colnames(design) <- levels(grouping)

This approach is simple and makes no assumptions about how the three factors interact with one another. Each combination of factors is represented by a separate group, and has its own coefficient in the design matrix.

For the two approaches you've already done, they use different libraries, so I'm not surprised that you end up with differing numbers of DE genes. Regardless, both are suboptimal compared to using all of the data at once.

Lack of separation on a MDS plot doesn't mean that your replicates are inconsistent. It may just mean that your treatment effect is small - or, at least, smaller than the variability between replicates. In any case, the design matrix will not remove any batch effect, because you haven't included a term that describes the batch effect. You'll need some blocking factor based on pre-existing knowledge (e.g., time of library preparation, sequencing) or you might infer it with methods like RUVSeq. Though, the NB dispersion estimates already look quite sensible, so I'll be reluctant to do too much extra work.

If you want to compare the R to S genotype, you can do this by comparing the R-control group to the S-control group, or by comparing the R-treatment group to S-treatment group. This can be done by running:

RvScontrol <- makeContrasts(R.C.1 - S.C.1, levels=design)
l.control <- glmLRT(fit,contrast=RvScontrol)
RvStreatment <- makeContrasts(R.T.1 - S.T.1, levels=design)
l.treatment <- glmLRT(fit,contrast=RvStreatment)

If you want to get genes that are changing in the same manner in both contrasts, then you can intersect the set of DE genes between the two contrasts. It makes little sense to ask whether a gene is generally "up" in one genotype compared to the other, without considering the treatment effect. For example, how do you describe the behaviour of the gene if it is upregulated in R-control against S-control, and down-regulated in R-treatment against S-treatment?

To finish off, if you want to get the DE genes between time points for R-treated samples, you can do:

Rvtime <- makeContrasts(R.T.19 - R.T.1, levels=design)
l.time <- glmLRT(fit,contrast=Rvtime)
ADD COMMENT
0
Entering edit mode
@melissawong-7631
Last seen 9.4 years ago
Canada

Hi Aaron, thank you very much for the design. In response to your question:
"How do you describe the behaviour of the gene if it is upregulated in R-control against S-control, and down-regulated in R-treatment against S-treatment?"
>>>Interestingly, this doesn't happen in my data. 99.99% of my genes don't show different direction across multiple conditions.

I have a more general question for gene expression analysis. For example, gene A is down-regulated in both conditions but the gene is significantly down-regulated (FDR<0.05) only in condition 1. I think it's quite misleading to plot a heatmap showing that gene A is not differentially expressed in condition 2. It seems like data presentation focus mainly on significantly differentially expressed genes but ignore the fact that some genes are regulated in the same way in all conditions (but just not significantly and they differs in fold change). What is your thought on this?

cpm values for Condition 1:

gene C C C T T T logFC FDR
A 14.9 15.2 7.8 5.3 5.7 2.5 -1.49 0.038

cpm values for Condition 2:
gene C C C T T T logFC FDR
A 17.5 14.1 15.1 4.9 10.1 7.4 -1.06 0.35

 

ADD COMMENT
0
Entering edit mode

Failure to reject the null does not mean that the gene is not DE. Instead, it just means that you don't have enough evidence to say that it is DE. If you were to say that the gene is constant in condition 2 (through a heatmap or otherwise), then that would be misleading. Your inability to detect DE might be due to technical factors like high variances or lack of power, rather than any actual constancy of expression.

More generally, the whole idea of a DE analysis is to find those genes that are significantly differentially expressed, so some genes on the borderline of significance will inevitably be missed. I don't think this can really be helped, except by improving the power of your study. In your specific case, you may be able to improve power by using a contrast that shares information across multiple conditions. For example, you can do an ANOVA-style comparison to test for changes in either condition 1 or 2, or compare the average of T to the average of C between the two conditions. Significance in either condition should result in detection of the gene.

ADD REPLY

Login before adding your answer.

Traffic: 590 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6