Hi, I want to study the genotype effect (Susceptible/Intermediate/Resistant) on gene expression 1 day after treatment. I'm aware of the discussion at edgeR: Specifying the "coef"-argument in glmLRT in a two factor study for this kind of experiment. What makes my experiment a bit more complicated is I'm also interested in the gene expression of R genotype 19 days after treatment compared to 1 day after treatment. This is how my experiment looks like.
genotype treatment Time(Days after treatments)
S_C_R1 S C 1
S_C_R2 S C 1
S_C_R3 S C 1
S_T_R1 S T 1
S_T_R2 S T 1
S_T_R3 S T 1
I_C_R1 I C 1
I_C_R2 I C 1
I_C_R3 I C 1
I_T_R1 I T 1
I_T_R2 I T 1
I_T_R3 I T 1
R_C_R1 R C 1
R_C_R2 R C 1
R_C_R3 R C 1
R_T_R1 R T 1
R_T_R2 R T 1
R_T_R3 R T 1
RC_C_R1 R C 19
RC_C_R2 R C 19
RC_C_R3 R C 19
RC_T_R1 R T 19
RC_T_R2 R T 19
1) What is the best design for this kind of experiment? I'm currently splitting the data into two datasets (I+S+R and R+RC) and analyzing them using design <- model.matrix(~0+genotype+genotype:treatment) and design <- model.matrix(~0+time+time:treatment). I noticed the number of DEG for R genotype differs greatly in both datasets, 40 vs 260. Is this normal?
2) Also, there is inconsistency between replicates. BCV distance 1 can't separate the control and treated samples very well. Does my design remove this batch effect?
3) Lastly, how can I check which DEG is significantly higher or lower in R compared to S genotype? Say, DGE for gene A in R and S genotype is -5 and -2 respectively. Is there a test to prove that gene A is significantly down-regulated in R compared to S genotype?
This is the code I use for I+S+R dataset:
raw.data <- read.table("allrna_filSIR.count" , header = TRUE,sep="\t" )
counts <- raw.data[,2:ncol(raw.data)] #select column 2-n
rownames(counts) <- raw.data[ , 1 ] # select column 1 as gene names
d <- DGEList(counts=counts)
keep <- rowSums(cpm(d)>=2) >=6 #filter min 2 cpm in at least 3 samples
d <-d[keep,]
d$samples$lib.size <- colSums(d$counts)
d <- calcNormFactors(d) #Normalization
genotype <- factor(paste(c(rep("S",6),rep("I",6),rep("R",6))))
treatment <- factor(c("C", "C", "C", "T", "T","T","C", "C","C","T", "T","T","C", "C", "C", "T", "T","T"))
design <- model.matrix(~0+genotype+genotype:treatment)
rownames(design) <- colnames(d)
d <- estimateGLMCommonDisp(d, design, verbose=TRUE)
#Disp = 0.05148 , BCV = 0.2269
d <- estimateGLMTrendedDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)
colnames(design)
[1] "genotypeI" "genotypeR" "genotypeS"
[4] "genotypeI:treatmentT" "genotypeR:treatmentT" "genotypeS:treatmentT"
I <- glmLRT(fit,coef=4)
R <- glmLRT(fit,coef=5)
S <- glmLRT(fit,coef=6)
Thanks for your help.
Failure to reject the null does not mean that the gene is not DE. Instead, it just means that you don't have enough evidence to say that it is DE. If you were to say that the gene is constant in condition 2 (through a heatmap or otherwise), then that would be misleading. Your inability to detect DE might be due to technical factors like high variances or lack of power, rather than any actual constancy of expression.
More generally, the whole idea of a DE analysis is to find those genes that are significantly differentially expressed, so some genes on the borderline of significance will inevitably be missed. I don't think this can really be helped, except by improving the power of your study. In your specific case, you may be able to improve power by using a contrast that shares information across multiple conditions. For example, you can do an ANOVA-style comparison to test for changes in either condition 1 or 2, or compare the average of
T
to the average ofC
between the two conditions. Significance in either condition should result in detection of the gene.