I am a new user of the edgeR package for finding differentially expressed genes. I have 2 genotypes x 2 treatments (C, S) with 3 replicates each. I get 0 differentially expressed genes between the genotypes and treatments. Every sample has more than 15 million reads. Please advise on troubleshooting and on possible points to look at.
The commands I am using are:
> x <- c("filename1","filename2","filename3")
> counts <- readDGE(x)
> counts2 <- counts$counts
> bad <- c("no_feature","ambiguous","too_low_qual","alignment_not_unique")
> throw <- rownames(counts2) %in% bad
> cpms <- cpm(counts2)
> keep <- rowSums(cpms > 1) >= 3 & !throw
> counts <- counts2[keep,]
> dim(counts)
> design <- c("C","C","C","S","S","S")
> design
Output: "C","C","C","S","S","S"
> factor(design)
Levels -> C S (2 levels)
> factor(design) -> stress
Output: stress "C" "C" "C" "S" "S" "S" Levels C S
> design <- model.matrix(~ stress)
> rownames(design) <- colnames(counts)
> y <- DGEList(counts=counts)
> y <- calcNormFactors(y)
> y <- estimateCommonDisp(y)
> y <- estimateTagwiseDisp(y)
> y <- calcNormFactors(y)
> fit <- glmFit(y, design)
> lrt <- glmLRT(fit)
> lrt$table
> topTags(lrt)
> decideTestsDGE(lrt)
> summary(decideTestsDGE(lrt))
Result:
-1     0
0   1400
1      0
You mentioned you have two genotypes as well, but your design only has a treatment effect. With only 6 samples, you also don't really have enough replication to model a genotype effect.
Thanks James and Steve for your responses. These are two plant genotypes (tolerant and susceptible), challenged with stress at two time points (1 and 2), with 3 replicates for each challenge. Please tell me if I am setting up the design matrix incorrectly. I took just 6 samples (C, S) for this differential expression study; in total I have 24 samples. Should I take all of them into account at once?
The replicates are biological, one plant per replicate, in both the stress and control conditions.
In general you should fit a model with all samples at once, and then test for the differences you care about using contrasts. Something like
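A minimal sketch of that approach, assuming the 24 samples are 2 genotypes x 2 time points x 2 treatments x 3 replicates (8 groups). The sample layout, the level names (`Tol`, `Sus`, `T1`, `T2`) and the helpers `make_contrast` and `test_contrast` are hypothetical and only for illustration. The edgeR calls are wrapped in a function that is defined but not run here, because it needs your real count matrix.

```r
# Hypothetical layout: 2 genotypes x 2 time points x 2 treatments x 3 replicates.
# With real data, `group` must follow the column order of the count matrix.
samples <- expand.grid(
  rep       = 1:3,
  treatment = c("C", "S"),
  time      = c("T1", "T2"),
  genotype  = c("Tol", "Sus"),
  stringsAsFactors = FALSE
)

# One factor level per genotype/time/treatment combination
group <- factor(paste(samples$genotype, samples$time, samples$treatment, sep = "."))

# Group-means design (no intercept): one coefficient per group
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Build a contrast vector: +1 for groups in `plus`, -1 for groups in `minus`
make_contrast <- function(plus, minus) {
  v <- setNames(numeric(ncol(design)), colnames(design))
  v[plus]  <- 1
  v[minus] <- -1
  v
}

# Stress vs control within each genotype at time point 1
stress_in_tol_t1 <- make_contrast("Tol.T1.S", "Tol.T1.C")
stress_in_sus_t1 <- make_contrast("Sus.T1.S", "Sus.T1.C")

# Does the stress response differ between genotypes at time point 1?
# (Tol.T1.S - Tol.T1.C) - (Sus.T1.S - Sus.T1.C)
genotype_by_stress_t1 <- make_contrast(c("Tol.T1.S", "Sus.T1.C"),
                                       c("Tol.T1.C", "Sus.T1.S"))

# Fit once on all samples, then test one contrast (requires edgeR).
# `counts` is the filtered gene x sample matrix.
test_contrast <- function(counts, group, design, contrast) {
  y   <- edgeR::DGEList(counts = counts, group = group)
  y   <- edgeR::calcNormFactors(y)
  y   <- edgeR::estimateDisp(y, design)   # dispersions estimated from all groups
  fit <- edgeR::glmFit(y, design)
  edgeR::glmLRT(fit, contrast = contrast)
}
```

With edgeR installed and a filtered count matrix in hand, something like `topTags(test_contrast(counts, group, design, stress_in_tol_t1))` would list the top genes for that comparison. Fitting a group-means model keeps every comparison a simple difference of coefficients, which tends to be easier to read than a factorial formula with interaction terms.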
I say 'in general' because this makes the implicit assumption that the intra-group variability (or dispersion) is expected to be similar across your four groups. You will be using all data in hand in order to estimate dispersions, and if one genotype is inherently more variable (for some reason), then you will be 'polluting' the dispersion estimates for the other genotype.
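One rough way to check that assumption is to estimate the common dispersion separately within each genotype and compare the values. This is only a sketch: the helper name `common_disp_by_genotype` and its arguments are hypothetical, and it assumes `counts` is the filtered gene x sample matrix with `genotype` and `group` given in the same order as its columns.

```r
# Hypothetical helper: common dispersion estimated within each genotype separately.
# counts:   gene x sample count matrix
# genotype: genotype label per sample (same order as the columns of counts)
# group:    full group label per sample (e.g. genotype.time.treatment)
common_disp_by_genotype <- function(counts, genotype, group) {
  genotype <- factor(genotype)
  sapply(levels(genotype), function(g) {
    idx    <- genotype == g
    grp    <- droplevels(factor(group)[idx])
    y      <- edgeR::DGEList(counts = counts[, idx, drop = FALSE], group = grp)
    y      <- edgeR::calcNormFactors(y)
    design <- model.matrix(~ 0 + grp)      # group means within this genotype
    y      <- edgeR::estimateDisp(y, design)
    y$common.dispersion                    # sqrt() of this is the BCV
  })
}
```

If the two values are close, pooling all samples into one fit is probably reasonable; if one genotype is markedly more variable, analysing the genotypes separately may be worth considering.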
But like I said, in general you want to use all data in hand, because more data will result in better dispersion estimates, which will result in more power to detect true differences.