Question

Can't find any differentially expressed genes

1

Entering edit mode

s3462450 ▴ 10

@s3462450-7553

Last seen 7.6 years ago

Australia

I am new user of edge R package for finding out differentially expressed genes. I have two genotypes x 2 treatments (C,S) and 3 replication. I get 0 differentially expressed genes amongst the genotypes and treatments. All reads per sample are more than 15 million. Please advice some troubleshooting and possible points to look at.

Commands I am using are :

> x <- c(“filename1”,”filename2”,”filename3”)
> counts <- readDGE(x) 
> counts2 <- counts$counts
> bad <- c("no_feature","ambiguous","too_low_qual","alignment_not_unique")
> throw <- rownames(counts2) %in% bad 
> cpms <- cpm(counts2) 
> keep <- rowSums(cpms > 1) >=3 & !throw
> counts <- counts2[keep,]
> dim(counts)
> design <- c("C","C","C","S","S","S") 
> design 
Output: “C”,”C”,”C”,”S”,”S”,”S”
> factor(design) 
Levels -> C S (2 levels)
> factor(design) -> stress 
Output: stress “C” “C” “C” “S” “S” “S”
Levels C S
> design <- model.matrix(~ stress) 
> rownames(design) <- colnames(counts)
> y <- DGEList(counts=counts)
> y <- calcNormFactors(y) 
> y <- estimateCommonDisp(y)
> y <- estimateTagwiseDisp(y)
> y <- calcNormFactors(y) 
> fit <- glmFit(y, design)
> lrt <- glmLRT(fit)
> lrt$table
> topTags(lrt)
> decideTestsDGE(lrt)
> summary(decideTestsDGE(lrt))
Result :
-1    0
 0 1400
 1    0

rnaseq differential gene expression design matrix edgeR • 2.9k views

ADD COMMENT • link updated 9.7 years ago by Gordon Smyth 52k • written 9.7 years ago by s3462450 ▴ 10

0

Entering edit mode

You mentioned you have two genotypes as well? Your design only has a treatment effect, though? With only 6 samples, though, you also don't really have enough replication to model a genotype effect.

ADD REPLY • link 9.7 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Thanks James and Steve for your response. It's a plant genotypes (Tolerant and susceptible) and challenged with stress at two time points (1 and 2) and 3 replications for each challenge. Please help me if I am designing the wrong matrix. And I took just one just 6 (C, S) samples to do differential expression study, in total i have 24 samples. Shall I take all at once into accounts.

ADD REPLY • link 9.7 years ago s3462450 ▴ 10

0

Entering edit mode

Replicated are biological per plant in a stress and control condition

ADD REPLY • link 9.7 years ago s3462450 ▴ 10

1

Entering edit mode

In general you should fit a model with all samples at once, and then test for the differences you care about using contrasts. Something like

design <- model.matrix(~0+factor(rep(c("CT","CS","ST","SS), each = 3)))

contrast <- makeContrasts(CT-CS, ST-SS, levels = design)

fit <- glmFit(y, design)

ccomp <- glmLRT(fit, contrast = contrast[,1])

scomp <- glmLRT(fit, contrast = contrast[,2])

I say 'in general' because this makes the implicit assumption that the intra-group variability (or dispersion) is expected to be similar across your four groups. You will be using all data in hand in order to estimate dispersions, and if one genotype is inherently more variable (for some reason), then you will be 'polluting' the dispersion estimates for the other genotype.

But like I said, in general you want to use all data in hand, because more data will result in better dispersion estimates, which will result in more power to detect true differences.

ADD REPLY • link 9.7 years ago James W. MacDonald 67k

score 1 · Answer 1 · 2015-04-15

Your code looks fine as far as it goes (although with some redundant steps). But, yes, there are heaps of things you can (and should) do to trouble-shoot. One always needs to make exploratory pictures as part of an analysis, rather than just running through a pipeline and looking at the list of genes at the end.

First, up plotMDS(y) should come right after calcNormFactor(y). Do the samples look like they separate? Is there a batch effect?

Second, look for genes that are extremely variable between the replicates. These will make all the results less precise. You can look for hyper-variable genes like this:

y2 <- estimateDisp(y, design, robust=TRUE)
plotBCV(y2)
summary(y2$df.prior)
o <- order(y2$df.prior)
o2 <- o[1:20]
data.frame(y2$counts[o2,], prior.df=y2$df.prior[o2], BCV=sqrt(y2$tagwise.dispersion[o2]))

Do there look to be hyper-variable genes? If so, what could be causing them?

Now try

fit <- glmFit(y2, design)
lrt <- glmLRT(fit)
topTable(lrt)
plotSmear(lrt)

How large are the logFC? Do some logFC stand out from the others, or it is a big cloud? (If the latter, it might all be just noise.)

If none of this helps, then try using plotSmear to compare pairs of individual libraries. Are pairs of replicates more alike than pairs from different treatments?

Finally, as James says, it might be that there really is no significant DE in your experiment, but you should at least try to analyse all your data before deciding that, not just six libraries. You should put all 24 libraries into the DGEList object at once.

Whether you get significant DE depends on how noisy your replicates are. At my institute, we work with genetically identical mice, and our replicates tend to be very consistent. However data from many other sources tends to be more noisy.

score 0 · Answer 2 · 2015-04-06

In my experience, this is a very common result. Especially if you are testing two groups that have subtle differences. Why this would be so is not hard to understand.

First an example. Let's say you want to know if people in country X are heavier on average than country Y. You don't have much money, so you grab your bathroom scale, and fly to country X. You round up the first three people you see, and weigh them. Then you jump on a plane to country Y and repeat the process. Then you fire up R, input the six values and fit a t-test. If there is a difference, how firmly do you believe the results represent the true difference between the two countries? Do you think you could convince anybody else that your results are meaningful? I didn't think so...

When you do a 3x3 experiment, that is essentially what you are doing, but you are doing it like 20,000 times! It's hard to believe that your results would be 'good' (or useful or meaningful) for a single comparison, let alone 20,000 of them. Since you only have three samples per group, any given statistic has to be really big to get a significant p-value, and since you have done thousands of simultaneous comparisons, the statistic has to be that much larger.

It would actually be far more surprising if you got any significant results. Unfortunately, with a 7% payline, this sort of experiment is all most PIs (that I know, at least) feel they can afford to fund.