Question: Question about edgeR usage with one case and multiple controls
0
3.4 years ago by
gavin.oliver0 wrote:

Hi,

I am using a piece of pre-existing code to perform differential expression analysis with a single case and multiple (6) controls.  I am a non-statistician and am somewhat new to the minutiae of edgeR.

I include the core of the code below.  Essentially I want to confirm whether the solution currently being used is suitable, or if there is a better alternative for the 1 vs 6 scenario.

Any input would be appreciated.

Best,

Gavin

y <- calcNormFactors(y)
#creatin the design matrix
condition=factor(c(rep("T", 1), rep("N", ncol(allrpkms)-1)))
data.frame(Sample=colnames(y),condition)
design <- model.matrix(~condition)
rownames(design) <- colnames(y)
#estimating the dispersion for the dataset
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
#determine differentially expressed genes. Fit genewise glms
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
#Conduct likelihood ratio tests for tumour vs normal tissue differences and show the top genes:
deseqres<-topTags( lrt , n = nrow( lrt$table ) )$table

modified 3.4 years ago • written 3.4 years ago by gavin.oliver0
2
3.4 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

As you may have realized, the experimental setup isn't ideal. You don't get a lot of power for detecting changes because, obviously, it's hard to trust DE when you only have a single "case" sample in which that DE can occur. That being said, the edgeR code itself is fine. Remember to filter out low-abundance genes, if you haven't already. You can also make life easier for yourself by setting n=Inf in topTags.

0
3.4 years ago by
gavin.oliver0 wrote:

Hi Aaron,

I fully understand your point about the single case, however this is essentially within a research-based individualized medicine (n of 1) scenario, which ties our hands somewhat.

Low expressed genes have already been filtered, but the advice is appreciated.

In this 1 vs 6 setup, can we put any stock in the p-values output by EdgeR, or should we work solely from fold-change in your opinion?

Best,

Gavin

1

As Ryan has said, the p-values are fine; type I error rates will still be controlled in this situation. The problem is with power, or specifically, your lack of it. Throwing an extra log-fold change threshold into the mix will probably just make things worse - but if you must do so, use glmTreat rather than naively thresholding on the reported values.

More generally, even with individualized medicine, I suspect that you should be able to get multiple samples by collecting them repeatedly from the same patient (e.g., multiple biopsies, or sample at multiple times). These "replicates" will be more technical than biological in nature, as they won't allow generalization of your conclusions to other patients. However, that might not matter if all you care about are the inferences for this particular patient. (Don't rush ahead and do this, though. This strategy might not play nicely with your control samples, which are presumably across different individuals. In the most extreme case, you might underestimate the dispersion for the controls and overestimate them for the patient samples, which would probably have odd and unpredictable effects on edgeR's behaviour. Perhaps there's an argument for using group-specific dispersions here...)

Gordon has said in the past, in response to a similar question of mine, that comparisons between a high-variance group and a low-variance groupusing edgeR (in which the estimated dispersion is intermediate between the two groups) should be slightly over-conservative if anything.

Thanks again for all the input Aaron - it is a great help.

Regarding the multiple biopsies, I absolutely agree and have pushed the same point many times, but unfortunately there is still a disconnect between standard clinical practice and the needs of researchers on the clinical threshold.  Change happens but slowly...

The p-values are as accurate as they can be, but you might not get a whole lot of DE calls because of the small sample size. The other aspect to keep in mind is that the estimated dispersions will only reflect the control group variability, since there is no way to estimate dispersions for the single case. In this sense, I believe that the test is roughly analogous to fitting a distribution to the controls and then computing a z-score of the case relative to that distribution.

In any case, using a method like edgeR that accounts for gene-specific variability is certainly going to be better than relying purely on logFC. Relying on logFC would basically be equivalent to assuming that all genes have equal variance.