Hi,
I am using a piece of pre-existing code to perform differential expression analysis with a single case and multiple (6) controls. I am a non-statistician and am somewhat new to the minutiae of edgeR.
I include the core of the code below. Essentially I want to confirm whether the solution currently being used is suitable, or if there is a better alternative for the 1 vs 6 scenario.
Any input would be appreciated.
Best,
Gavin
y <- calcNormFactors(y)
#creatin the design matrix
condition=factor(c(rep("T", 1), rep("N", ncol(allrpkms)-1)))
data.frame(Sample=colnames(y),condition)
design <- model.matrix(~condition)
rownames(design) <- colnames(y)
#estimating the dispersion for the dataset
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
#determine differentially expressed genes. Fit genewise glms
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
#Conduct likelihood ratio tests for tumour vs normal tissue differences and show the top genes:
deseqres<-topTags( lrt , n = nrow( lrt$table ) )$table
As Ryan has said, the p-values are fine; type I error rates will still be controlled in this situation. The problem is with power, or specifically, your lack of it. Throwing an extra log-fold change threshold into the mix will probably just make things worse - but if you must do so, use
glmTreat
rather than naively thresholding on the reported values.More generally, even with individualized medicine, I suspect that you should be able to get multiple samples by collecting them repeatedly from the same patient (e.g., multiple biopsies, or sample at multiple times). These "replicates" will be more technical than biological in nature, as they won't allow generalization of your conclusions to other patients. However, that might not matter if all you care about are the inferences for this particular patient. (Don't rush ahead and do this, though. This strategy might not play nicely with your control samples, which are presumably across different individuals. In the most extreme case, you might underestimate the dispersion for the controls and overestimate them for the patient samples, which would probably have odd and unpredictable effects on edgeR's behaviour. Perhaps there's an argument for using group-specific dispersions here...)
Gordon has said in the past, in response to a similar question of mine, that comparisons between a high-variance group and a low-variance groupusing edgeR (in which the estimated dispersion is intermediate between the two groups) should be slightly over-conservative if anything.
Thanks again for all the input Aaron - it is a great help.
Regarding the multiple biopsies, I absolutely agree and have pushed the same point many times, but unfortunately there is still a disconnect between standard clinical practice and the needs of researchers on the clinical threshold. Change happens but slowly...
The p-values are as accurate as they can be, but you might not get a whole lot of DE calls because of the small sample size. The other aspect to keep in mind is that the estimated dispersions will only reflect the control group variability, since there is no way to estimate dispersions for the single case. In this sense, I believe that the test is roughly analogous to fitting a distribution to the controls and then computing a z-score of the case relative to that distribution.
In any case, using a method like edgeR that accounts for gene-specific variability is certainly going to be better than relying purely on logFC. Relying on logFC would basically be equivalent to assuming that all genes have equal variance.