Hi there, I am trying to perform a differential expression analysis on large RNA-seq data (761 samples x 26000 genes), where the samples are divided into 5 classes, e.g. group_1 (n=275); group_2 (n=76); group_3 (n=151); group_4 (n=163) and group_5 (n=96). The use of DESeq2 and edgeR gives similar results, with a large number of differentially expressed genes (adjusted P-value <0.05) when factor levels are compared pairwise, representing about 40-50% of the genes in the whole dataset (which is too many for the phenotype in question). Their number drops to about 3-5% (expected proportion of differentially expressed) when I apply a threshold of |log2 fold-change| >0.5. I have read that limma-voom is recommended for large samples due to speed problems, but this is not a big issue at the moment with either DESeq2 or edgeR on this data. I was wondering whether there might be specific reasons to use alternative differential expression analysis methods to handle large samples and a variable of interest with many levels, or whether DESeq2 and edgeR also work effectively in this case. Thank you for any suggestions.
Thank you Michael for your specification... and yes, I agree. I usually couple differential expression analysis with RUVseq (or SVA or its derivatives) to identify possible sources of unintended variation and include them as covariates in the statistical model. RUVr seems to work better than RUVg for a multifactorial design. To better understand whether there is a clear (or at least partial) relationship with already known variables, I also usually do a correlation/association analysis both between latent and technical variables (experimental batches, e.g. sample processing, sequencing etc.) and with biological variables (e.g. demographic/clinical data in the case of patients) and then decide how to design the statistical model. Thanks again to both of you for now.
Exactly!