Question

RNA seq differential expression analysis for large sample size

0

Entering edit mode

Luca • 0

@lucapiacentini-9597

Last seen 9 months ago

Italy

Hi there, I am trying to perform a differential expression analysis on large RNA-seq data (761 samples x 26000 genes), where the samples are divided into 5 classes, e.g. group_1 (n=275); group_2 (n=76); group_3 (n=151); group_4 (n=163) and group_5 (n=96). The use of DESeq2 and edgeR gives similar results, with a large number of differentially expressed genes (adjusted P-value <0.05) when factor levels are compared pairwise, representing about 40-50% of the genes in the whole dataset (which is too many for the phenotype in question). Their number drops to about 3-5% (expected proportion of differentially expressed) when I apply a threshold of |log2 fold-change| >0.5. I have read that limma-voom is recommended for large samples due to speed problems, but this is not a big issue at the moment with either DESeq2 or edgeR on this data. I was wondering whether there might be specific reasons to use alternative differential expression analysis methods to handle large samples and a variable of interest with many levels, or whether DESeq2 and edgeR also work effectively in this case. Thank you for any suggestions.

edgeR DESeq2 • 3.4k views

ADD COMMENT • link 23 months ago Luca • 0

1

Entering edit mode

ATpoint ★ 4.9k

@atpoint-13662

Last seen 5 days ago

Germany

When you have large sample sizes and/or a lot of power you may want to test against a different Null hypothesis than "a gene has a fold change of 0", which is the default. You can use the lfc argument in DESeq2::lfcShrink or in edgeR::glmTreat for this, thereby enriching for genes with good evidence to have fold changes beyond the value you choose, for example 1.5. Other than that, you could use limma-voom which scales better at large sample size. The function for a fold change testing would be limma::treat here. As for any analysis, be sure that you explore the data first, e.g. by PCA or MDS plots to ensure that there is no unwanted confounding of any sort (batch effects for example) that confound the analysis.

ADD COMMENT • link 23 months ago ATpoint ★ 4.9k

0

Entering edit mode

First of all, thank you for your prompt suggestions. I tried using edgeR::glmTreat and the number of differentially expressed genes was greatly reduced. However, I noticed that using edgeR::glmTreat with a different value of the lfc argument strongly affects the shape of the histogram of the distribution of nominal P-values. In fact, the histogram of P-values showed the expected (i.e. for truly differences) uniformly flat distribution over the unit interval (null P-values) with a peak close to zero (alternative hypothesis P-values) for lfc=0, but the opposite (the peak is close to one!) for lfc=0.5. Together with the use of PCA or MDS, the histogram of the distribution of nominal P-values is also useful for assessing the presence of effects related to unintended variability (aka latent and unknown variables), and for this reason I was wondering if its use is only valid for tests that have as null hypothesis 'the gene has a fold-change of 0', or if there is another possible interpretation.

ADD REPLY • link 23 months ago Luca • 0

1

Entering edit mode

glmTreat() does not produce uniformly distributed p-values, so examining the p-value histogram is not helpful. Any statistical test like glmTreat() or treat() with an intervall null hypothesis will yield a skew distribution of p-values -- that is expected and does not imply any problems with the fit or the method.

I am not a big fan of p-value histograms as a basic QC plot even for lfc=0 tests (I haven't recommended them in any of the limma or edgeR documentation or case studies) but certainly not in conjunction with glmTreat(). In general, then is no requirement in statistical theory for p-values to have a strictly uniform distributution under the null hypothesis, especially not for count data.

ADD REPLY • link 23 months ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon Smyth, thank you very much for the clarification regarding the use of glmTreat() and the considerations regarding the distribution of P-values. Although I have found the use of P-value histograms useful in the past, I will also bear your valuable opinion in mind for the future.

ADD REPLY • link 23 months ago Luca • 0

1

Entering edit mode

The genes below the lfc threshold are heavily skewed towards large pvalues, and this trend increases with larger lfc thresholds, so as Gordon Smyth says there is no expectation of a uniform distribution.

ADD REPLY • link 23 months ago ATpoint ★ 4.9k

0

Entering edit mode

Thank you again ATpoint.

ADD REPLY • link 23 months ago Luca • 0

score 1 · Accepted Answer · 2023-07-25

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 14 days ago

United States

Just one note, in addition to the above pertinent pieces from ATpoint :

You should really consider to include RUV or SVA factors in such a large experiment. Even low correlations between batch variation and comparisons of interest will produce spurious rejections at such sample sizes. More here under Experiments with many samples

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs

And there is code for doing this in our workflow:

https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#removing-hidden-batch-effects

I routinely work with dataset in the 100s and every time we model unwanted variation.

ADD COMMENT • link 23 months ago Michael Love 43k

0

Entering edit mode

Thank you Michael for your specification... and yes, I agree. I usually couple differential expression analysis with RUVseq (or SVA or its derivatives) to identify possible sources of unintended variation and include them as covariates in the statistical model. RUVr seems to work better than RUVg for a multifactorial design. To better understand whether there is a clear (or at least partial) relationship with already known variables, I also usually do a correlation/association analysis both between latent and technical variables (experimental batches, e.g. sample processing, sequencing etc.) and with biological variables (e.g. demographic/clinical data in the case of patients) and then decide how to design the statistical model. Thanks again to both of you for now.

ADD REPLY • link 23 months ago Luca • 0

0

Entering edit mode

To better understand whether there is a clear (or at least partial) relationship with already known variables, I also usually do a correlation/association analysis both between latent and technical variables (experimental batches, e.g. sample processing, sequencing etc.) and with biological variables (e.g. demographic/clinical data in the case of patients) and then decide how to design the statistical model.

Exactly!

ADD REPLY • link 23 months ago Michael Love 43k