Hi all,
We are performing a transcriptome analysis with DESeq2.
We have three factors: treatment (control and drought), climate (CO2 and ambient) and zone in the leaf (zone1, zone2, zone3).
For each possible combination, we have 3 biological replicates (no technical replicates). The PCA showed us that we have 2 outliers. They are not in the same group, so this leaves us with 2 biological replicates for two of our sample groups.
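To make the bookkeeping concrete, here is a minimal Python sketch that tallies replicates per group before and after dropping two samples. The factor levels come from the description above; which two samples are the outliers is not stated, so the two picked here are purely hypothetical placeholders.

```python
from collections import Counter
from itertools import product

# Factor levels as described in the post.
treatments = ["control", "drought"]
climates = ["CO2", "ambient"]
zones = ["zone1", "zone2", "zone3"]
REPLICATES = 3

# One entry per sample: (treatment, climate, zone, replicate number).
samples = [
    (t, c, z, r)
    for t, c, z in product(treatments, climates, zones)
    for r in range(1, REPLICATES + 1)
]

# HYPOTHETICAL outliers: the post only says they fall in two different groups.
outliers = {
    ("drought", "CO2", "zone2", 1),
    ("control", "ambient", "zone3", 2),
}

kept = [s for s in samples if s not in outliers]

# Count remaining replicates per treatment x climate x zone group.
per_group = Counter(s[:3] for s in kept)
reduced = {g: n for g, n in per_group.items() if n < REPLICATES}

print(len(samples), len(kept), min(per_group.values()))  # 36 34 2
for group, n in sorted(reduced.items()):
    print(group, n)
```

So the full design has 2 x 2 x 3 = 12 groups and 36 samples; after removing the two outliers, 34 samples remain and exactly two groups are down to 2 replicates.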
I've been looking on the internet for information on how many biological replicates I need at minimum using DESeq2. I did find some information ( https://assets.geneious.com/manual/10.2/GeneiousManualsu100.html ), but I want to be sure before continuing. Running DESeq2 does not prompt an error message. Removing the two outliers results in more significant genes.
Thank you for your advice and time.
Kind regards,
Jonas
Antwerp University - Belgium
Dear Sean,
Thank you for your quick response and for noting the possibility of removing biological variation by removing outliers based on a PCA.
To respond to your suggestion about the outliers: the effect of sampling zone in the leaf is quite dominant, and these two samples did not cluster with their zone. In two previous transcriptome studies (with a similar setup) and in the current one (the remaining 34 samples), all samples clustered cleanly by zone.
Concerning the number of replicates: an article ( https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4878611/ ) mentioned that at least 6 replicates should be used in RNA-seq analysis to find significantly differentially expressed (SDE) genes. We're sometimes limited in the number of replicates we can run during an experiment, as is the case here...
As we are quite sure that we are talking about outliers here, we would like to leave out these two samples. This also resulted in more SDE genes. I was afraid that the increase in SDE genes could not be trusted, since we only had 2 biological replicates in two of our sample groups...
Feel free to respond if you have any more suggestions, doubts or ideas.
Thanks again for your quick response and advice!
A small comment re: "at least six replicates should be used". It's a bit more subtle. Take a look at their Figure 1b, for edgeR. You'll find a similar curve for most RNA-seq tools. The x axis is the number of replicates, and the top curves show sensitivity. The 4 curves are for genes with |LFC| > 0, 0.3, 1, and 2. The far left represents two replicates, where you see ~75% sensitivity for genes with |LFC| > 1 and about 50% sensitivity for |LFC| > 0.3. This is not to say that 2 replicates (with low biological variability) are sufficient for all purposes, but rather that investigators should know they will only recover the genes with the largest effect sizes.
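As a rough illustration of why the curves have that shape, here is a small Python sketch of how power grows with replicate number and effect size. It uses a textbook normal approximation for a two-sided, two-sample comparison of a single gene. It is not what edgeR, DESeq2, or the paper compute (no dispersion estimation, no shrinkage, no multiple-testing correction), and the per-gene standard deviation `SD` on the log2 scale is an assumed value, so the numbers will not match Figure 1b.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test.

    effect: true difference in means (here: log2 fold change)
    sd: assumed per-group standard deviation on the same scale
    n_per_group: number of biological replicates in each group
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two group means.
    se = sd * sqrt(2 / n_per_group)
    shift = abs(effect) / se
    # Probability of landing in either rejection tail.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


SD = 0.5  # ASSUMPTION: illustrative biological variability, not from the paper

for lfc in (0.3, 1.0, 2.0):
    row = [round(approx_power(lfc, SD, n), 2) for n in (2, 3, 6, 12)]
    print(f"|LFC| = {lfc}: power at n = 2, 3, 6, 12 -> {row}")
```

The qualitative pattern is the point: for a fixed variability, large fold changes are detectable even with very few replicates, while small ones need many more.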
Regarding the sample size paper, see my replies to Nicholas Schurch and Conrad Burden here: https://f1000research.com/articles/5-1438
I agree with Michael that the true situation is more nuanced than the authors of that paper make out and the main issue is power rather than validity. In most of my work, insisting on 6 replicates per group would be an irresponsible waste of taxpayers' money.