I'm doing a DEG analysis based on RNA-seq data. I've got 2 experimental thesis and 3 biological replicates each. Attached is the corresponding MDS plot for the count data. I was wondering whether it is a good idea to exclude replicate "C1" from the downstream analysis, or whether edgeR is able to compensate for this 'outlier'?
Thanks in advance,
I don't know what an experimental thesis is; I assume you mean two experimental conditions.
Here, the increased variation in the "C" group will increase the dispersion estimate. However, with only three replicates, it is not clear whether C1 is an outlier or if the samples are truly that variable. If it is the latter, removal of C1 would lead you to understate the amount of variability in your experiment, increasing the false positive rate.
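To make the first point concrete, here is a toy calculation for a single gene. The counts are made up, the samples are assumed to have equal library sizes, and the estimator is a plain method-of-moments one, not edgeR's likelihood-based, empirical Bayes moderated estimate. It only illustrates the direction of the effect.

```python
# Toy illustration with made-up counts (NOT data from this thread).
# Method-of-moments negative binomial dispersion: var = mu + phi * mu^2.
# This is a simplification and not the estimator edgeR uses internally.
from statistics import mean, variance


def mom_dispersion(counts):
    """Return phi = (var - mu) / mu^2, floored at 0."""
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 denominator)
    return max((var - mu) / mu**2, 0.0)


# Hypothetical counts for one gene, assuming equal library sizes.
group_c = {"C1": 250, "C2": 100, "C3": 130}

phi_all = mom_dispersion(list(group_c.values()))
phi_without_c1 = mom_dispersion([group_c["C2"], group_c["C3"]])

print(f"dispersion with C1:    {phi_all:.3f} (BCV ~ {phi_all ** 0.5:.2f})")
print(f"dispersion without C1: {phi_without_c1:.3f} (BCV ~ {phi_without_c1 ** 0.5:.2f})")
```

With C1 included the estimate is much larger. Whether the larger or the smaller value is closer to the truth is exactly what three replicates cannot tell you.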
I would suggest that you try to figure out what is causing the behaviour of C1. Is it due to some technical issue, e.g., low numbers of reads? Is it due to a few key genes, e.g., a sex effect in X/Y-chromosome genes? Once you have this information, you will have a better idea of whether C1 should be removed.
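As a rough sketch of these two checks, the snippet below computes library sizes and then ranks genes by how far C1 sits from the average of C2 and C3 on a log2-CPM scale. The gene names, the counts and the prior count of 0.5 are all illustrative assumptions; with real data you would do the same on your own count matrix (for example with cpm() in edgeR).

```python
# Toy diagnostic sketch with made-up counts (NOT data from this thread).
import math

samples = ["C1", "C2", "C3"]

# Hypothetical raw counts: gene -> counts in the order of `samples`.
counts = {
    "geneA": [500, 480, 510],
    "geneB": [900, 120, 100],   # much higher in C1
    "geneC": [30, 35, 28],
    "geneD": [10, 200, 190],    # much lower in C1
    "geneE": [1000, 1020, 990],
}

# 1. Library sizes: an unusually small library hints at a technical problem.
lib_sizes = [sum(row[i] for row in counts.values()) for i in range(len(samples))]


def log_cpm(count, lib_size, prior=0.5):
    """log2 counts-per-million with a small prior count to avoid log(0)."""
    return math.log2((count + prior) / (lib_size + 1.0) * 1e6)


# 2. Per gene: log2-CPM of C1 minus the mean log2-CPM of C2 and C3.
c1_shift = {}
for gene, row in counts.items():
    lc = [log_cpm(c, n) for c, n in zip(row, lib_sizes)]
    c1_shift[gene] = lc[0] - (lc[1] + lc[2]) / 2

# Genes that push C1 away from C2/C3 the most, in either direction.
top_genes = sorted(c1_shift, key=lambda g: abs(c1_shift[g]), reverse=True)

print(dict(zip(samples, lib_sizes)))
for gene in top_genes[:3]:
    print(f"{gene}: {c1_shift[gene]:+.2f} log2-CPM relative to C2/C3")
```

If only a handful of genes dominate the top of this list, look at what they have in common (chromosome, pathway, known contaminants) before deciding anything about C1.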
If you are still unclear, I would suggest using limma and voomWithQualityWeights, which will automatically account for any increased variability in C1. That said, there is no substitute for understanding your data.
As you suggested, I've performed an exactTest on C1vsC2 and C1vsC3, using the BCV value from the full DEG analysis, i.e. the one in which I used the whole data set (3 replicates each for the "C" and the "T" condition).
After merging the DEGs from the C1vsC2 and C1vsC3 comparisons, 7% of the overall gene set showed up as DE. Of these, 16% matched DEGs found in the full DEG analysis mentioned above (all 6 samples).
Afterwards, I had a look at the tagwise dispersion values from the full DEG analysis (all 6 samples). 72% of the genes with a tagwise dispersion higher than the common dispersion matched DEGs identified by the exactTest between C1 and C2/C3. Moreover, all genes in the 16% that matched DEGs from the full DEG analysis have tagwise dispersions higher than the common dispersion.
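For anyone who wants to check this kind of bookkeeping, the percentages are plain set overlaps. Here is a minimal sketch with made-up gene IDs, dispersions and DEG lists (none of these values come from my data, and the variable names are only illustrative):

```python
# Toy example with made-up gene IDs and dispersions (NOT results from this thread).
common_dispersion = 0.10  # hypothetical common dispersion

# Hypothetical tagwise dispersions from the full analysis (one entry per gene).
tagwise = {"g1": 0.25, "g2": 0.05, "g3": 0.30, "g4": 0.12, "g5": 0.08,
           "g6": 0.40, "g7": 0.09, "g8": 0.15, "g9": 0.07, "g10": 0.06}

de_c1_vs_c2 = {"g1", "g3", "g9"}  # hypothetical DEGs, C1 vs C2
de_c1_vs_c3 = {"g3", "g6"}        # hypothetical DEGs, C1 vs C3
de_full = {"g1", "g2", "g5"}      # hypothetical DEGs, C vs T (all samples)

# Merge the two within-C comparisons into one DEG list.
de_within_c = de_c1_vs_c2 | de_c1_vs_c3

# Genes whose tagwise dispersion exceeds the common dispersion.
high_disp = {g for g, d in tagwise.items() if d > common_dispersion}


def pct(part, whole):
    """Percentage of `whole` that is in `part` (NaN if `whole` is empty)."""
    return 100.0 * len(part) / len(whole) if whole else float("nan")


frac_genes_de = pct(de_within_c, tagwise)                     # share of all genes
frac_matching_full = pct(de_within_c & de_full, de_within_c)  # share also DE in C vs T
frac_high_disp_de = pct(high_disp & de_within_c, high_disp)   # share of high-dispersion genes

print(f"DE between C1 and C2/C3: {frac_genes_de:.0f}% of genes")
print(f"of these, also DE in the full analysis: {frac_matching_full:.0f}%")
print(f"high-dispersion genes that are DE within C: {frac_high_disp_de:.0f}%")
```

Note that the three percentages use three different denominators (all genes, the merged C1 DEG list, and the high-dispersion genes), which is easy to mix up when reading the numbers.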
Thus, a relatively small portion of genes were found to be DE between C1 and C2/C3 (7%) and, of these, just 16% matched DEGs found in the full DEG analysis. Despite this, comparing the tagwise dispersion values to the common dispersion makes it quite clear that roughly 3/4 (72%) of the "bad" tagwise dispersion values (= higher than the common dispersion) belong to genes found to be DE by the exactTest analysis. However, how should a good tagwise dispersion threshold be chosen? I guess it is hard to say.
Overall, the only conclusion I'm able to draw is that a good portion of the overall variability is concentrated in the DEGs found in the exactTest analysis. At present, I believe going on with voomWithQualityWeights is the best choice. What do you think?
I've performed a gene ontology analysis on the DEGs found for the C1 replicate.
The biological functions of these DEGs are linked to the following roles: DNA modification (especially that involved in DNA repair), regulation of transcription/translation, cell division, protein catabolism, signal transduction, membrane transport, sexual reproduction, stress-related processes (oxidative stress responses) and macroautophagy.
I guess something happened in C1: perhaps the organism was under stress and reacted by triggering countermeasures to deal with it and, on the other hand, by boosting primary metabolism. The DEGs related to sexual reproduction would not be surprising, since stress conditions can stimulate it (we're talking about filamentous fungi).
Well. I've already verified that including the C1 replicate leads to a reduced set of significant DEGs. Conversely, removing C1 would mean giving up potentially interesting biological variation (and would also increase the chance of getting false positive hits).
That said, could you suggest any further steps?
Thanks in advance,