Hello there,
i'm doing a DEGs analysis based on RNA-seq data. I've got 2 experimental thesis and 3 biological replicates each. Here is attached the relative MDS plot for count data and i was wandering if it is a good idea to do not take into account replicate "C1" for downstream analysis or, however, edgeR is able to compensate to this 'outlier'?
Thanks in advance,
Antonio
I don't know what an experimental thesis is, I assume you mean two experimental conditions.
Here, the increased variation in the "C" group will increase the dispersion estimate. However, with only three replicates, it is not clear whether C1 is an outlier or if the samples are truly that variable. If it is the latter, removal of C1 would lead you to understate the amount of variability in your experiment, increasing the false positive rate.
I would suggest that you try to figure out what is causing the behaviour of C1. Is it due to some technical issue, e.g., low numbers of reads? Is it due to a few key genes, e.g., a sex effect in X/Y-chromosome genes? Once you have this information, you will have a better idea of whether C1 should be removed.
If you are still unclear, I would suggest using limma and voomWithQualityWeights, which will automatically account for any increased variability in C1. That said, there is no substitute for understanding your data.
Dear Aaron,
i've just run voomWithQualityWeights. The "counts" argument was a DGEList object and normalization factors were previously calculated by calcNormFactors(DGEList object). This is the output:
<<
An object of class "EList"
$targets
group lib.size norm.factors
C1 C 25314293 1.016
C2 C 25663606 1.025
C3 C 24770990 1.028
T1 T 25727243 0.985
T2 T 22554292 0.965
T3 T 21180083 0.983
See ?arrayWeights. Briefly, lower sample weights means that the observations in that sample are more variable. In your case, C1 has the lowest sample weight, consistent with its relative dissimilarity from the other samples. Knowing why, though, will require some more investigative work on your part.
Dear Aaron,
once assumed C1 data are more dispersed than other libraries i investigated raw counts per library. In particular, i've calculated the pairwise correlation among C1, C2, C3 in three distinct groups, that is: (i) differential expressed genes (ii) non-differential expressed genes and (iii) all genes set. Following are summarized the results:
correlation
just DEGs
just no-DEGs
all genes set
C1-C2
0,664
0,991
0,984
C1-C3
0,663
0,984
0,977
C2-C3
0,994
0,991
0,991
Now, i'm thinking the relative low correlation scores (highlighted in bold) mean that C1 raw counts in such genes (i.e. DEGs) have hijacked/impaired DEGs detection. What do you think about it?
The correlations don't tell you much. If you really want to troubleshoot, you'll have to look at the individual genes that are affected (i.e., those that have high variance within conditions, or are DE between C1 and C2/3).
If you still can't figure out the cause, then I would say that there is no reason to remove this sample. Just proceed with the DE analysis with voomWithQualityWeights, and apply caution to the interpretation of the results. This means that you should strongly encourage your collaborators to validate any interesting genes that they might find.
Aaron,
as you suggested, i've performed an exactTest on C1vsC2 and C1vsC3 (using the BCV value from the **DEGs analysis in which i took into account all the data set, that's 3 replicates for both the "C" and the "T" condition).
After merging together DEGs from C1vsC2 and C1vsC3 analysis, the 7% of the overall gene set showed up as DEGs. Of these, the 16% matched with DEGs found in the above mentioned DEGs analysis**.
Afterwards, i've had a look to tagwise dispersion values in the **DEGs analysis. The 72% of genes having a tagwise dispersion value higher than the common dispersion one matched with DEGs identified in the exactTest between C1 and C2/C3. Moreover, all genes belonging to that 16% matching with DEGs found in the **DEGs analysis have got tagwise dispersion values higher than the common dispersion ones.
Thus, a relative low portion of DEGs have been found between C1 and C2/C3 (7%) and, of these, just the 16% matched with DEGs found in the **DEGs analysis. Despite this, it is quite clear that, by comparing tagwise dispersion values to the common dispersion ones, roughly 3/4 (72%) of the "bad" tagwise dispersion values (=higher than the common dispersion value) belongs to genes found to be DE according to the exactTest analysis. However, how could a good tagwise dispersion threshold be assumed? I guess it is hard to say.
Overall, the only one conclusion that i'm able to draw is that a good portion of the entire variability is gathered in DEGs found in the exactTest analysis. At present, i believe going on with the voomWithQualityWeights is the best choice. What do you think about it?
If I were you, I wouldn't care about the numbers. I would focus on the identity of the variable genes. Are they sex-related? Immunoglobulin chains? The numbers you've generated don't tell you anything, because we already know there is extra variability in the C1 sample. The more important question is how is this variability being generated (i.e., biologically or experimentally). To answer this, you need to actually examine the genes themselves, i.e., their names, their functions, etc. For example, if you find a whole bunch of stress-related proteins, you might say that the C1 cells were not handled carefully.
Dear Aaron,
i've performed a gene ontology analysis on DEGs found in C1 replicate.
The biological functions of these DEGs are wired to the following roles: DNA modification (especially that involving in DNA repair), regulation of transcription/translation, cellular division, protein catabolism, signal transduction, membrane transportation, sexual reproduction, stress related processes (oxidative stress responses) and macroautophagy.
I guess something happened in C1, perhaps the organism was undergoing a stressing condition and reacted by prompting countermeasures in order to deal with it and, on the other hand, by potentiating primary metabolism. The DEGs related to sexual reproduction would not be surprising since stress conditions can stimulate it (we're talking about filamentous fungi).
Well. I've already verified that taking into account C1 replicate would lead to a reduce set of significant DEGs. Conversely, removing C1 would mean to give up to potentially interesting biological variation (but also increases the chances to get false positive hit).
This said, could you suggest me any further step?
Thanks in advance,
Antonio
If you want to run estimateCommonDisp, you need to do so on the DGEList object, not the output of voomWithQualityWeights. However, there is no need to estimate the dispersion on the output of voomWithQualityWeights, use the standard limma approach instead.
Dear Aaron,
i've just run voomWithQualityWeights. The "counts" argument was a
DGEList
object and normalization factors were previously calculated by calcNormFactors(DGEList
object). This is the output:<<
An object of class "EList"
$targets
group lib.size norm.factors
C1 C 25314293 1.016
C2 C 25663606 1.025
C3 C 24770990 1.028
T1 T 25727243 0.985
T2 T 22554292 0.965
T3 T 21180083 0.983
$E
C1 C2 C3 T1 T2 T3
TGAM01v2_00001 5.68 5.62 5.63 5.48 5.46 5.36
TGAM01v2_00002 6.76 6.64 6.82 6.80 6.87 6.79
TGAM01v2_00003 5.18 5.15 5.27 4.96 4.92 4.83
TGAM01v2_00004 4.16 4.25 4.21 4.42 4.40 4.32
TGAM01v2_00005 8.44 8.64 8.39 8.44 8.37 8.51
9011 more rows ...
$weights
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 71.0 80.1 139.1 182 142.3 90.6
[2,] 101.9 114.6 200.9 288 234.5 152.3
[3,] 58.5 66.1 114.4 142 109.8 69.8
[4,] 35.6 40.3 69.6 109 84.7 53.8
[5,] 112.2 125.5 223.1 311 267.2 178.2
9011 more rows ...
$design
(Intercept) groupT
C1 1 0
C2 1 0
C3 1 0
T1 1 1
T2 1 1
T3 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
$sample.weights
1 2 3 4 5 6
0.594 0.665 1.179 1.645 1.402 0.932
>>
How should i interpret sample.weights?
Best Regards
/Antonio
See
?arrayWeights
. Briefly, lower sample weights means that the observations in that sample are more variable. In your case, C1 has the lowest sample weight, consistent with its relative dissimilarity from the other samples. Knowing why, though, will require some more investigative work on your part.Dear Aaron,
once assumed C1 data are more dispersed than other libraries i investigated raw counts per library. In particular, i've calculated the pairwise correlation among C1, C2, C3 in three distinct groups, that is: (i) differential expressed genes (ii) non-differential expressed genes and (iii) all genes set. Following are summarized the results:
Now, i'm thinking the relative low correlation scores (highlighted in bold) mean that C1 raw counts in such genes (i.e. DEGs) have hijacked/impaired DEGs detection. What do you think about it?
The correlations don't tell you much. If you really want to troubleshoot, you'll have to look at the individual genes that are affected (i.e., those that have high variance within conditions, or are DE between C1 and C2/3).
If you still can't figure out the cause, then I would say that there is no reason to remove this sample. Just proceed with the DE analysis with
voomWithQualityWeights
, and apply caution to the interpretation of the results. This means that you should strongly encourage your collaborators to validate any interesting genes that they might find.