Question: edgeR - MDS Plot for Count Data
gravatar for baus87
4 weeks ago by
baus870 wrote:

Hello there, 
i'm doing a DEGs analysis based on RNA-seq data. I've got 2 experimental thesis and 3 biological replicates each. Here is attached the relative MDS plot for count data and i was wandering if it is a good idea to do not take into account replicate "C1" for downstream analysis or, however, edgeR is able to compensate to this 'outlier'?
Thanks in advance,

ADD COMMENTlink modified 23 days ago • written 4 weeks ago by baus870
gravatar for Aaron Lun
4 weeks ago by
Aaron Lun16k
Cambridge, United Kingdom
Aaron Lun16k wrote:

I don't know what an experimental thesis is, I assume you mean two experimental conditions.

Here, the increased variation in the "C" group will increase the dispersion estimate. However, with only three replicates, it is not clear whether C1 is an outlier or if the samples are truly that variable. If it is the latter, removal of C1 would lead you to understate the amount of variability in your experiment, increasing the false positive rate.

I would suggest that you try to figure out what is causing the behaviour of C1. Is it due to some technical issue, e.g., low numbers of reads? Is it due to a few key genes, e.g., a sex effect in X/Y-chromosome genes? Once you have this information, you will have a better idea of whether C1 should be removed.

If you are still unclear, I would suggest using limma and voomWithQualityWeights, which will automatically account for any increased variability in C1. That said, there is no substitute for understanding your data.

ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Aaron Lun16k

Dear Aaron,
i've just run voomWithQualityWeights. The "counts" argument was a DGEList object and normalization factors were previously calculated by calcNormFactors(DGEList object). This is the output:
An object of class "EList"

   group lib.size norm.factors
C1     C 25314293        1.016
C2     C 25663606        1.025
C3     C 24770990        1.028
T1     T 25727243        0.985
T2     T 22554292        0.965
T3     T 21180083        0.983

                                  C1   C2   C3   T1   T2   T3
TGAM01v2_00001 5.68 5.62 5.63 5.48 5.46 5.36
TGAM01v2_00002 6.76 6.64 6.82 6.80 6.87 6.79
TGAM01v2_00003 5.18 5.15 5.27 4.96 4.92 4.83
TGAM01v2_00004 4.16 4.25 4.21 4.42 4.40 4.32
TGAM01v2_00005 8.44 8.64 8.39 8.44 8.37 8.51
9011 more rows ...

         [,1]   [,2]    [,3]      [,4]  [,5]    [,6]
[1,]  71.0  80.1 139.1  182 142.3  90.6
[2,] 101.9 114.6 200.9  288 234.5 152.3
[3,]  58.5  66.1 114.4  142 109.8  69.8
[4,]  35.6  40.3  69.6  109  84.7  53.8
[5,] 112.2 125.5 223.1  311 267.2 178.2
9011 more rows ...

   (Intercept) groupT
C1           1      0
C2           1      0
C3           1      0
T1           1      1
T2           1      1
T3           1      1
[1] 0 1
[1] "contr.treatment"

    1        2         3         4       5       6 
0.594 0.665 1.179 1.645 1.402 0.932

How should i interpret sample.weights?

Best Regards

ADD REPLYlink written 4 weeks ago by baus870

See ?arrayWeights. Briefly, lower sample weights means that the observations in that sample are more variable. In your case, C1 has the lowest sample weight, consistent with its relative dissimilarity from the other samples. Knowing why, though, will require some more investigative work on your part.

ADD REPLYlink written 4 weeks ago by Aaron Lun16k

Dear Aaron,
once assumed C1 data are more dispersed than other libraries i investigated raw counts per library. In particular, i've calculated the pairwise correlation among C1, C2, C3 in three distinct groups, that is: (i) differential expressed genes (ii) non-differential expressed genes and (iii) all genes set. Following are summarized the results:

correlation just DEGs just no-DEGs all genes set
C1-C2 0,664 0,991 0,984
C1-C3 0,663 0,984 0,977
C2-C3 0,994 0,991 0,991

Now, i'm thinking the relative low correlation scores (highlighted in bold) mean that C1 raw counts in such genes (i.e. DEGs) have hijacked/impaired DEGs detection. What do you think about it? 

ADD REPLYlink written 4 weeks ago by baus870

The correlations don't tell you much. If you really want to troubleshoot, you'll have to look at the individual genes that are affected (i.e., those that have high variance within conditions, or are DE between C1 and C2/3).

If you still can't figure out the cause, then I would say that there is no reason to remove this sample. Just proceed with the DE analysis with voomWithQualityWeights, and apply caution to the interpretation of the results. This means that you should strongly encourage your collaborators to validate any interesting genes that they might find.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Aaron Lun16k
gravatar for baus87
4 weeks ago by
baus870 wrote:

as you suggested, i've performed an exactTest on C1vsC2 and C1vsC3 (using the BCV value from the **DEGs analysis in which i took into account all the data set, that's 3 replicates for both the "C" and the "T" condition).
After merging together DEGs from C1vsC2 and C1vsC3 analysis, the 7% of the overall gene set showed up as DEGs. Of these, the 16% matched with DEGs found in the above mentioned DEGs analysis**.
Afterwards, i've had a look to tagwise dispersion values in the **DEGs analysis. The 72% of genes having a tagwise dispersion value higher than the common dispersion one matched with DEGs identified in the exactTest between C1 and C2/C3. Moreover, all genes belonging to that 16% matching with DEGs found in the **DEGs analysis have got tagwise dispersion values higher than the common dispersion ones.
Thus, a relative low portion of DEGs have been found between C1 and C2/C3 (7%) and, of these, just the 16% matched with DEGs found in the **DEGs analysis. Despite this, it is quite clear that, by comparing tagwise dispersion values to the common dispersion ones, roughly 3/4 (72%) of the "bad" tagwise dispersion values (=higher than the common dispersion value) belongs to genes found to be DE according to the exactTest analysis. However, how could a good tagwise dispersion threshold be assumed? I guess it is hard to say.
Overall, the only one conclusion that i'm able to draw is that a good portion of the entire variability is gathered in DEGs found in the exactTest analysis. At present, i believe going on with the voomWithQualityWeights is the best choice. What do you think about it?

Best regards,

ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by baus870

If I were you, I wouldn't care about the numbers. I would focus on the identity of the variable genes. Are they sex-related? Immunoglobulin chains? The numbers you've generated don't tell you anything, because we already know there is extra variability in the C1 sample. The more important question is how is this variability being generated (i.e., biologically or experimentally). To answer this, you need to actually examine the genes themselves, i.e., their names, their functions, etc. For example, if you find a whole bunch of stress-related proteins, you might say that the C1 cells were not handled carefully.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Aaron Lun16k
gravatar for baus87
23 days ago by
baus870 wrote:

Dear Aaron,
i've performed a gene ontology analysis on DEGs found in C1 replicate.
The biological functions of these DEGs are wired to the following roles: DNA modification (especially that involving in DNA repair), regulation of transcription/translation, cellular division, protein catabolism, signal transduction, membrane transportation, sexual reproduction, stress related processes (oxidative stress responses) and macroautophagy. 
I guess something happened in C1, perhaps the organism was undergoing a stressing condition and reacted by prompting countermeasures in order to deal with it and, on the other hand, by potentiating primary metabolism. The DEGs related to sexual reproduction would not be surprising since stress conditions can stimulate it (we're talking about filamentous fungi).
Well. I've already verified that taking into account C1 replicate would lead to a reduce set of significant DEGs. Conversely, removing C1 would mean to give up to potentially interesting biological variation (but also increases the chances to get false positive hit).
This said, could you suggest me any further step?
Thanks in advance,

ADD COMMENTlink written 23 days ago by baus870

I would go with voomWithQualityWeights followed by a limma analysis.

P.S. Respond to comments with "add reply" rather than adding an answer, unless you're answering your own question.

ADD REPLYlink modified 23 days ago • written 23 days ago by Aaron Lun16k

Dear Aaron,
how to deal with negative logCPM after performing voomWithQualityWeights in order to calculate dispersion?

ADD REPLYlink written 15 days ago by baus870

There is no problem with what you have described.

ADD REPLYlink written 15 days ago by Aaron Lun16k

This is what came up after performing voomWithQualityWeights and estimateCommonDisp:

> edgeRobject <- voomWithQualityWeights(edgeRobject, design=NULL, lib.size=NULL, normalize.method="none",plot=FALSE, span=0.5,, method="genebygene", maxiter=50, tol=1e-10, trace=FALSE, col=NULL)

> edgeRobject  <- estimateCommonDisp( edgeRobject  )
Error in .isAllZero(y) : negative counts not supported

ADD REPLYlink written 15 days ago by baus870

If you want to run estimateCommonDisp, you need to do so on the DGEList object, not the output of voomWithQualityWeights. However, there is no need to estimate the dispersion on the output of voomWithQualityWeights, use the standard limma approach instead.

ADD REPLYlink modified 15 days ago • written 15 days ago by Aaron Lun16k

it is necessary to do low counts filtering before voomWithQualityWeights?

ADD REPLYlink written 8 days ago by baus870

Yes, you need to filter out low-abundance genes. Otherwise the mean-variance trend looks funny.

ADD REPLYlink modified 8 days ago • written 8 days ago by Aaron Lun16k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 265 users visited in the last hour