Question: Building a biologically meaningful design matrix for limma/voom in a large RNASeq experiment
1
10 months ago by
S.Bamopoulos10
Germany
S.Bamopoulos10 wrote:

Hello,

I have a question regarding differential expression (DE) with limma/voom in an RNA-Seq experiment of ~250 samples. I already posted this question on Biostars and it was suggested that I also post here.

I have 4 different point mutations that are mutually exclusive and I wish to identify DE isoforms specific for each mutation. Additionally, I have reason to believe that the mutations cause similar differential expression for some isoforms and I also wish to identify those.

My questions are:

• What is the best approach to identify the differences and the similarities of those mutations?
• How do I adress the issue of confounding?

Specifically, I have 3, 4, 6 and 17 samples for each mutation respectively, as well as ~200 samples as a control group.

The approach I have come up with so far is to:

1) perform a DE analysis for each mutation comparing it to the control group (which does not include the other mutations), like this:

design.matrix <- model.matrix(~ factor(mut1), data)

design.matrix <- model.matrix(~ factor(mut2), data)... and so on

2) after doing this for each of the 4 mutations, combine them into a single binary variable and perform a DE analysis comparing all of them against the control group:

design.matrix <- model.matrix(~ factor(all.mutations), data

My way of thinking is that since the mutations are mutually exclusive, I can compare each mutation against the control group (which does not include the other mutations) in order to identify DE isoforms specific to each mutation. Afterwards by combing them, I hope to highlight isoforms that are similarly DE for all mutations. If I check the overlap of the last analysis with the preceding 4 I should be able to identify at least some isoforms that are affected in a common way. Is it maybe sufficient to just compare the overlap of the first 4 analyses without the 2nd step?

As to the 2nd question. Is there a rule of thumb as to how many confounders I can add in a limma analysis while avoiding overfitting? Since in a differential expression analysis we don't really have "events" I am unsure how to determine the number of confounders I can adjust for. Especially for the mutation with the smallest subset (only 3 mutations) I am unsure if the relatively large control group of ~200 samples permitts me to adjust for multiple confounders.

Any input is welcome, thanks in advance!

Stefan

modified 10 months ago by Aaron Lun23k • written 10 months ago by S.Bamopoulos10
Answer: Building a biologically meaningful design matrix for limma/voom in a large RNASe
1
10 months ago by
Aaron Lun23k
Cambridge, United Kingdom
Aaron Lun23k wrote:

For your first question: your suggested approach is inefficient from both a statistical and computational perspective. Statistically inefficient as it does not use information across the entire dataset to model variability; computationally inefficient because you're repeating the entire analysis for each comparison. Rather, you should do something like:

all.mutations <- rep(c("A", "B", "C", "D", "con"), c(3, 4, 6, 17, 200))
all.mutations <- relevel(factor(all.mutations), "con") # set control as baseline
design <- model.matrix(~all.mutations)

... and fit one model for the entire dataset. Each non-intercept coefficient here will represent the log-fold change of a mutation against control, so it's easy to determine the effect of each mutation relative to the control. You can also quantify the differences between mutations by directly testing for DE between the corresponding groups.

Identification of genes that respond "similarly" across mutations is harder, depending on how you define "similar". If you only care about the fact that they are DE (in the same direction), then you can just intersect the DE lists from the relevant mutation-control comparisons. If you want also same magnitude of log-fold change across mutations... this is difficult as you're asking for genes where the null hypothesis (of no difference between mutations) is true. Standard hypothesis tests aren't built to answer that question. The best you can do is to check that the genes are not significant in the mutation-mutation comparisons.

For your second question: I assume you're referring to blocking factors rather than confounding factors, as your model becomes overfitted by definition once you have one factor that is confounded with the mutation status. As long as you have enough residual d.f. for variance estimation, limma will still work in terms of controlling type I error. Of course, unnecessary increases to the complexity of the model will also reduce power as the estimates of the model parameters become more uncertain.

I would not worry about the number of additional factors, but consider their impact on the downstream analysis. This involves some thinking about the nature of the biological system, and how certain factors might affect the differential expression results. For example, I always block on processing batch in my DE analyses, as I know that batch effects exist in my datasets. Yes, this means you'll have to think about the experiment (the horror!), which is always difficult - but blindly using automated model selection is arguably worse, see some comments here. Unfortunately, the only way to make analysis easy is to have a good experimental design in the first place.

Hello Aaron,

thank you for your answer, it is very detailed and insightful. I would greatly appreciate it, if you could clarify at least some of the points below:

1) I certainly understand why your suggestion of modelling the mutations as a factor is computationally efficient, but I am unsure as to why it is statistically efficient. From my understanding if I compare e.g. mutation A to the control I do not use information from samples harboring other mutations (mutation B, mutation C etc.) whether I include them in the model or not. Are you suggesting that limma-voom models variability using the entire dataset independent of the fact that not all samples are included in each contrast of interest?

2) Regarding confounding: The samples I have, stem from a study cohort of patients with leukaemia. I have run an exploratory correlation analysis with the purpose of identifying all mutations and other parameters associating with the mutations of interest. I was thinking of including all significant associations as co-factors (I believe these are then blocking factors?) in the linear models, at least those which I expect to significantly affect gene expression. I was thinking of running a DE analysis for each mutation separately, since not all mutations have the same confounders and I wanted to keep the complexity of each model at a minimum, in order to maximize the statistical power of the model. Does this make sense, or should I heed your advice of putting all mutations in a single model?

3) For now I am only intersted in genes that are DE in the same direction (In fact I am not only looking at genes but also isoforms, exons etc). My main concern is the following: 2 of the mutations I am looking at are point-mutations in the same hotspot, but cause a different change in the amino acid sequence of the protein. I have justified reasons to believe that these mutations cause a dysfunction of the protein that is similar for both mutations, but also cause an individual change of function that I believe is specific to each mutation. What complicates the analysis even further is that all of the 4 mutations I am looking at are located in genes involved in the same biological process (splicing) and since they lead to the same disease one can argue that they may have similar effects on the expression of some genes/isoforms. Since the groups are relatively small for each individual mutation (e.g. only 4 patients have the exact same mutation), but in total 25 patients have a mutation in the same gene, I wish to make use of that fact. That is why I suggested the second step: Putting all mutations (at least the mutations in the same gene) in the same variable and comparing them against the control group, so that I can exploit the large sample size and detect differential expression that is shared across the mutations.

I am sorry for giving you the tedious task of reading through my explanations, but I am finding it difficult to translate the biological question at hand into a design matrix, that will both make use of the large sample size and answer my question as specifically as possible.

I am extremely grateful for any response

Stefan

1) Yes. limma estimates the gene-specific variance using information from all samples in all groups, even if not all groups participate in the pairwise comparisons. This is necessary in 'omics studies with limited replication, and outweighs any supposed disadvantage of differences in variances between groups; see Gordon's comment at Correct assumptions of using limma moderated t-test.

2) There are a number of issues here. The first is how you're identifying your factors of interest. You are selecting factors that are significantly associated with your mutations - these are by definition (partially) confounding factors, which will reduce your power to detect the mutation effect! How can you be sure that the variation in these factors is not, in fact, caused by the mutations themselves? If you are sure of this, then it stands to reason that you should have had enough prior information to select said factor regardless of whether it was significantly associated or not.

The second issue is your motivation for running the DE analyses separately. I would find it hard to believe that the residual d.f. in the full model (which determines the stability of the variance estimate, and thus power, all else being equal) is less than the residual d.f. in any of your mutation-specific models.

3) No, that doesn't sound like a good idea. Instead of improving power, you may actually inflate the variance because differences between mutations are not properly modelled. There are also problems with correlations between residuals of observations from the same group, which could result in anticonservativeness and more false positives. And even if all these problems are disregarded; what happens if you have a positive log-fold change in your analysis? You might interpret this as a consistent effect across mutations, but for all you know, it could happen if three mutations go up and one mutation goes down. Much better to model the groups separately, and perform the following comparison:

makeContrasts((A + B + C + D)/4 - con, levels=design)

This combines information across all groups to perform the DE testing. In addition, you report the log-fold change for each mutation against the control, allowing you to verify that each mutation is behaving in a consistent manner.

Thanks a lot for the quick reply, it has improved my understanding of limma quite a bit. Just a quick follow-up:

1) I have some samples that don't have mutations in gene A, B or C but have other mutations with unknown effects and I don't want to use them as controls. What is the best way to include them in the factor variable for variance estimation? Put them in the factor variable as a separate group e.g. "others" or give NA values to the factor for those samples? (Probably option A?)

2) Regarding confounders: I am pretty sure that the variation in the confounders is not caused by the mutations of interest since they are mostly other mutations (in a binary form) and I can safely assume that the mutations I am looking at do not induce the confounding mutations. Same goes for the other factors I have, which mainly include age and sex. Of course you are right: Ideally, I would include in the linear model all mutations available, but as there are over 50 this will just leave me with an empty data frame to look at. I was hoping that the mutations, that do not correlate (positively or negatively) with my mutations of interest will kind of "even out" across the groups thus not making it necessary to adjust for them in the linear model. Still it is worth a thought if it makes sense to at least model for the most common ones. I am aware of the fact that by adding partially confounding factors I decrease the power to detect the mutation effect, but I want to treat the remaining results as a "purified list", which can be used for validation purposes in the lab or further downstream analysis.

3) I was not aware that this was possible, what you are suggesting certainly seems like the way to go. My initial thinking was to cluster the mutations based on the results and see if they can be clustered effectively, but of course quantifying the logFC is more informative. Still this way if e.g. mutation C has the opposite effect compared to the other 3 mutations, will it not decrease the power of detecting an effect for the other 3?

Thank you again, your input is very much appreciated!

Stefan

1) If you don't know what the samples are, then don't use them. You can't be sure that the unknown mutations are all in the same group, so you can't treat them all as a separate group - otherwise the variance will be inflated if the (unknown) true groups are different. And if the factor values are NA, model.matrix will quietly discard these samples for you anyway.

2) Even if an alternative mutation is uncorrelated with your mutation of interest (e.g., orthogonal), it may still be necessary to block on it if it contributes considerably to the variance. Whether there are significant correlations (or not) between the alternative mutations and your mutations of interest is largely irrelevant to the decision regarding whether to include the former in the model.

If you have too many additional mutations, you could consider doing principal components regression, i.e., PCA on the mutation profiles and take the first 5-10 PCs as blocking factors. This provides a way to capture the major trends in the profile of additional mutations when there is no way to determine which ones are important a priori.

3) Yes, it will decrease the power of detecting an overall effect. That's a good thing - why would you want to have more power to detect inconsistent effects between mutations?

Thank you for taking the time to answer all my questions. I will take your suggestions into account and think about the best way to approach this issue

Cheers

Stefan