Question

Defining the design for a complicated multi-level experiment in Limma

0

Entering edit mode

Colari19 ▴ 60

@colari19-22204

Last seen 2.7 years ago

United Kingdom

Hello,

I've carried out a number of differential expression analyses using the Limma-Voom pipeline for a rather complicated data set. I'd just like to clarify that my approach detailed below is appropriate.

The design of the experiment is very much like those described in sections 9.6 and 9.7 of the Limma manual (time-course experiments and multi-level experiments).

Our data set comprises of RNA-Seq data from patients undertaking treatment with one of two drugs (drug A and drug B). For each subject in our data set, samples were collected at three time points from two different tissue types on the body. Samples from diseased tissue and healthy tissue were collected at week 0 (before commencing treatment), week 2 (two weeks after commencing treatment), and week 10 (ten weeks after commencing treatment).

In addition, each patient was genotyped for an allele of interest (we'll call it Allele.X), and designated as Allele.X-positive (1 or 2 copies of allele) or Allele.X-negative (0 copies of allele).

Therefore, a snippet of the data set looks something like this:

Subject      Sample     Drug    Time    Tissue      Allele.X
Subject.01  Sample.01   Drug.A  Week.0  Diseased    Positive
Subject.01  Sample.02   Drug.A  Week.2  Diseased    Positive
Subject.01  Sample.03   Drug.A  Week.10 Diseased    Positive
Subject.01  Sample.04   Drug.A  Week.0  Healthy     Positive
Subject.01  Sample.05   Drug.A  Week.2  Healthy     Positive
Subject.01  Sample.06   Drug.A  Week.10 Healthy     Positive
Subject.02  Sample.07   Drug.B  Week.0  Diseased    Negative
Subject.02  Sample.08   Drug.B  Week.2  Diseased    Negative
Subject.02  Sample.09   Drug.B  Week.10 Diseased    Negative
Subject.02  Sample.10   Drug.B  Week.0  Healthy     Negative
Subject.02  Sample.11   Drug.B  Week.2  Healthy     Negative
Subject.02  Sample.12   Drug.B  Week.10 Healthy     Negative

In total, 35 patients were treated with drug A and 31 were treated with drug B.

In the drug A group, 20 were Allele.X-negative and 15 were were Allele.X-positive. In the drug B group, 18 were Allele.X-negative and 13 were Allele.X-positive.

I would like to identify genes that change in expression over the course of treatment in both tissue types in both drug groups, irrespective of Allele.X status, i.e: week 2 and week 10 compared to week 0 in ALL of the drug A group and ALL of the drug B group.

I would also like to do the above comparisons, but also taking into account Allele.X status.

My approach do this has been to analyse each drug group separately, so when looking at drug A for example, this is done by subsetting the clinical data and expression data...

drug = "Drug.A"
clinical.data <- clinical.data[clinical.data$Drug == "Drug.A",]
exp.data <- exp.data[, clinical.data$Sample]

...and creating an interaction variable that combines tissue, time, and Allele.X status...

clinical.data$Tissue.Time.Allele.X = paste(clinical.data$Tissue, clinical.data$Time, clinical.data$Allele.X, sep = ".")

...and an interaction variable that combines subject and tissue...

clinical.data$Subject.Tissue = paste(clinical.data$Subject, clinical.data$Tissue, sep = ".")

The Tissue.Time.Allele.X variable is used to define the design...

limma_design = model.matrix(~ 0 + Tissue.Time.Allele.X, data = keep_samp)
voomed_counts = voom(edata, limma_design)

...and the Subject.Tissue variable is used to for sample blocking...

# Block correlation
my_cor = duplicateCorrelation(voomed_counts,  limma_design,  block = clinical.data$Subject.Tissue)
fit = lmFit(voomed_counts, limma_design, correlation = my_cor$cor, block = clinical.data$Subject.Tissue)

The makeContrasts function is used to extract the relevant contrasts...

# Extract contrasts of interest
my_contrasts <- makeContrasts(

Diseased_week2_vs_week0_All =
  ((15 * Diseased.Week.2.Positive + 20 * Diseased.Week.2.Negative) / 35) -
  ((15 * Diseased.Week.0.Positive + 20 * Diseased.Week.0.Negative) / 35),

Diseased_week10_vs_week0_All =
  ((15 * Diseased.Week.10.Positive + 20 * Diseased.Week.10.Negative) / 35) -
  ((15 * Diseased.Week.0.Positive + 20 * Diseased.Week.0.Negative) / 35),

Healthy_week2_vs_week0_All =
  ((15 * Diseased.Week.2.Positive + 20 * Diseased.Week.2.Negative) / 35) -
  ((15 * Diseased.Week.0.Positive + 20 * Diseased.Week.0.Negative) / 35),

Healthy_week10_vs_week0_All =
  ((15 * Healthy.Week.10.Positive + 20 * Healthy.Week.10.Negative) / 35) -
  ((15 * Healthy.Week.0.Positive + 20 * Healthy.Week.0.Negative) / 35),

Diseased_week2_vs_week0_Pos = Diseased.Week.2.Positive - Diseased.Week.0.Positive,
Diseased_week2_vs_week0_Neg = Diseased.Week.2.Negative - Diseased.Week.0.Negative,
Diseased_week10_vs_week0_Pos = Diseased.Week.10.Positive - Diseased.Week.10.Positive,
Diseased_week10_vs_week0_Neg = Diseased.Week.10.Negative - Diseased.Week.10.Negative,

Healthy_week2_vs_week0_Pos = Healthy.Week.2.Positive - Healthy.Week.0.Positive,
Healthy_week2_vs_week0_Neg = Healthy.Week.2.Negative - Healthy.Week.0.Negative,
Healthy_week10_vs_week0_Pos = Healthy.Week.10.Positive - Healthy.Week.10.Positive,
Healthy_week10_vs_week0_Neg = Healthy.Week.10.Negative - Healthy.Week.10.Negative

levels = limma_design
)

# Compute contrasts 
fit <- contrasts.fit(fit, my_contrasts)
fit <- eBayes(fit)

Hopefully it is clear that for the contrasts in all of the patients within the drug group, the average expression (weighted by sample number) of the Allele.X-positive and negative patients is being contrasted.

So my main question - is the above approach valid?

Thank you! And sorry for long post, but hopefully it is clear what I'm getting at.

limma RNASeq R voom • 943 views

ADD COMMENT • link updated 4.5 years ago by James W. MacDonald 65k • written 4.5 years ago by Colari19 ▴ 60

score 2 · Accepted Answer · 2019-10-24

2

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 day ago

United States

The short answer is no. As an example, let's say you have a group of children, 20 of whom are boys and 15 of whom are girls. And you want to know if the average height of the children differs by sex. The conventional way to do that would be a t-test, where the numerator is the difference between the average height of the boys and the average height of the girls.

What you are doing is taking the average height of the boys and multiplying that by 20, then taking the average height of the girls and multiplying by 15 (thereby effectively giving a 33% bump in height of the boys as compared to the girls). Dividing that difference by any number whatsoever doesn't change the fact that you have arbitrarily increased the mean height of the boys. There are reasons to worry about unbalanced group sizes, but not really in this context, and what you are doing isn't fixing anything.

You are also doing something that is probably inadvisable with your combined coefficients. If you have complete observations, it is probably better/easier to just block on subject (see the edgeR user's guide, particularly section 3.5 for an example of how to set up that sort of design matrix). But regardless I don't think you really want to be estimating correlations that are at the subject/tissue level, rather than at the subject level. Aaron Lun or Gordon Smyth would know better than me, and may be along with more helpful advice later.

ADD COMMENT • link 4.5 years ago James W. MacDonald 65k

0

Entering edit mode

Thank you for your answer.

My reasoning for averaging the expression for the contrasts in all of the patients comes from this post... https://support.bioconductor.org/p/69764/#69765 ...but I may have misinterpreted it.

And regarding blocking at the subject/tissue level - the two different tissue types from each subject are so different in expression that I though it would make sense to effectively treat them as different subjects.

ADD REPLY • link 4.5 years ago Colari19 ▴ 60

1

Entering edit mode

The post you reference shows that you can do an unweighted average. There is no reason to do a weighted average in this context. Put a different way, the post you link to is showing how to compute the contrast of one group versus the average of two other groups, not how to adjust for perceived accuracy of one group versus the other. The average is very efficient, and there's no reason to think that an average based on 20 subjects is materially more accurate than one based on 15.

The idea behind a mixed model is that the observations within a particular group are correlated and you want to account for that correlation structure in the model (statistical tests usually assume that all the observations are independent of each other). Even if the two tissues have very different expression levels, the correlation between all the tissues within a subject are likely to be high, and that's what you want to control for. And as I already mentioned, if you have complete observations for all subjects (you measure each tissue the same number of times for each subject), it's probably better to simply block on subject.

ADD REPLY • link 4.5 years ago James W. MacDonald 65k

0

Entering edit mode

Thank you, I think I understand. I was thinking that because my subgroups are unequal in size (unlike in the post I linked to) the average should be weighted.

Sorry to keep dragging you back here, but I have one final query... do you think there is any benefit to contrasting the different time points in all the subjects, and in the alleleX-positive and negative patients separately, in the same model? I.e. rather than all this averaging expression business, I should just have one model that contrasts using a Tissue.Time interaction variable (a model that doesn't consider alleleX), and another model that contrasts using a Tissue.Time.AlleleX variable (a model which does consider alleleX positive and negative subjects separately) ?

ADD REPLY • link 4.5 years ago Colari19 ▴ 60

1

Entering edit mode

That's something you need to decide as the analyst. It may be that the allele affects very few, if any genes, in which case it's burning degrees of freedom to include it. Or maybe the allele is the main hypothesis?

You can always make the most specific coefficient and then average to get at less specific questions. In other words you could include the allele and then average time/treatment combinations that have different alleles to do the same thing as your first model.

ADD REPLY • link 4.5 years ago James W. MacDonald 65k