Question

Limma-voom design for multifactorial RNA-seq experiment

0

Entering edit mode

oliver.selmoni • 0

@oliverselmoni-9213

Last seen 9.4 years ago

European Union

Hi All,

I am writing you because I need help in designing the design matrix for analyzing the results of my RNA-seq experiment.

I have sequenced 57 samples. Each sample is described by 6 factors ( Timepoint, Sex, Treatment, Family, sequencing Lane and Run of extraction).

I am struggling to find an effective way to write a matrix design for such a high number of factor. My first idea was to create a design matrix as follows:

f <- paste(data$TimePoint, data$Sex, data$Treatment, data$Sibgroup, data$Run, data$Lane, sep=".")
f <- as.factor(f)
design <- model.matrix(~f)
colnames(design) <- levels(f)

I wanted to create it this way because it would have made simple to make the contrast matrix (for ex. comparing male vs females would have required just to make the sum of the combination of factor including "male" and substract it to the sum for the combination of factors including the "female" term).

This doesn't work, because the observed combination of the 6 factors are 57, which means no sample share the exact same combination of factors and voom needs replication of at least one combination to run.

Can you suggest me a way to write the design and make the constrast matrix (for ex. to compare male vs females).

Thank you in advance,

let me know if you need more informations,

best regards

Oliver

limma voom multiple factor design limma voom • 4.0k views

ADD COMMENT • link 9.3 years ago oliver.selmoni • 0

score 1 · Answer 1 · 2015-12-09

First, check on a MDS plot and see if your samples are segregating according to your factors. I can't imagine the sequencing lane and extraction run are biologically interesting parameters, so if they're not causing major differences in the MDS plot, then you might as well ignore them.

Of the remaining factors, you can stitch them together in an additive model:

design <- model.matrix(~ Sex + Treatment + Sibgroup + ..., data)

Whether this is sensible or not depends on how the factors interact with each other. The pure additive model above assumes that every factor behaves independently; however, you can imagine, e.g., that the response to treatment might vary between sexes. In that case, you might prefer to do something like:

combined <- paste0(data$Sex, ".", data$Treatment)
design <- model.matrix(~ combined + Sibgroup + ..., data)

... in order to account for the interaction between the two factors. The same argument applies with any combination of factors, so you'll need to make some decisions about what to put in combined and what to leave as additive terms.

Also, if TimePoint is a continuous covariate, that will need to be handled differently. Just putting TimePoint as an additive factor assumes that it has a linear effect on gene (log-)expression for all samples. If you do something like:

design <- model.matrix(~ Sex + Sex:TimePoint + ..., data)

... you can get coefficients for sex-specific effects of time (i.e., the effect of time for each sex is different). You can also do this with splines to account for non-linear responses; check out the ns function and examples in the limma user's guide.

P.S. You should tag limma and voom separately, otherwise people who follow either tag won't see this question.

score 0 · Answer 2 · 2015-12-10

Hi Aaron, thanks for your answer.

Sequencing Lane and extraction Run seem to explain part of the variance on the MDS plot, this is why I want to include them in the model. Time point is a discrete factor with 3 possible values.

Ok so let's say I am writing the model as follows:

combined <- paste0(data$TimePoint, ".", data$Sex, ".", data$Treatment)
design <- model.matrix(~ combined + data$Sibgroup + data$Lane + data$Run)

The colnames of the design matrix are these:

> colnames(design)
 [1] "tp1.Female.control" "tp1.Female.EE2"     "tp1.Male.control"   "tp1.Male.EE2"       "tp2.Female.control" "tp2.Female.EE2"    
 [7] "tp2.Male.control"   "tp2.Male.EE2"       "tp3.Female.control" "tp3.Female.EE2"     "tp3.Male.control"   "tp3.Male.EE2"      
[13] "data$Sibgroup94"    "data$Sibgroup95"    "data$Sibgroup96"    "data$Sibgroup97"    "data$Runr2"         "data$Runr3"        
[19] "data$LaneB"         "data$LaneC"         "data$LaneD"         "data$LaneE"         "data$LaneF"         "data$LaneG"        
[25] "data$LaneH"         "data$LaneI"         "data$LaneL"

Now, the combined factors have 12 possible combinations, the sibgroup has 5 levels, the Lane 10 and the Run 3. As you can see the design matrix doesn't explicitly show the first levels for the 3 "single" factors (sibgroup, lane and run). My question now is how should write the contrast for, for example: tp1.Female.control vs tp1. Male.control?

Thank you in advance,

Oliver

score 0 · Answer 3 · 2015-12-10

0

Entering edit mode

oliver.selmoni • 0

@oliverselmoni-9213

Last seen 9.4 years ago

European Union

Ok thanks for your answer. Now let's say that I want to include also the sibgroup inside the combined group. I write the design as:

combined <- paste0(data$TimePoint, ".", data$Sex, ".", data$Treatment, ".", data$Sibgroup)
design <- model.matrix(~ combined + data$Lane + data$Run)

Which returns me this error while I am running the voom command:

Coefficients not estimable: Runr2 Runr3 LaneC LaneD LaneE LaneF LaneH LaneI
Error in approxfun(l, rule = 2) :
  need at least two non-NA values to interpolate
Inoltre: Warning message:
Partial NA coefficients for 35348 probe(s)
Called from: top level

I don't see why for these factors (for which there is replication on the values that they can assume) it is impossible to calculate the coefficients.

Another thing: let's say I don't want to check any interaction and I write a model as:

design <- model.matrix(~0+TimePoint+Sex+Treatment+Run+Lane, data)

The design matrix will look like:

> colnames(design)
 [1] "TimePointtp1" "TimePointtp2" "TimePointtp3" "SexMale"      "TreatmentEE2" "Runr2"        "Runr3"        "LaneB"       
 [9] "LaneC"        "LaneD"        "LaneE"        "LaneF"        "LaneG"        "LaneH"        "LaneI"        "LaneL"

In this case, how should I write the contrast to compare male vs female?

ADD COMMENT • link 9.3 years ago oliver.selmoni • 0

0

Entering edit mode

Don't reply to comments (or answers) with another answer; the ordering of posts changes according to the number of votes for each post, so if someone upvotes your reply, it will no longer be positioned after my comment. And then the entire conversation won't make any sense to an outside reader.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

For your first question; some lanes or runs are nested within the combinations of combined. For example, if all treated males were sequenced on lane C, then you wouldn't be able to distinguish between the confounding effect of lane C and the biological effect of treatment in males. This manifests as an inability to estimate the coefficients; for any given expression of treated males, it could be driven by a small lane C effect + a large male treatment effect, or a large lane C effect + a small male treatment effect.

In this case, you won't be able to block explicitly on Lane in the design matrix. Instead, you'll have to use duplicateCorrelation (e.g., by pasting the Run and Lane terms together and using that as the block argument). This will allow you to compare between runs/lanes, while accounting for the correlations between other samples in the same run/lane. Also, in the future, you should arrange your sequencing so that you don't end up with different conditions in different runs or lanes.

As for your second question; to test for the sex effect, you can just do coef=4 or set contrast to:

con <- makeContrasts(SexMale, levels=design)

This will test if the SexMale coefficient is equal to zero, and will give you DE genes between male and female samples.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks for your response. I am trying to use the duplicateCorrelation but it returns me only NAs. Here is how I did it:

First I wrote the design matrix as follows and ran voom.

f <- paste(data$TimePoint,data$Sex,data$Treatment,data$Sibgroup, sep=".") 
f <- as.factor(f) 
design <- model.matrix(~0+f, data) 
colnames(design) <- levels(f)
v <- voom(dge, design)

Then I created the technical block by pasting Run and Lane

techblock <- paste0(data$Run,".",data$Lane)

And finally I launch the duplicateCorrelation using v, design and techblock as arguments.

corfit <- duplicateCorrelation(v, design, block=techblock)

This results in a corfit object which contains only NAs. Do you have any idea what could be the problem?

Thanks again.

ADD REPLY • link 9.3 years ago oliver.selmoni • 0

0

Entering edit mode

It's hard to say. Have a look at the relationship between techblock and f. If they share a lot of blocks (e.g., samples with one level of f have the same level of techblock, and vice versa), then that might be causing problems.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Well if I make a table of the levels of F vs the levels of techblock I obtain a 54vs26 matrix, which means that we have ~ 2 levels of F per each level of techblock. Are there any other alternatives to the duplicateCorrelation?

ADD REPLY • link 9.3 years ago oliver.selmoni • 0

0

Entering edit mode

Are the levels of f unique to each level of techblock? You have 54 levels of f for 57 samples, so I suspect that there will be nesting of f within techblock. This is likely to cause problems as there will not be enough information left to estimate the correlations, if nearly every sample has its own f term in the design matrix. Your best option is to take out some factors in combined, and block on them (either in the design matrix, or as part of techblock). SibGroup would probably be a good candidate, it doesn't sound biologically interesting.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Ok. Now everything is working. Just one last question. I have written the design with TimePoint, Sex and Treatment in combination and the Sibgroup as individual factor. My design matrix looks like this:

> colnames(design)
 [1] "ftp1.Female.control" "ftp1.Female.EE2"     "ftp1.Male.control"   "ftp1.Male.EE2"       "ftp2.Female.control" "ftp2.Female.EE2"    
 [7] "ftp2.Male.control"   "ftp2.Male.EE2"       "ftp3.Female.control" "ftp3.Female.EE2"     "ftp3.Male.control"   "ftp3.Male.EE2"      
[13] "Sibgroup94"          "Sibgroup95"          "Sibgroup96"          "Sibgroup97"

The combined factors are all explicited in the design matrix, so contrast are easy to write for them. I don't understand how to contrast between the Sibgroup levels (FYI 5 levels: 93,94,95,96,97). If I understood it well,

makeContrasts(Sibgroup94, levels=design)

should make the contrast between the sibgroup 93 and 94. If I want to contrast 94 and 95:

makeContrasts(Sibgroup94-Sibgroup95, levels=design)

Is this correct? What if I want to contrast for ex.

-- Sibgroup93 vs all the remaining sibgroups

-- Sibgroup94 vs all the remaining sibgroups

-- Under a certain condition (for ex. EE2 treated): Sibgroup93 vs all the remaining sibgroups

Thank you again.

ADD REPLY • link 9.3 years ago oliver.selmoni • 0

0

Entering edit mode

I don't really understand why you would contrast the SibGroup levels; I mean, you'll get DE between specific families, but that doesn't seem to be particularly interesting, because you can't generalize it to different families. Nonetheless, your approach is basically correct. If you want to test for one SibGroup against the average of all other levels, you could do something like:

con <- makeContrasts(Sibgroup94-(Sibgroup95+Sibgroup96+Sibgroup97)/4,
       levels=design) # 94 vs the rest
con <- makeContrasts((Sibgroup94+Sibgroup95+Sibgroup96+Sibgroup97)/4,
       levels=design) # 93 vs the rest

Note the division by 4 in the first contrast, even though only three groups are added; this is because 93 is implicitly present in the average, but doesn't have its own coefficient (because the intercepts all cancel out). You won't be able to test for condition-specific differences between Sibgroup levels, because you've made Sibgroup an additive term that doesn't interact with the other conditions. That can't be helped, because otherwise you wouldn't have enough residual d.f. or information for blocking in duplicateCorrelation.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 28k