Design to analyze experiment with multiple samples from the same biological entity
Entering edit mode
tmms • 0
Last seen 6 months ago


I am having trouble choosing a model matrix to analyze certain RNA-Seq data set. I will first describe the experiment and afterwards show between which designs I am doubting.

Design of the experiment
There are six distinct animals and each animal received the control (untrt) and the treatment (trt). The control and the treatment were separated by a fixed amount of time. This makes it a paired design, where each animal serves as it own control.
After the treatment, sperm was extracted from each animal. With that sperm sample, five embryos were generated. This results in 60 samples (6 animals times two treatments times 5 embryos = 60 samples).

This is a part of my metadata.

> head(setup, 10)
          animal treatment attribute
sample21 animal1     untrt      good
sample22 animal1     untrt      good
sample23 animal1     untrt      good
sample24 animal1     untrt      good
sample25 animal1     untrt      good
sample26 animal1       trt      good
sample27 animal1       trt      good
sample28 animal1       trt      good
sample29 animal1       trt      good
sample30 animal1       trt      good

This shows that the experiment is completely balanced.

> sapply(setup, table)

animal1 animal2 animal3 animal4 animal5 animal6 
     10      10      10      10      10      10 


  trt untrt 
   30    30 


 bad good 
  30   30 

The goal of the experiment is to discover if the gene expression of the embryos differs between the treatments. But I am confused what to do with the different embryos? I suppose the embryos can't be considered biological replicates, because one sperm sample was used to generate five embryos.

First idea
My first idea was to sum the counts for all embryos from a given animal and a given treatment (for example sum all samples from animal1 and untrt). But I am not sure if it OK to do this. Don't you loose information about the variability between the embryos when summing ?

Second idea
Keep the metadata and samples as they are and fit a model with animal as blocking factor (~ animal + treatment). In both cases, I would test the last coefficient of model.

Thanks in advance.

P.S. This is my first post on this forum, so I hope I followed all conventions.

rna-seq design linear model glm • 497 views
Entering edit mode
Last seen 2 days ago
United States

I would, for sure, go with the second design over the first.

There may several embryos seeded from the same father, but they are still not exactly the same since each sperm is different, after all.

If we took it one step further and imagine a situation where you were doing some expression analysis on identical twins (exact same genetic background), I'd still consider each twin a "biological replicate" as opposed to a technical one, no?

Another way you might consider approaching this analysis is to use voom and duplicateCorrelation using animal as the block parameter. In that scenario you would just use ~ treatment as your model or maybe ~ treatment + sexif you want to control for sex of the offspring, but I'll let the professional statisticians comment on the appropriateness (or not) of the fixed effects model vs the random effect one in this particular scenario (if we're lucky :-) ... I'd probably start with the fixed effects model first, though ...

Entering edit mode

From a statistical perspective, fitting a fixed effect for animal makes the assumption that differences between litters are relatively consistent increases or decreases in expression for some genes, within each litter, so you can adjust for those differences by estimating the mean per litter and adding that as a coefficient in the model to account for it.

Probably the best way to see if this is a reasonable assumption is to do an MDS plot and see if you get samples grouping by litter in one or more of the first few dimensions, which would indicate consistent expression levels within litter. Another alternative would be to fit the model with an 'animal' factor and see how many genes are significant. How many are 'enough' is up to you as an analyst to decide.

Fitting a random effect (the voom duplicateCorrelation pipeline that Steve mentions) makes the assumption that the gene expression within litter is correlated to some extent. In other words, if a gene in one littermate is high, you might expect that gene could be high in some of the other littermates. You can run voom and duplicateCorrelation (twice each, supplying the correlation value from the first run of duplicateCorrelation to voom the second time around), and see if the consensus correlation is 'high enough' to provide evidence that it's a thing (where 'high enough' is something you have to define as the analyst). So something like

v <- voom(dglst, design)
vcor <- duplicateCorrelation(v, design, block = animal)
v <- voom(dglst, design, block = animal, correlation = vcor$consensus.correlation)
vcor <- duplicateCorrelation(v, design, block = animal)

And deciding if the value of vcor$consensus.correlation is large enough to care about.

Entering edit mode

Thank you for your answer.

I made a PCA plot of the 60 samples. I did not obverse a strong clustering by litter. It looked like a random cloud of points.

Why do you run duplicateCorrelation twice?

If it is allowed, I have an off-topic question. Could you recommend one or more sources to learn about the (dis)avantages of certain designs? I can come up with different ideas, but I sometimes do not see the (dis)advantages.

Entering edit mode

When you run voom the second time, by including block and correlation you use a linear mixed model (rather than the fixed effects model used in the first iteration) to estimate the observation-level weights, which may affect those weights. When you run duplicateCorrelation, the weights are used as part of estimating the correlations, so you want to provide the updated weights to get a better estimate of the consensus correlation.

I can't come up with any sources offhand, other than just regular linear modeling textbooks. During my formal training, like 20 years ago now, they used Applied Linear Statistical Models, by Neter et al, which I always thought was pretty approachable (especially for someone like me, who came to this via a biology undergrad degree). No idea if that book is still around, but I imagine it is? I always though Julian Faraday's linear modeling book was pretty good as well, being R based and all. But most books like that approach the problem from 'if you have repeated measures you must use mixed models', rather than saying what the pros and cons might be.

Another thing you could do is just search this support site for Aaron Lun and duplicateCorrelation or so. He's like way smarter than I will ever be, and does a good job of outlining the pros and cons. But you have to find the relevant posts... I usually do something like <search strings="" go="" here=""> on Google, which seems to do a better job than the site search functionality. Although I believe it is supposed to have been improved.

Entering edit mode

Thanks for chiming in, James!

I'm always grateful when the stats folks chime in with mini lessons like this. I've learned a lot about the proper application of linear models to genomics analysis through posts just like this one through the years.

Entering edit mode

Thank you for the answer.

I felt that the terms biological and technical replicate were too restrictive in this case. They seem to be neither of them. Given your example, I would also call them biological replicates.

Thanks for the tip about duplicateCorrelation.


Login before adding your answer.

Traffic: 192 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6