Question

Design Matrix with paired and independent samples plus RUVg Spike-in normalization

0

Entering edit mode

ChrisM • 0

@chrism-16926

Last seen 5.7 years ago

Dear All,

This is my first post on Bioconductor so if I am not clear please have some patience. I am working on an RNA-seq (human) dataset with 6 paired samples pre and post treatment and 2 independent controls. Unfortunately I don't have a bioinformatics background so I am trying to grasp the design matrix concept. I can do simple design differential expression analysis with EdgeR, but in this case I have a situation with paired samples, 2 independent samples and spike-in controls that I would like to use with RUVseq to remove variability. This means in my design matrix I have to account for the W_1 as a covariate.

How to design a matrix for this purpose? If it was only the paired samples it would be an additive model as described on the edger vignette and I would do something like:

design = model.matrix(~patient+W_1+Treat, data = pData(set1))

with W_1 from RUVseq with spike-in controls.

But in this case I also have 2 healthy controls that are completely independent from the patients pre and post treatment. So I would like to design a matrix so that I can compare differential expression within the patients pre and post treatment, between patients pre treatment and healthy controls and between patients post-treatment and healthy controls. So in the end my question is how to design a matrix and after fitting the model how to get the differential expression for each case.

If you also have any good examples to understand the design matrix and how to get the differential expression with glmLRT(fit) in edger that would be greatly appreciated.

Thanks for your kind help,

Chris

edger design matrix paired samples ruvseq • 1.3k views

ADD COMMENT • link updated 5.7 years ago by James W. MacDonald 65k • written 5.7 years ago by ChrisM • 0

score 1 · Answer 1 · 2018-08-15

1

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 day ago

United States

I don't see how you can do that in one go using a fixed effect for patient. You can't fit a patient level blocking factor if you have some patients with just one observation, and the comparisons between the pre-treatment and controls and the post-treatment and controls don't involve any repeated measures so you can't use the patient level blocking factors in that comparison either.

The way I would do this would be to fit two models; one that only includes the patients for which you have multiple observations, with a patient blocking factor, and another with the pre/post/controls and no patient level blocking factor. In the latter you wouldn't want to make any pre vs post comparisons, because you aren't accounting for patient.

An alternative might be to use limma voom and then fit a glm, accounting for the within-patient dependence as a random effect (e.g., using duplicateCorrelation to assess within-patient correlation and passing that into lmFit). You could then make any comparison you like.

ADD COMMENT • link 5.7 years ago James W. MacDonald 65k

1

Entering edit mode

To follow up on James' comments:

If you're going to fit the second proposed model with edgeR, make sure you only use one sample from each patient. For example, for half of the patients, take only the "pre" sample, and for the other half, take the "post" sample. This avoids correlations between samples that come from the same patient (which would not be modelled properly here, as there is no patient blocking factor in the second model). Or alternatively, you set up two separate models, one with the two controls + all post samples for the control/post comparison; and another model with controls + pre samples, for the control/pre comparison.

You can see how this getting complicated, so I would recommend using duplicateCorrelation instead. This makes life a lot easier, as you can just use a single model for all of your comparisons between post/pre/controls. It also ensures that you use all of the information in the data for each of your contrasts, albeit at the cost of some anticonservativeness.

ADD REPLY • link 5.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks for the reply! So using two separate models is still fine from the statistical point of view. I was just concerned about the difference between having all the samples in one model versus two different models , but I guess they answer different questions. The first model would address what difference is there in patients pre and post treatment and the second would address what differences pre and post patients have with normal healthy cell transcriptome.

ADD REPLY • link 5.7 years ago ChrisM • 0