Question

Proper Design Matrix for Paired Samples where endpoint has multiple samples that pair to the same start sample

0

Entering edit mode

laura • 0

@laura-10637

Last seen 8.0 years ago

Hello, I am pretty new to edgeR but have read through the documentation and I haven't found any examples that seem to match my scenario. I'm using edgeR to analyze genome wide genetic screen data. I had three replicates of my library, each infected independently, however, before the beginning of the screen one replicate became contaminated. At the start of the screen, I took one of my remaining two replicates and split it into two, maintaining each separately for the duration of the experiment. Thus, I now have two samples that represent my start time point. One of those (Replicate A) has a single paired end point (A.end). In Rep B, I now have two end samples (B.end.1 and B.end.2) that each pair with B.start.

In my experiments in which all 3 replicates make it through the screen, I have generally been using "Replicate"=c(1,2,3,1,2,3) and "Timepoint"=c("Start","Start","Start","End","End","End") to create my design matrix. Would the correct way to amend this to reflect my current situation be: "Replicate"=c(1,2,1,2,2), "Timepoint"=c("Start","Start","End","End","End")?

Thank you!

edger design matrix • 692 views

ADD COMMENT • link updated 8.0 years ago by Aaron Lun ★ 28k • written 8.0 years ago by laura • 0

score 0 · Answer 1 · 2016-05-04

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 4 hours ago

The city by the bay

In other words, you have two populations (A and B) for which you have 1 start sample and 1 (A) or 2 (B) end samples. I suspect that it would not be appropriate to treat the split end samples for B as biological replicates of each other. In your experiment, the true biological replication is that between populations; the two end samples for B provide a weaker level of replication because they don't capture the variance between populations - obviously, given that those two samples are derived from a single population. I would guess that this would lead to underestimation of the dispersion, as the low variability between the two end samples will pull down the estimate of the variability between populations (which is what you want to model).

The simplest way to avoid this, while still using the information in the two end samples, is to pool those two samples together into a single end sample for population B. Then use a standard paired-samples design as you've done before, but this time with only two populations. This pooling approach is most sensible when the variability between the two end samples is substantially smaller (ideally Poisson, like summing technical replicates) than the variability between the two populations; otherwise, it might be safer to just drop one of the end samples altogether.

ADD COMMENT • link 8.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks so much for your response! This makes total sense, and I completely agree that B.end.1 and B.end.2 are not true replicates of each other... I guess that's why I was hoping that designating them both with the same replicate number and the same timepoint factor might be the way to tell edgeR to deal with them in exactly the manner you are describing?

I'm hesitant to just completely pool them as there is some degree of relevant variation between them. Some groups routinely perform screens like this by infecting a single population of cells, splitting into three batches and then proceeding with the experiment. Of course, the independent infection method allows you to control for effects resulting from the retrovirus integrating in some more/less expressed site, or other weird effects relating to integration. But that aside, my two B.end samples should capture information about the stochastic fluctuations in a pooled population of cells carried over several population doublings, so ideally I'd like to not lose that information. Does this seem reasonable to try to find an in-between category, where it can somehow be viewed as an independent sample that's part of the same replicate?

ADD REPLY • link 8.0 years ago laura • 0

0

Entering edit mode

Well, consider the most extreme case, where the difference in expression between the start and end samples depends only on the population, i.e., any split samples from a single population will always yield the same end samples after screening. The model that you describe in your original post won't be able to handle that, as the split end samples won't be independent of each other. (It'd be like treating technical replicates as biological replicates.) Even in more realistic settings, there'll still be some dependencies which could distort error control, as the response to the screen is going to be determined in part by population-specific effects. Your model only accounts for population-specific differences in start-time expression, so population-level variability in the screen response will lead to unmodelled similarities between the split end points. I guess it might not be so bad in practice, but I don't know enough about this type of data to say that for sure.

edgeR doesn't support two levels of variability, because it only computes one dispersion estimate per gene. If you had a data set containing many of these split samples, duplicateCorrelation with voom might be applicable - the correlation would describe the dependencies between the split samples, while the variance estimate from limma would describe the variability between populations. However, this isn't feasible for your analysis as you don't have enough samples.

In any case, information about the stochastic fluctuations from infection is still present in your experiment after pooling/ignoring the extra end samples. The variability in the screen response between populations will include that from fluctuations, so it'll be properly modelled. Sure, it would be nice to also consider the information between the split end samples - in fact, this kind of scenario was what duplicateCorrelation was designed for - but I don't think it's practical in this case.

If you like, you can still fit an extra model using only the split end samples. This will give you some diagnostics about the proportion of the BCV that is attributable to stochastic fluctuations over time, given the infected population.

ADD REPLY • link 8.0 years ago Aaron Lun ★ 28k