Hello, I am pretty new to edgeR but have read through the documentation and I haven't found any examples that seem to match my scenario. I'm using edgeR to analyze genome wide genetic screen data. I had three replicates of my library, each infected independently, however, before the beginning of the screen one replicate became contaminated. At the start of the screen, I took one of my remaining two replicates and split it into two, maintaining each separately for the duration of the experiment. Thus, I now have two samples that represent my start time point. One of those (Replicate A) has a single paired end point (A.end). In Rep B, I now have two end samples (B.end.1 and B.end.2) that each pair with B.start.
In my experiments in which all 3 replicates make it through the screen, I have generally been using "Replicate"=c(1,2,3,1,2,3) and "Timepoint"=c("Start","Start","Start","End","End","End") to create my design matrix. Would the correct way to amend this to reflect my current situation be: "Replicate"=c(1,2,1,2,2), "Timepoint"=c("Start","Start","End","End","End")?
Thank you!
Thanks so much for your response! This makes total sense, and I completely agree that B.end.1 and B.end.2 are not true replicates of each other... I guess that's why I was hoping that designating them both with the same replicate number and the same timepoint factor might be the way to tell edgeR to deal with them in exactly the manner you are describing?
I'm hesitant to just completely pool them as there is some degree of relevant variation between them. Some groups routinely perform screens like this by infecting a single population of cells, splitting into three batches and then proceeding with the experiment. Of course, the independent infection method allows you to control for effects resulting from the retrovirus integrating in some more/less expressed site, or other weird effects relating to integration. But that aside, my two B.end samples should capture information about the stochastic fluctuations in a pooled population of cells carried over several population doublings, so ideally I'd like to not lose that information. Does this seem reasonable to try to find an in-between category, where it can somehow be viewed as an independent sample that's part of the same replicate?
Well, consider the most extreme case, where the difference in expression between the start and end samples depends only on the population, i.e., any split samples from a single population will always yield the same end samples after screening. The model that you describe in your original post won't be able to handle that, as the split end samples won't be independent of each other. (It'd be like treating technical replicates as biological replicates.) Even in more realistic settings, there'll still be some dependencies which could distort error control, as the response to the screen is going to be determined in part by population-specific effects. Your model only accounts for population-specific differences in start-time expression, so population-level variability in the screen response will lead to unmodelled similarities between the split end points. I guess it might not be so bad in practice, but I don't know enough about this type of data to say that for sure.
edgeR doesn't support two levels of variability, because it only computes one dispersion estimate per gene. If you had a data set containing many of these split samples,
duplicateCorrelation
withvoom
might be applicable - the correlation would describe the dependencies between the split samples, while the variance estimate from limma would describe the variability between populations. However, this isn't feasible for your analysis as you don't have enough samples.In any case, information about the stochastic fluctuations from infection is still present in your experiment after pooling/ignoring the extra end samples. The variability in the screen response between populations will include that from fluctuations, so it'll be properly modelled. Sure, it would be nice to also consider the information between the split end samples - in fact, this kind of scenario was what
duplicateCorrelation
was designed for - but I don't think it's practical in this case.If you like, you can still fit an extra model using only the split end samples. This will give you some diagnostics about the proportion of the BCV that is attributable to stochastic fluctuations over time, given the infected population.