Hi,
I have a vaccine timecourse RNA-seq dataset with four samples per donor, including a pre-vaccine timepoint, a post-prime timepoint, a pre-boost timepoint, and a post-boost timepoint. Some individuals received Vaccine A and other individuals received Vaccine B. I also have a batch effect due to library preparation (Batch 1 and Batch 2).
The model I have used for edgeR differential expression analysis is:
~ 0 + donor + batch + vaccine:timepoint
As I understand, this allows me to control for differences in baseline expression levels between donors and for the library preparation batch effect. I am then looking for differentially expressed genes between post-prime and pre-prime, and post-boost and pre-boost, for each vaccine separately.
I would like to plot boxplots and a heatmap of the differentially expressed genes and for this I require corrected counts. I have currently used ComBat as follows.
group <- paste(donor, library_batch, sep = "_") # make group variable containing donor and library preparation batch information
corr_counts <- ComBat(log_normalised_counts, batch = group)
However, I am not sure if this is correcting in the same way as edgeR e.g. for baseline differences between donors rather than the average differences between donors.
I am also getting the following warning message and I am not sure whether this can be safely ignored or if there is a way to find out which genes these are:
Found 369 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
Any help or suggestions would be most appreciated.
Best wishes,
Lucy
Dear Gordon Smyth,
Thank you very much for the response.
To generate my design matrix, I followed the approach by Aaron Lun here: EdgeR - Model matrix for complex model similar to user guide 3.5 example - but more complex, and section 3.5 of the edgeR user guide. However I have an extra library preparation batch (batch) variable that I have perhaps included incorrectly? Within each vaccine group, there are a mix of samples that were generated in library preparation batch 1 and library preparation batch 2.
As additional information, not all of the donors in my study have all timepoints.
I have read your response here (Suggestion for next revision of edgeR User Guide re Design matrix setup), but I am still unclear as to where I am going wrong as the columns in my design matrix are similar to the example (with the exception of the additional library preparation batch variable).
This gives me a design matrix with the following columns:
Apologies if I have misunderstood, but could you explain where I have gone wrong and how you would generate the model formula correctly?
Best wishes,
Lucy
In your question, you hadn't mentioned using grepl to remove the redundant columns of the design matrix. If you omit that step, then the design matrix generated by model.matrix is overparametrized and will give an error in edgeR.
I don't recommend this method of generating the design matrix because it seems to me to be indirect and too easy to make a mistake. Nevertheless, I think you have arrived at the correct design matrix.
Great, thank you for the response.
Could you clarify exactly what each coefficient would refer to in this model? My confusion comes from the fact that e.g. not every donor is present in each library batch (some donors have samples in both batches, while other donors have samples only in an individual batch). And each donor is only given one of the two vaccine types.
For example, does the batch2 coefficient indicate the logFC between batch 2 and batch 1, averaged across all donors, vaccines, and timepoints? (since we are assuming that the effect of batch is independent of the other variables in the model). Does this cause problems since the timepoints are not individually distributed across the batches?