Adjust RNA-seq counts (more than one batch variable) for plotting differentially expressed genes
Entering edit mode
Lucy ▴ 60
Last seen 3 days ago
United Kingdom


I have a vaccine timecourse RNA-seq dataset with four samples per donor, including a pre-vaccine timepoint, a post-prime timepoint, a pre-boost timepoint, and a post-boost timepoint. Some individuals received Vaccine A and other individuals received Vaccine B. I also have a batch effect due to library preparation (Batch 1 and Batch 2).

The model I have used for edgeR differential expression analysis is:
~ 0 + donor + batch + vaccine:timepoint

As I understand, this allows me to control for differences in baseline expression levels between donors and for the library preparation batch effect. I am then looking for differentially expressed genes between post-prime and pre-prime, and post-boost and pre-boost, for each vaccine separately.

I would like to plot boxplots and a heatmap of the differentially expressed genes and for this I require corrected counts. I have currently used ComBat as follows.

group <- paste(donor, library_batch, sep = "_")   # make group variable containing donor and library preparation batch information
corr_counts <- ComBat(log_normalised_counts, batch = group)

However, I am not sure if this is correcting in the same way as edgeR e.g. for baseline differences between donors rather than the average differences between donors.

I am also getting the following warning message and I am not sure whether this can be safely ignored or if there is a way to find out which genes these are:

Found 369 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.

Any help or suggestions would be most appreciated.

Best wishes,

ComBat sva edgeR • 387 views
Entering edit mode
Last seen 29 minutes ago
WEHI, Melbourne, Australia

If you want to remove the batch and donor effects for plotting purposes, then the removeBatchEffect() function is provided for that purpose. removeBatchEffect() can correct for more than one batch effect simultaneously. The approach is to convert counts to log-CPM using cpm() then remove batch effects using removeBatchEffect().

The approach you are currently taking is not correct, for a number of different reasons:

  • The model formula is not valid because it includes an interaction term without either of the main effects. This will give unpredictable results. I have explained previously on this forum how to construct a design matrix for a paired comparison experiment like yours with subjects in two groups (two vaccines in this case). See my answer here: Suggestion for next revision of edgeR User Guide re Design matrix setup or James' answer here: Paired Time course Experiment Design on edgeR between and within groups or see Section 3.5 of the edgeR User's Guide.
  • ComBat() cannot be used in conjunction with edgeR. It does not correct in a way equivalent to edgeR. In my opinion, it tends to over-correct.
  • You haven't explained what the vector log_normalised_counts contains. Is it logCPM perhaps? edgeR doesn't produce anything called a normalised count.
  • Correcting for a combined group variable is not the same as correcting for donor + batch. Using the combined variable will result in over-correction.
Entering edit mode

Dear Gordon Smyth,

Thank you very much for the response.

To generate my design matrix, I followed the approach by Aaron Lun here: EdgeR - Model matrix for complex model similar to user guide 3.5 example - but more complex, and section 3.5 of the edgeR user guide. However I have an extra library preparation batch (batch) variable that I have perhaps included incorrectly? Within each vaccine group, there are a mix of samples that were generated in library preparation batch 1 and library preparation batch 2.

As additional information, not all of the donors in my study have all timepoints.

I have read your response here (Suggestion for next revision of edgeR User Guide re Design matrix setup), but I am still unclear as to where I am going wrong as the columns in my design matrix are similar to the example (with the exception of the additional library preparation batch variable).

y <- DGEList(counts = counts, samples = sample_info)

keep <- filterByExpr(y)

y <- y[keep, , keep.lib.sizes = FALSE]

y <- calcNormFactors(y)

model_var <- data.frame(
    donor = factor(y$samples$donor), 
    batch = factor(y$samples$library_batch),    # two batches
    vaccine = factor(y$samples$vaccine),    # two vaccines
    timepoint = factor(y$samples$timepoint))    # four timepoints

design <- model.matrix(~ 0 + donor + batch + vaccine:timepoint, model_var)
design <- design[, !grepl("T1$", colnames(design))]    # remove T1 (baseline) timepoints

This gives me a design matrix with the following columns:

"donor1" "donor2" "donor3" "donor4" "donor5"
"donor6" "donor7" "donor8" "donor9" "donor10"
"donor11" "donor12" "donor13" "donor14" "donor15"
"donor16" "donor17" "donor18" "donor19" "donor20"
"batch2" "vaccineA.T2" "vaccineB.T2" "vaccineA.T3"
"vaccineB.T3" "vaccineA.T4" "vaccineB.T4"

Apologies if I have misunderstood, but could you explain where I have gone wrong and how you would generate the model formula correctly?

Best wishes,

Entering edit mode

In your question, you hadn't mentioned using grepl to remove the redundant columns of the design matrix. If you omit that step, then the design matrix generated by model.matrix is overparametrized and will give an error in edgeR.

I don't recommend this method of generating the design matrix because it seems to me to be indirect and too easy to make a mistake. Nevertheless, I think you have arrived at the correct design matrix.

Entering edit mode

Great, thank you for the response.

Could you clarify exactly what each coefficient would refer to in this model? My confusion comes from the fact that e.g. not every donor is present in each library batch (some donors have samples in both batches, while other donors have samples only in an individual batch). And each donor is only given one of the two vaccine types.

For example, does the batch2 coefficient indicate the logFC between batch 2 and batch 1, averaged across all donors, vaccines, and timepoints? (since we are assuming that the effect of batch is independent of the other variables in the model). Does this cause problems since the timepoints are not individually distributed across the batches?


Login before adding your answer.

Traffic: 411 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6