I am working on an RNA-seq dataset and have been using the EdgeR differential expression analysis methods with GLMs. Here, I have specified batch as a factor in the design matrix. The batch factor is uncorrelated with the other experimental factors and each batch is evenly represented among all environments.
But I have been asked to get batch-adjusted RPKM expression values for a select number of genes to make a bar chart (I of course want to use another chart type since bar charts can be deceptive) because if we just show the rpkm-normalized values that are not batch-adjusted, the results might be compromised by batch.
My general approach is as follows. I start with the entire dataset of about 100k genes. I make the design matrix from the experimental factors in accordance with the experimental design, and then define a batch factor. I then put the counts into a DGEList, normalize, log2 transform with rpkm() using effective gene lengths and then use the removeBatchEffect() function.
factorA <- factor(substring(colnames(data), 1, 6)) factorB <- factor(substring(colnames(data), 7, 11)) batch <- factor(substring(colnames(data), 13, 13)) # Making the design matrix. Should only contain factors to be saved. design <- model.matrix(~factorA*factorB) # Filtering keep <- rowSums(cpm(data)>0.1) >= 8 data <- data[keep, ] # Put counts into DGEList container needed for latter steps. y <- DGEList(counts=data) # TMM-normalization. y <- calcNormFactors(y) # Convert to RPKM and log2 transformation. data <- rpkm(y, gene.length=length$effective_length, log=TRUE, prior.count=0.01) # Remove batch effect. data <- removeBatchEffect(data, batch=batch, design=design)
The reason for such a shallow filtering and prior count is because those genes that are of interest are very, very lowly expressed, so they might otherwise get removed in the filtering step or given a prior count that, proportionally speaking, would increase their normalized expression value substantially.
So for my questions:
- Can I remove the 2 logarithm and get back non-logarithm batch-adjusted normalized values? In other words, is Limma's removeBatchEffect() and log2() commutative? I am guessing no since log2() is rarely commutative with other operations.
- When running rpkm() as above, is log2() applied before or after rpkm normalization?
- Anything that looks questionable in the above approach I have taken?
This will work even better if you set
before forming the design matrix. That will ensure that setting the batch effects to zero does not change the size of the counts on average.
Dear Gordon,
By any chance, should
contr.sum
set to be2
instead of10
as it should be thelength
of batchlevels
according to the example given by Aaron?PS: I know this is an old post, but I hope you will see this.
Yes, you're right. I've edited my comment to fix it. The use of
levels(batch)
will ensure the correct number of levels is used, whatever that is.Thanks for this neat technique!
Aaron, I followed the above code (together with the recommendations from https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#edgeR to incorporate the length offsets from tximport) and I get negative values for the
pseudo.counts
. Any recommendation on how to deal with this?Negative values occur at greatest magnitude in samples where
batch
in the design matrix is 1 rather than 0.Probably just floor it at zero. The procedure in
q2qnbinom
is approximate (check out the documentation as to why) so there's always a chance of this happening. Mind, this requires the counts to be pretty low and most of the time you'd have filtered out low-abundance genes already.