I am working on an RNA-seq dataset and have been using the EdgeR differential expression analysis methods with GLMs. Here, I have specified batch as a factor in the design matrix. The batch factor is uncorrelated with the other experimental factors and each batch is evenly represented among all environments.
But I have been asked to get batch-adjusted RPKM expression values for a select number of genes to make a bar chart (I of course want to use another chart type since bar charts can be deceptive) because if we just show the rpkm-normalized values that are not batch-adjusted, the results might be compromised by batch.
My general approach is as follows. I start with the entire dataset of about 100k genes. I make the design matrix from the experimental factors in accordance with the experimental design, and then define a batch factor. I then put the counts into a DGEList, normalize, log2 transform with rpkm() using effective gene lengths and then use the removeBatchEffect() function.
factorA <- factor(substring(colnames(data), 1, 6)) factorB <- factor(substring(colnames(data), 7, 11)) batch <- factor(substring(colnames(data), 13, 13)) # Making the design matrix. Should only contain factors to be saved. design <- model.matrix(~factorA*factorB) # Filtering keep <- rowSums(cpm(data)>0.1) >= 8 data <- data[keep, ] # Put counts into DGEList container needed for latter steps. y <- DGEList(counts=data) # TMM-normalization. y <- calcNormFactors(y) # Convert to RPKM and log2 transformation. data <- rpkm(y, gene.length=length$effective_length, log=TRUE, prior.count=0.01) # Remove batch effect. data <- removeBatchEffect(data, batch=batch, design=design)
The reason for such a shallow filtering and prior count is because those genes that are of interest are very, very lowly expressed, so they might otherwise get removed in the filtering step or given a prior count that, proportionally speaking, would increase their normalized expression value substantially.
So for my questions:
- Can I remove the 2 logarithm and get back non-logarithm batch-adjusted normalized values? In other words, is Limma's removeBatchEffect() and log2() commutative? I am guessing no since log2() is rarely commutative with other operations.
- When running rpkm() as above, is log2() applied before or after rpkm normalization?
- Anything that looks questionable in the above approach I have taken?