Hi everyone, I am using edgeR to identify differentially expressed transcripts based on a simple RNA-seq setup (tumor vs control samples).
The DGELRT object with results provided by edgeR contains, among others, logFC values calculated for the contrast that I used (Tumor vs Control). I additionally need the average read counts for both sample groups, used to calculate the logFC values. For a model matrix based on a single factor I can do that by averaging pseudo.counts from the DGEList. However if I use a model matrix with an additional factor for batch-effect correction this method no longer provides average reads used to calculate logFC values in the DGELRT object. Could you please guide me to an appropriate way on how to either get the batch adjusted reads of just the average read counts used to get the logFC values?
Below is a simplified version of my code:
#create DGEList object
CountsTableEdge <- DGEList(counts=CountsTable, genes=rownames(CountsTable), group=factor_names)
#normalize and calculate dispersions
CountsTableEdge <- calcNormFactors(CountsTableEdge, method="TMM")
CountsTableEdge <- estimateCommonDisp(CountsTableEdge)
CountsTableEdge <- estimateTagwiseDisp(CountsTableEdge)
#create design matrix
design <- model.matrix(~0+factor_names+batch)
colnames(design) = gsub('factor_names','',colnames(design))
#fit the model
fit <- glmFit(CountsTableEdge,design)
#make the contract matrix and test DE
cont.matrix <- makeContrasts(contrasts="Tumor - Control", levels=design)
tRes = glmLRT(fit,contrast=cont.matrix)
#calulate the average values in each group
TumorMean = rowMeans(CountsTableEdge$pseudo.counts[,TumorCols])
ControlMean = rowMeans(CountsTableEdge$pseudo.counts[,ControlCols])
logFC=log2(TumorMean/ControlMean)
### logFC matches tRes$table$logFC only for a simplified model matrix without the batch factor
Thank you very much! this is exactly what I needed.
The reason why I stick to the "classic" edgeR is because it provides the pseudo-counts that I needed to create visualizations for individual features (requested by my collaborator). I am aware that this is not the best way of doing the analysis and that pseudo-counts are intended mainly for internal use by the edgeR, as stated in the manual, but what other option there is to visualize individual samples? The average counts were actually requested by one of the reviewers of our manuscript. It seems that despite some things have changed people are still used to the classic statistics from the microarray age.
EDIT: for the second part I made a small modification, instead of:
I used:
since it appears they provide something different.
For any visualization that you are using pseudo-counts for, you can replace it directly with the output of
cpm
, which is a more common measure of expression anyway. As a reviewer, I wouldn't even bat an eye upon seeing CPMs; pseudo-counts, on the other hand, make me wonder... And once a reviewer has to think about something, you're in trouble.Thank you for the suggestion, I will replace pseudo-counts with CPM.