edgeR doesn't have a function to compute normalized counts because it doesn't use them for anything. All models are fitted with the raw counts, using offsets to account for differences in sequencing depth. This is the most accurate approach to modelling as it accommodates changes in the mean-variance relationship with count size.
edgeR only provides functions to compute normalized expression values like CPMs via the cpm()
function. So, the simplest approach to computing normalized values for all genes is to do:
y <- DGEList(cbind(alleleA, alleleB))
# Transplanting normalization factors after filtering.
y2 <- filterByExpr(y) ## insert grouping or design matrix here, if you have it.
y2 <- calcNormFactors(y2)
y$samples$norm.factors <- y2$samples$norm.factors
norm <- cpm(y)
If one MUST have normalized counts - and I don't see why this is necessary, other than to provide an alternative interpretation to make plots - you could replace the cpm()
call with:
eff.lib <- y$samples$norm.factors * y$samples$lib.size
eff.lib <- eff.lib/mean(eff.lib)
norm <- t(t(y$counts)/eff.lib)
One could, in theory, do something fancier with the fact that each pair of allele count profiles come from the same sample. This would be similar to the code in your post where you compute one normalization factor for each pair, but you didn't do it quite right, because you can't divide directly by the normalization factors:
combined <- alleleA+alleleB
keep <- filterByExpr(combined) ## insert grouping or design matrix here.
norm.factors <- calcNormFactors(combined[keep,])
eff.lib <- norm.factors * colSums(combined)
eff.lib <- eff.lib/mean(eff.lib)
normA <- t(t(alleleA)/eff.lib)
normB <- t(t(alleleB)/eff.lib)
The second approach ignores any biases between the allele-specific profiles from the same sample. For example, if you quantified the allele-specific counts by aligning to the reference genome and counting the minor allele frequencies, you might expect to get lower counts for the minor alleles where there were too many mismatches for alignment to the reference. Such biases may or may not matter, depending on whether you care about the magnitude of the ASE (does matter) or changes in the magnitude in the ASE across conditions (less likely to matter, as any biases should hopefully cancel out between conditions).
Regardless of what you choose, a proper analysis would operate from the raw counts rather than the normalized values. The normalized counts or CPMs or whatever are only provided by edgeR for visualization purposes.
Hi If anyone has a suggestion please let me know.
I would appreciate any help Thanks