Question

Question: Method to get normalize allele specific read counts

0

Entering edit mode

hemantcnaik • 0

@hemantcnaik-23771

Last seen 12 months ago

India

Dear all,

I am currently have allele specific read count wanted normalize data At the moment, I am focusing on RLE normalization, which is mentioned in paper https://academic.oup.com/bioinformatics/article/36/2/504/5539691 using edgeR package trying to do the normalization.

using edgeR running code I have two matrix file for allelic count alleleA and alleleB

finding library size factor for total read count alleleA+ alleleB and using this size factor to normalize alleleA and alleleB normalization is it making sense please suggest me

scale.factors<- calcNormFactors(alleleA+alleleB, method = "RLE")

count_normalizedA=sweep(alleleA, MARGIN=2, scale.factors, "/" )*mean(scale.factors)

count_normalizedB=sweep(alleleB, MARGIN=2, scale.factors, "/" )* mean(scale.factors)

What is the appropriate way to obtained normalized counts within edgeR package for RLE or TMM normalization ?

Best,

Hemant

edger normalization • 1.6k views

ADD COMMENT • link updated 5.0 years ago by Aaron Lun ★ 28k • written 5.0 years ago by hemantcnaik • 0

0

Entering edit mode

Hi If anyone has a suggestion please let me know.

I would appreciate any help Thanks

ADD REPLY • link 5.0 years ago hemantcnaik • 0

score 2 · Accepted Answer · 2020-07-02

edgeR doesn't have a function to compute normalized counts because it doesn't use them for anything. All models are fitted with the raw counts, using offsets to account for differences in sequencing depth. This is the most accurate approach to modelling as it accommodates changes in the mean-variance relationship with count size.

edgeR only provides functions to compute normalized expression values like CPMs via the cpm() function. So, the simplest approach to computing normalized values for all genes is to do:

y <- DGEList(cbind(alleleA, alleleB))

# Transplanting normalization factors after filtering.
y2 <- filterByExpr(y) ## insert grouping or design matrix here, if you have it.
y2 <- calcNormFactors(y2)
y$samples$norm.factors <- y2$samples$norm.factors

norm <- cpm(y)

If one MUST have normalized counts - and I don't see why this is necessary, other than to provide an alternative interpretation to make plots - you could replace the cpm() call with:

eff.lib <- y$samples$norm.factors * y$samples$lib.size
eff.lib <- eff.lib/mean(eff.lib)
norm <- t(t(y$counts)/eff.lib)

One could, in theory, do something fancier with the fact that each pair of allele count profiles come from the same sample. This would be similar to the code in your post where you compute one normalization factor for each pair, but you didn't do it quite right, because you can't divide directly by the normalization factors:

combined <- alleleA+alleleB
keep <- filterByExpr(combined) ## insert grouping or design matrix here.
norm.factors <- calcNormFactors(combined[keep,])
eff.lib <- norm.factors * colSums(combined)
eff.lib <- eff.lib/mean(eff.lib)

normA <- t(t(alleleA)/eff.lib)
normB <- t(t(alleleB)/eff.lib)

The second approach ignores any biases between the allele-specific profiles from the same sample. For example, if you quantified the allele-specific counts by aligning to the reference genome and counting the minor allele frequencies, you might expect to get lower counts for the minor alleles where there were too many mismatches for alignment to the reference. Such biases may or may not matter, depending on whether you care about the magnitude of the ASE (does matter) or changes in the magnitude in the ASE across conditions (less likely to matter, as any biases should hopefully cancel out between conditions).

Regardless of what you choose, a proper analysis would operate from the raw counts rather than the normalized values. The normalized counts or CPMs or whatever are only provided by edgeR for visualization purposes.