Question: Method to get normalized allele-specific read counts
@hemantcnaik-23771
India

Dear all,

I currently have allele-specific read counts and want to normalize them. At the moment I am focusing on RLE normalization, as mentioned in the paper https://academic.oup.com/bioinformatics/article/36/2/504/5539691, and I am trying to do the normalization with the edgeR package.

I have two count matrices, alleleA and alleleB, one per allele.

I compute library size factors from the total read count (alleleA + alleleB) and use these factors to normalize both alleleA and alleleB. Does this make sense? Please advise.

scale.factors<- calcNormFactors(alleleA+alleleB, method = "RLE")

count_normalizedA=sweep(alleleA, MARGIN=2, scale.factors, "/" )*mean(scale.factors)

count_normalizedB=sweep(alleleB, MARGIN=2, scale.factors, "/" )* mean(scale.factors)


What is the appropriate way to obtain normalized counts within the edgeR package for RLE or TMM normalization?

Best,

Hemant


Hi, if anyone has a suggestion, please let me know.

I would appreciate any help. Thanks.

Aaron Lun ★ 27k
@alun
The city by the bay

edgeR doesn't have a function to compute normalized counts because it doesn't use them for anything. All models are fitted with the raw counts, using offsets to account for differences in sequencing depth. This is the most accurate approach to modelling as it accommodates changes in the mean-variance relationship with count size.
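To make the offset idea concrete, here is a minimal sketch. It is not edgeR's own fitting code: edgeR fits negative binomial GLMs, whereas this uses a Poisson `glm()` from base R purely to show how depth enters the model. The counts and library sizes are made up.

```r
# Toy data: one gene, four samples, two groups (all values are made up).
counts <- c(10, 20, 40, 80)
lib    <- c(1e6, 2e6, 1e6, 2e6)   # hypothetical effective library sizes
group  <- factor(c("a", "a", "b", "b"))

# Fit on the raw counts. Sequencing depth enters as a log offset,
# so the counts themselves are never rescaled.
fit <- glm(counts ~ group, family = poisson, offset = log(lib))

# Depth-adjusted log fold change of group b over group a.
lfc <- unname(coef(fit)["groupb"])
```

The raw counts differ between samples within each group, but once the offset is taken into account the only remaining difference is the one between groups.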

edgeR only provides functions to compute normalized expression values like CPMs via the cpm() function. So, the simplest approach to computing normalized values for all genes is to do:

library(edgeR)
y <- DGEList(cbind(alleleA, alleleB))

# Transplanting normalization factors after filtering.
# filterByExpr() returns a logical vector of genes to keep, not a DGEList.
keep <- filterByExpr(y) ## insert grouping or design matrix here, if you have it.
y2 <- calcNormFactors(y[keep,])
y$samples$norm.factors <- y2$samples$norm.factors

norm <- cpm(y)
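
For reference, a rough base-R sketch of the arithmetic behind a CPM on a DGEList, as I understand it: each count is divided by the effective library size (library size times normalization factor) and scaled to a million. The count matrix and the normalization factors below are made up for illustration; real factors would come from calcNormFactors().

```r
# Toy counts: 3 genes x 2 samples (made-up values).
counts <- matrix(c(10, 30, 60,
                   40, 60, 100), ncol = 2,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
lib.size     <- colSums(counts)   # 100, 200
norm.factors <- c(1.25, 0.8)      # hypothetical values, chosen to multiply to 1

# Effective library size = library size x normalization factor.
eff.lib <- lib.size * norm.factors   # 125, 160

# Counts per million of effective library size.
cpm.manual <- t(t(counts) / eff.lib) * 1e6
```

The double transpose is there because R recycles a vector down the rows of a matrix; transposing first makes the division apply per sample.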


If one MUST have normalized counts - and I don't see why this is necessary, other than to provide an alternative interpretation to make plots - you could replace the cpm() call with:

eff.lib <- y$samples$norm.factors * y$samples$lib.size
eff.lib <- eff.lib/mean(eff.lib)
norm <- t(t(y$counts)/eff.lib)


One could, in theory, do something fancier with the fact that each pair of allele count profiles comes from the same sample. This would be similar to the code in your post where you compute one normalization factor for each pair, but you didn't do it quite right, because you can't divide directly by the normalization factors. They only have meaning once multiplied by the library sizes:

combined <- alleleA+alleleB
keep <- filterByExpr(combined) ## insert grouping or design matrix here.
norm.factors <- calcNormFactors(combined[keep,], lib.size=colSums(combined)) ## same library sizes as used for eff.lib below.
eff.lib <- norm.factors * colSums(combined)
eff.lib <- eff.lib/mean(eff.lib)

normA <- t(t(alleleA)/eff.lib)
normB <- t(t(alleleB)/eff.lib)
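
A small base-R illustration of why dividing by the normalization factors alone is not enough. This is a toy sketch: the counts are made up, and the factors are simply set to 1, which is roughly what one would expect when two samples differ only in depth.

```r
# Toy example: the same expression profile sequenced at two depths (made-up values).
profile <- c(geneA = 5, geneB = 20, geneC = 75)
counts  <- cbind(s1 = profile * 1, s2 = profile * 2)

lib.size     <- colSums(counts)   # 100, 200
norm.factors <- c(1, 1)           # assumed: no composition bias, so factors of 1

# Dividing by the normalization factors alone leaves the depth difference in place.
wrong <- sweep(counts, 2, norm.factors, "/")

# Dividing by the mean-centred effective library sizes removes it.
eff.lib <- lib.size * norm.factors
eff.lib <- eff.lib / mean(eff.lib)   # 2/3, 4/3
right   <- t(t(counts) / eff.lib)
```

In `wrong`, sample s2 is still twice s1 for every gene. In `right`, the two columns are identical.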


The second approach ignores any biases between the allele-specific profiles from the same sample. For example, if you quantified the allele-specific counts by aligning to the reference genome and counting the minor allele frequencies, you might expect to get lower counts for the minor alleles where there were too many mismatches for alignment to the reference. Such biases may or may not matter, depending on whether you care about the magnitude of the allele-specific expression (ASE), where they do matter, or about changes in the magnitude of the ASE across conditions, where they are less likely to matter as any biases should hopefully cancel out between conditions.
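
One quick, informal way to look for such a bias is to check what fraction of all allelic reads goes to each allele in each sample. The sketch below uses made-up counts and assumes, for illustration only, that alleleB is the non-reference allele.

```r
# Toy allelic counts: 4 genes x 2 samples (made-up values).
alleleA <- matrix(c(60, 55, 52, 70,  58, 50, 61, 66), ncol = 2)
alleleB <- matrix(c(40, 45, 48, 30,  42, 50, 39, 34), ncol = 2)

# Fraction of reads assigned to allele B, per sample.
# Without mapping bias or a genuine global imbalance this should sit near 0.5.
fracB <- colSums(alleleB) / colSums(alleleA + alleleB)
```

A fraction consistently below 0.5 across samples would be in line with the mapping bias described above, though it does not prove it.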

Regardless of what you choose, a proper analysis would operate from the raw counts rather than the normalized values. The normalized counts or CPMs or whatever are only provided by edgeR for visualization purposes.


@Aaron Lun Thank you for your valuable response and for taking the time to answer my question. As you said, it is better to use raw read counts. My data are single-cell, and my main objective is an allelic imbalance (AI) study. The tool I am using for AI says to use normalized or imputed counts. What would you suggest in this situation? Can I use these normalized values, or should I use the raw read counts?


Hi, if anyone has a suggestion, please let me know.

Can I use the above method for single-cell analysis, and is it the right way to do it? Is there any method available for normalizing single-cell allele counts?

I would appreciate any help. Thanks.


If you want advice on a particular tool, you are better off asking the authors of that tool.

FWIW I would just create pseudo-bulk profiles for each haplotype in each cluster and just perform tests for allele-specific expression with edgeR, assuming you have true replicates.
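
As a rough sketch of the pseudo-bulk step, the per-cell allelic counts can be summed over all cells that share a replicate sample and cluster. Everything below is hypothetical: the count matrices, the sample.id and cluster annotations, and the pseudobulk() helper are made up for illustration and use base R only.

```r
# Toy single-cell allelic counts: 2 genes x 6 cells (made-up values).
cells   <- paste0("cell", 1:6)
alleleA <- matrix(1:12, nrow = 2, dimnames = list(c("g1", "g2"), cells))
alleleB <- matrix(12:1, nrow = 2, dimnames = list(c("g1", "g2"), cells))

# Hypothetical per-cell annotation: replicate sample and cluster of each cell.
sample.id <- c("rep1", "rep1", "rep2", "rep2", "rep1", "rep2")
cluster   <- c("c1",   "c1",   "c1",   "c1",   "c2",   "c2")
group     <- paste(sample.id, cluster, sep = "_")

# Sum counts over all cells in the same sample/cluster combination.
# rowsum() aggregates rows, so transpose to put cells in rows first.
pseudobulk <- function(counts, group) t(rowsum(t(counts), group))

pbA <- pseudobulk(alleleA, group)
pbB <- pseudobulk(alleleB, group)
```

The resulting pbA and pbB have one column per sample/cluster combination and could then take the place of alleleA and alleleB in the bulk code earlier in this thread, with the raw pseudo-bulk counts going into the edgeR model.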


Thanks for your advice. It's not specific to one tool. I am trying different normalization approaches that are used for single-cell data and applying them to allele counts, but I am unsure whether this is a good approach, because the tools do not say whether they can be used for single-cell allele analysis. That is why I am asking for suggestions here.

Can you please elaborate on the step mentioned above? I am not able to understand it.