Queries regarding normalization methods for RNA sequencing data analysis
@ash29pathak-7467
Last seen 9.2 years ago
United States

Dear Sir / Madam,

Greetings!

I am a PhD student at an academic institute in India. I have the following queries:


1) We can use expected counts to calculate log fold changes and log counts per million using the edgeR package in R. In this package, the library size is calculated as the sum of the expected counts over all contigs. Kindly help me understand how the normalization factors are calculated.

2) Multiplying the normalization factor by the library size gives the effective library size, which is then used to calculate the normalized expected count. Kindly help me understand how the normalized expected count is calculated.

3) Kindly also explain how the TMM-normalized FPKM is calculated.

4) Here is some example data; you are kindly requested to calculate the normalization factors, effective library sizes, normalized expected counts, log fold changes, log counts per million and TMM-normalized FPKM. I feel I can grasp the method more easily from a worked calculation.

Sorry, I have to paste the example data here because I am unable to attach an Excel sheet.

Example data:

Matrix of expected count

               Sample A   Sample B
c989_g1_i1          457        134
c1001_g1_i1         482        117
c997_g1_i1            3         16


Matrix information after analysis

group      lib.size   norm.factors   eff.lib.size
Sample A        942    1.016076654       957.1442
Sample B        267    0.984177716       262.7755
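
(As a check: 942 × 1.016076654 ≈ 957.14 and 267 × 0.984177716 ≈ 262.78, so the effective library size is simply lib.size × norm.factors. What I cannot reproduce is the norm.factors value itself.)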

 

I want to understand how these norm.factors are calculated.

Results of edgeR from the matrix

                     logFC       logCPM       PValue          FDR
c997_g1_i1     4.193418634  14.56202158  0.000305695  0.000917085
c1001_g1_i1   -0.177541924  18.86089555  0.799344988  0.888664958
c989_g1_i1     0.094905219  18.91100806  0.888664958  0.888664958

Kindly help me understand how the normalised expected count is calculated.

Information for calculation of TMM_normalized FPKM

group      lib.size   norm.factors   eff.lib.size
Sample A    3505040    1.041236155        3649574
Sample B    3399608    0.960396924        3264973

For the calculation of the TMM-normalised FPKM, how is this normalisation factor calculated?

After that we get the TMM-normalised FPKM, which is:

TMM_normalized_FPKM

                Sample A     Sample B
c989_g1_i1     412316.07    440363.64
c1001_g1_i1   1171119.49   1035458.31
c997_g1_i1       5958.79    115757.58

To process some of the transcriptome data from my doctoral study, you are kindly requested to clarify the aforementioned doubts. I shall be highly grateful for this help.

I look forward to your response.

With regards


Ashish Kumar Pathak

Aaron Lun ★ 28k
@alun
Last seen 17 hours ago
The city by the bay
  1. Normalization factors are computed using the trimmed mean of M-values (TMM) method; see the paper for more details. Briefly, M-values are defined as the library size-adjusted log-ratios of counts between two libraries. The most extreme 30% of M-values are trimmed away, and the mean of the remaining M-values is computed. This trimmed mean represents the log-normalization factor between the two libraries. The idea is to eliminate systematic differences in the counts between libraries, by assuming that most genes are not DE. (A rough sketch of the calculation is given after this list.)
  2. I'm not sure what you mean by the normalized expected count. In edgeR, normalization operates by scaling the library sizes. The original counts are left untouched (at least, in edgeR's GLM framework), in order to preserve the mean-variance relationship of the data.
  3. As you may know, FPKM stands for fragments per kilobase per million. Fragments represent reads for single-end data, or read pairs for paired-end data. For any gene, the fragment count is divided by the (exonic) length of the gene, in kilobases. This is then divided by the library size and multiplied by a million to get the FPKM. The normalized FPKM just uses the effective library size instead of the original library size (see the second sketch below).
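
To make point 1 concrete, here is a rough, unweighted sketch of the trimmed-mean idea for two libraries. The real calcNormFactors() also trims on average log-expression, uses precision weights and recentres the factors, so it will not reproduce your numbers exactly; this is only meant to illustrate the M-value trimming.

# Illustrative only: a simplified trimmed mean of M-values for two libraries.
# 'obs' and 'ref' are vectors of counts over the same set of genes.
simpleTMM <- function(obs, ref, trim=0.3) {
    keep <- obs > 0 & ref > 0                               # log-ratios need positive counts
    M <- log2((obs[keep]/sum(obs)) / (ref[keep]/sum(ref)))  # library size-adjusted log-ratios
    2^mean(M, trim=trim)                                    # trim the most extreme M-values, average the rest
}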

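To make point 3 concrete, the basic arithmetic is just the following; use the effective library size (lib.size * norm.factors) to get the TMM-normalized version. As far as I recall, rpkm() in edgeR does essentially this per gene and per sample, so hand calculations should come close.

# FPKM for one gene: 'count' fragments, 'len' gene length in bases,
# 'lib' library size in fragments (effective library size for the normalized version).
fpkm <- function(count, len, lib) {
    count / (len/1e3) / (lib/1e6)
}
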
As for the calculations in your example, a better strategy for learning would be to run the R commands on the full dataset. If you can load your count matrix into R, then you can run:

require(edgeR)
y <- DGEList(counts) # 'counts' is your matrix of expected counts
cpm(y, log=TRUE) # will give you counts-per-million.
y <- calcNormFactors(y)
y$samples$norm.factors # will give you normalization factors.
y$samples$norm.factors * y$samples$lib.size # will give you effective library sizes.

Calculation of FPKM requires information about the length of each gene. You'll have to supply gene.lengths according to your biological system.

rpkm(y, gene.lengths) # uses the effective library sizes by default, i.e. TMM-normalized FPKM

Calculation of log-fold changes depends on your experimental design and the comparisons you want to perform. If you have a design matrix and you know the coefficient for your comparison, then it's easy:

fit <- glmFit(y, design)      # fit a negative binomial GLM to each gene
lrt <- glmLRT(fit, coef=coef) # likelihood ratio test on the coefficient of interest
lrt$table                     # logFC, logCPM and p-values for that comparison

See the user's guide for more information.


Thank you, sir, for your kind help.

The research article you suggested was very helpful for understanding the log fold change, log CPM and normalisation factor. Those are now clear to me.

However, I need a little more clarification regarding the calculation of the TMM-normalised FPKM. I understand the concept of FPKM, but I am confused about the calculations edgeR performs for the TMM-normalised FPKM. If we assume we have two samples, then for the calculation of FPKM we provide the effective length of only one sample, which edgeR treats as the test sample, with the other as the reference, and it then calculates the FPKM. In RSEM, by contrast, the FPKM is obtained by first calculating the TPM and then converting it to FPKM using the average transcript length of the library.

I am trying to understand the exact formula edgeR uses to calculate FPKM when only the test sample's length is supplied, and how that length is applied to the reference sample; if we use the effective length, it will vary between samples.

Sorry, sir, as I am a new user of R, I am unable to decode every step it performs.

Your kind suggestions will be very helpful for my understanding of the subject. If possible, please suggest a research article I should read to understand this concept.

With regards.


The length of a gene should be constant between libraries, as you should be counting reads across the same features in each library. It doesn't make sense to test for DE between libraries if you're comparing different gene models; they will obviously be different, so the null hypothesis won't hold in the first place.
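
To make that concrete, something like this uses a single length per gene for every sample. The lengths below are made-up placeholders; your real values should come from your annotation.

library(edgeR)
# The 3 x 2 matrix of expected counts from your example.
counts <- matrix(c(457, 482, 3, 134, 117, 16), ncol=2,
                 dimnames=list(c("c989_g1_i1", "c1001_g1_i1", "c997_g1_i1"),
                               c("Sample A", "Sample B")))
gene.len <- c(1200, 800, 1500)  # hypothetical lengths in bp, not real annotation
y <- DGEList(counts)
y <- calcNormFactors(y)         # TMM normalization factors
rpkm(y, gene.length=gene.len)   # the same length vector is applied to both samples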

