Queries regarding normalization methods for RNA sequencing data analysis
@ash29pathak-7467
Last seen 9.2 years ago
United States

Dear Sir / Madam,

Greetings!

I am a PhD student at an academic institute in India. I have the following queries:


1) We can use expected counts to calculate log fold changes and log counts per million using the edgeR package in R. In this package, the library size is calculated as the sum of the expected counts over all contigs. Kindly help me understand how the normalization factors are calculated.

2) Multiplying the normalization factor by the library size gives the effective library size, which is then used to calculate the normalized expected count. Kindly help me understand how the normalized expected count is calculated.

3) Kindly also explain how the TMM-normalized FPKM is calculated.

4) Here is some example data; you are kindly requested to calculate the normalization factors, effective library sizes, normalized expected counts, log fold changes, log counts per million and TMM-normalized FPKM. I feel I can grasp the method more easily from a worked calculation.

Sorry, I have to paste the example data here because I am unable to attach an Excel sheet.

Example data:

Matrix of expected count

               Sample A   Sample B
c989_g1_i1          457        134
c1001_g1_i1         482        117
c997_g1_i1            3         16


Matrix information after analysis

group      lib.size   norm.factors   eff.lib.size
Sample A        942    1.016076654       957.1442
Sample B        267    0.984177716       262.7755
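
(As a check: 942 × 1.016076654 ≈ 957.14 and 267 × 0.984177716 ≈ 262.78, so the effective library size is simply lib.size × norm.factors. What I cannot reproduce is the norm.factors value itself.)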

 

I want to understand how these norm.factors are calculated.

Results of edgeR from the matrix

                     logFC       logCPM       PValue          FDR
c997_g1_i1     4.193418634  14.56202158  0.000305695  0.000917085
c1001_g1_i1   -0.177541924  18.86089555  0.799344988  0.888664958
c989_g1_i1     0.094905219  18.91100806  0.888664958  0.888664958

Kindly help me understand how the normalised expected count is calculated.

Information for calculation of TMM_normalized FPKM

group      lib.size   norm.factors   eff.lib.size
Sample A    3505040    1.041236155        3649574
Sample B    3399608    0.960396924        3264973

For the calculation of the TMM-normalised FPKM, how is this normalisation factor calculated?

After that we get the TMM-normalised FPKM, which is:

TMM_normalized_FPKM

                Sample A     Sample B
c989_g1_i1     412316.07    440363.64
c1001_g1_i1   1171119.49   1035458.31
c997_g1_i1       5958.79    115757.58

To process some of the transcriptome data from my doctoral study, you are kindly requested to clarify the aforementioned doubts. I shall be highly grateful for this help.

I look forward to your response.

With regards


Ashish Kumar Pathak

Aaron Lun ★ 28k
@alun
Last seen 17 hours ago
The city by the bay
  1. Normalization factors are computed using the trimmed mean of M-values (TMM) method; see the paper for more details. Briefly, M-values are defined as the library size-adjusted log-ratios of counts between two libraries. The most extreme 30% of M-values are trimmed away, and the mean of the remaining M-values is computed. This trimmed mean represents the log-normalization factor between the two libraries. The idea is to eliminate systematic differences in the counts between libraries, by assuming that most genes are not DE. (A rough sketch of the calculation is given after this list.)
  2. I'm not sure what you mean by the normalized expected count. In edgeR, normalization operates by scaling the library sizes. The original counts are left untouched (at least, in edgeR's GLM framework), in order to preserve the mean-variance relationship of the data.
  3. As you may know, FPKM stands for fragments per kilobase per million. Fragments represent reads for single-end data, or read pairs for paired-end data. For any gene, the fragment count is divided by the (exonic) length of the gene, in kilobases. This is then divided by the library size and multiplied by a million to get the FPKM. The normalized FPKM just uses the effective library size instead of the original library size (see the second sketch below).
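
To make point 1 concrete, here is a rough, unweighted sketch of the trimmed-mean idea for two libraries. The real calcNormFactors() also trims on average log-expression, uses precision weights and recentres the factors, so it will not reproduce your numbers exactly; this is only meant to illustrate the M-value trimming.

# Illustrative only: a simplified trimmed mean of M-values for two libraries.
# 'obs' and 'ref' are vectors of counts over the same set of genes.
simpleTMM <- function(obs, ref, trim=0.3) {
    keep <- obs > 0 & ref > 0                               # log-ratios need positive counts
    M <- log2((obs[keep]/sum(obs)) / (ref[keep]/sum(ref)))  # library size-adjusted log-ratios
    2^mean(M, trim=trim)                                    # trim the most extreme M-values, average the rest
}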

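To make point 3 concrete, the basic arithmetic is just the following; use the effective library size (lib.size * norm.factors) to get the TMM-normalized version. As far as I recall, rpkm() in edgeR does essentially this per gene and per sample, so hand calculations should come close.

# FPKM for one gene: 'count' fragments, 'len' gene length in bases,
# 'lib' library size in fragments (effective library size for the normalized version).
fpkm <- function(count, len, lib) {
    count / (len/1e3) / (lib/1e6)
}
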
As for the calculations in your example, a better strategy for learning would be to run the R commands on the full dataset. If you can load your count matrix into R, then you can run:

require(edgeR)
y <- DGEList(counts) # 'counts' is your matrix of expected counts
cpm(y, log=TRUE) # will give you counts-per-million.
y <- calcNormFactors(y)
y$samples$norm.factors # will give you normalization factors.
y$samples$norm.factors * y$samples$lib.size # will give you effective library sizes.

Calculation of FPKM requires information about the length of each gene. You'll have to supply gene.lengths according to your biological system.

rpkm(y, gene.lengths) # uses the effective library sizes by default, i.e. TMM-normalized FPKM

Calculation of log-fold changes depends on your experimental design and the comparisons you want to perform. If you have a design matrix and you know the coefficient for your comparison, then it's easy:

fit <- glmFit(y, design)      # fit a negative binomial GLM to each gene
lrt <- glmLRT(fit, coef=coef) # likelihood ratio test on the coefficient of interest
lrt$table                     # logFC, logCPM and p-values for that comparison

See the user's guide for more information.


Thank you, sir, for your kind help.

The research article you suggested was very helpful for understanding the log fold change, log CPM and normalisation factor. Those are now clear to me.

However, I need a little more clarification regarding the calculation of the TMM-normalised FPKM. I understand the concept of FPKM, but I am confused about the calculations edgeR performs for the TMM-normalised FPKM. If we assume we have two samples, then for the calculation of FPKM we provide the effective length of only one sample, which edgeR treats as the test sample, with the other as the reference, and it then calculates the FPKM. In RSEM, by contrast, the FPKM is obtained by first calculating the TPM and then converting it to FPKM using the average transcript length of the library.

I am trying to understand the exact formula edgeR uses to calculate FPKM when only the test sample's length is supplied, and how that length is applied to the reference sample; if we use the effective length, it will vary between samples.

Sorry, sir, as I am a new user of R, I am unable to decode every step it performs.

Your kind suggestions will be very helpful for my understanding of the subject. If possible, please suggest a research article I should read to understand this concept.

With regards.


The length of a gene should be constant between libraries, as you should be counting reads across the same features in each library. It doesn't make sense to test for DE between libraries if you're comparing different gene models; they will obviously be different, so the null hypothesis won't hold in the first place.
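
To make that concrete, something like this uses a single length per gene for every sample. The lengths below are made-up placeholders; your real values should come from your annotation.

library(edgeR)
# The 3 x 2 matrix of expected counts from your example.
counts <- matrix(c(457, 482, 3, 134, 117, 16), ncol=2,
                 dimnames=list(c("c989_g1_i1", "c1001_g1_i1", "c997_g1_i1"),
                               c("Sample A", "Sample B")))
gene.len <- c(1200, 800, 1500)  # hypothetical lengths in bp, not real annotation
y <- DGEList(counts)
y <- calcNormFactors(y)         # TMM normalization factors
rpkm(y, gene.length=gene.len)   # the same length vector is applied to both samples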

