Question

Using the ERCC spike as a normalization factor for a custom algorithm

0

Entering edit mode

what_theheck24 • 0

@what_theheck24-12452

Last seen 8.9 years ago

Hello all,

I have a couple of theoretical/analysis questions concerning my data analysis. I am spiking in an equal amount of the ERCC Control mix into each of my samples that I can use to create a normalization factor because of the diversity of my libraries. I then would like to generate TMM values for all of the genes in my data set based on the ERCC normalization factor and use those values downstream in some custom algorithms/scripts I am designing. I am using edgeR and the calcNormFactors() function on the reads that align to the ERCC transcripts to generate a normalization factor I can apply to the raw read counts in my data set. I have imported my files into the DGEList format and generated normalization factors two different ways which are listed below:

A). including only those reads which map to the ERCC transcripts

                    files group lib.size norm.factors
1-1_Elution_ERCC_counts    1     4452    0.9584354
1-1_Total_ERCC_counts      2      872    0.9997297
GFP_Elution_ERCC_counts    3     4209    1.0072572
GFP_Total_ERCC_counts      4      778    1.0361298

or

B). using all the reads in a sample (both human RNA reads and ERCC control reads.

                     files group lib.size norm.factors
1-1_Elution_ERCC_counts      1   194732    2.3671725
1-1_Total_ERCC_counts        2   184785    0.5019733
GFP_Elution_ERCC_counts      3   185957    2.2148415
GFP_Total_ERCC_counts        4   206232    0.3799678

At this point I feel it important to note that I recognize how small my library sizes are. This was taken from a pilot experiment I am analyzing right now. My full sequencing experiment will be at a much larger depth.

I know that calcNormFactors() is sensitive to the library size, so it does not surprise me that I get different results from these two methods, but I am wondering which is the correct method for generating a normalization factor for my purposes. I anticipated that the number of ERCC reads would be higher in my Elution samples compared to my Total samples, so I also assumed the norm.factors would be different between the Totals and Elutions as well. Additionally, because I want to use the norm.factors on all of my other genes I think I need to take into consideration the total size of the library (options B), as opposed to option A where I am only looking at the size of the ERCC reads library. In essence, I am leaning towards option B but I wanted to get some thoughts from people with more expertise in this subject.

Once I have this settled, I was going to generate TMM values by using the equation TMM = raw read counts/(library size*normalization factor). Does this yield the same value as the $pseudo.counts from the estimateCommonDisp() function? I tried researching this online and in the edgeR manual, but was confused because the manual states the authors do not recommend using pseudo.counts as normalized values.

Thanks for your help!

Adam

ERCC rna-seq edgeR tmm normalised values • 4.2k views

ADD COMMENT • link updated 8.9 years ago by Aaron Lun ★ 29k • written 8.9 years ago by what_theheck24 • 0

score 3 · Answer 1 · 2017-02-26

It seems unwise to use ERCCs for scaling normalisation in bulk RNA-seq experiments, see http://dx.doi.org/10.1038/nbt.2931. I suspect that this is because the definition of "an equal amount" is not precise in bulk settings. For example, it might be an equal quantity of spike-in RNA per tube; or per unit of endogenous RNA in each sample; or per number of cells in each sample from which the RNA was extracted. It is usually easier to interpret the results if a non-DE majority of genes is assumed and used for normalisation.

That said, if you still want to use the spike-ins, you would use option A where the effective library size is equal to the product of the library sizes (i.e., total spike-in coverage) and the corresponding normalisation factors. Log-transforming the effective library sizes would yield offsets for fitting a log-link GLM. In contrast, option B should be practically equivalent to ignoring the spike-ins altogether. This is because there's only 92 ERCCs but ~10000 human genes after filtering, so the calculation of normalisation factors will be dominated by the latter.

Finally, the pseudo-counts are not calculated by simply scaling the counts. This is because scaling will not preserve the mean-variance relationship. Rather, a NB distribution is fitted to the counts; the mean of the distribution is scaled; and the counts are mapped to the corresponding quantile in the new distribution. See q2qnbinom for the function that is responsible for doing this.