Using the ERCC spike as a normalization factor for a custom algorithm
1
0
Entering edit mode
@what_theheck24-12452
Last seen 7.2 years ago

Hello all,

I have a couple of theoretical/analysis questions concerning my data analysis. I am spiking in an equal amount of the ERCC Control mix into each of my samples that I can use to create a normalization factor because of the diversity of my libraries. I then would like to generate TMM values for all of the genes in my data set based on the ERCC normalization factor and use those values downstream in some custom algorithms/scripts I am designing. I am using edgeR and the calcNormFactors() function on the reads that align to the ERCC transcripts to generate a normalization factor I can apply to the raw read counts in my data set. I have imported my files into the DGEList format and generated normalization factors two different ways which are listed below:

A). including only those reads which map to the ERCC transcripts

                    files group lib.size norm.factors
1-1_Elution_ERCC_counts    1     4452    0.9584354
1-1_Total_ERCC_counts      2      872    0.9997297
GFP_Elution_ERCC_counts    3     4209    1.0072572
GFP_Total_ERCC_counts      4      778    1.0361298

or

B). using all the reads in a sample (both human RNA reads and ERCC control reads.

                     files group lib.size norm.factors
1-1_Elution_ERCC_counts      1   194732    2.3671725
1-1_Total_ERCC_counts        2   184785    0.5019733
GFP_Elution_ERCC_counts      3   185957    2.2148415
GFP_Total_ERCC_counts        4   206232    0.3799678

At this point I feel it important to note that I recognize how small my library sizes are. This was taken from a pilot experiment I am analyzing right now. My full sequencing experiment will be at a much larger depth.

I know that calcNormFactors() is sensitive to the library size, so it does not surprise me that I get different results from these two methods, but I am wondering which is the correct method for generating a normalization factor for my purposes. I anticipated that the number of ERCC reads would be higher in my Elution samples compared to my Total samples, so I also assumed the norm.factors would be different between the Totals and Elutions as well. Additionally, because I want to use the norm.factors on all of my other genes I think I need to take into consideration the total size of the library (options B), as opposed to option A where I am only looking at the size of the ERCC reads library. In essence, I am leaning towards option B but I wanted to get some thoughts from people with more expertise in this subject.

Once I have this settled, I was going to generate TMM values by using the equation TMM = raw read counts/(library size*normalization factor). Does this yield the same value as the $pseudo.counts from the estimateCommonDisp() function? I tried researching this online and in the edgeR manual, but was confused because the manual states the authors do not recommend using pseudo.counts as normalized values.

Thanks for your help!

Adam

 

ERCC rna-seq edgeR tmm normalised values • 2.6k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 15 hours ago
The city by the bay

It seems unwise to use ERCCs for scaling normalisation in bulk RNA-seq experiments, see http://dx.doi.org/10.1038/nbt.2931. I suspect that this is because the definition of "an equal amount" is not precise in bulk settings. For example, it might be an equal quantity of spike-in RNA per tube; or per unit of endogenous RNA in each sample; or per number of cells in each sample from which the RNA was extracted. It is usually easier to interpret the results if a non-DE majority of genes is assumed and used for normalisation.

That said, if you still want to use the spike-ins, you would use option A where the effective library size is equal to the product of the library sizes (i.e., total spike-in coverage) and the corresponding normalisation factors. Log-transforming the effective library sizes would yield offsets for fitting a log-link GLM. In contrast, option B should be practically equivalent to ignoring the spike-ins altogether. This is because there's only 92 ERCCs but ~10000 human genes after filtering, so the calculation of normalisation factors will be dominated by the latter.

Finally, the pseudo-counts are not calculated by simply scaling the counts. This is because scaling will not preserve the mean-variance relationship. Rather, a NB distribution is fitted to the counts; the mean of the distribution is scaled; and the counts are mapped to the corresponding quantile in the new distribution. See q2qnbinom for the function that is responsible for doing this.

ADD COMMENT
0
Entering edit mode

Thanks for the clarification Aaron. For basic DE analysis we will be assuming a majority of genes are non-DE and using them for normalization. But we also want to look at mRNA half-life and are doing a fractionation to do so (the elution and total samples). This is where we feel it is necessary to provide a control spike in, because the composition/complexity of the total and elution samples are different following fractionation. Furthermore, our calculation relies upon using equivalent volumes of each sample to generate accurate half-lives, but in order to maintain efficient cluster generation during sequencing we normalize samples by weight, thus throwing off the volume ratios we are concerned about.

Thanks for clarifying the pseudo-counts as well. I will take a look at the q2qnbionom function in further detail. Correct me if I am wrong, but you are saying that the pseudo-counts are calculated by using the q2q function and that this would be a better representation of the normalized data as opposed to simply scaling.

ADD REPLY

Login before adding your answer.

Traffic: 893 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6