I have used ERCC spike-ins in a large RNA-Seq study (600+ samples), and I
would temper expectations for any approach based on them. The dynamic
range of the spike-ins is large (I recall 18 orders of magnitude on a
base-2 scale), so unless you are sequencing quite deeply, you won't get
high read counts for at least the bottom 1/3 of that range. I tried a
number of different strategies to use that information for the
size factors, but was never comfortable with the results. The spike-ins
themselves are subject to a great deal of sample-to-sample variability
(due to pipetting variance, differences in library diversity, etc.),
which makes them less appealing as a basis for normalization once you
see the results. The result was sample-to-sample differences of several
fold in some cases. By the way, our depth was ~20M reads per sample.
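
For illustration only, here is a minimal Python sketch of one way spike-in counts could be turned into size factors: a median-of-ratios estimate restricted to the spike-ins, after dropping those with low counts. This is not the code used in the study described above. The function name, the `min_count` cutoff, and the counts are hypothetical, and the low-count filter is an assumption motivated by the sparse bottom of the concentration range.

```python
import math
from statistics import median


def spikein_size_factors(counts, min_count=10):
    """Median-of-ratios size factors computed from spike-in counts only.

    counts: dict mapping a spike-in id to a list of raw read counts,
            one entry per sample (same sample order for every spike-in).
    min_count: hypothetical cutoff (must be >= 1); spike-ins with fewer
               reads in any sample are ignored.
    """
    n_samples = len(next(iter(counts.values())))

    # Drop spike-ins that are too sparsely sampled to give a stable ratio.
    kept = {k: v for k, v in counts.items() if min(v) >= min_count}
    if not kept:
        raise ValueError("no spike-in passes the count filter")

    # Log of the geometric mean across samples, per spike-in.
    log_ref = {
        k: sum(math.log(c) for c in v) / n_samples for k, v in kept.items()
    }

    # For each sample, take the median ratio to that reference.
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(v[j]) - log_ref[k] for k, v in kept.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Made-up counts for three samples; "spike_rare" fails the filter.
counts = {
    "spike_hi": [4000, 8000, 2000],
    "spike_mid": [400, 800, 200],
    "spike_lo": [40, 80, 20],
    "spike_rare": [0, 3, 1],
}
factors = spikein_size_factors(counts)
print(factors)
```

With real data, comparing these factors against ones estimated from the endogenous genes is a quick way to see how far the two disagree.
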
My experience agrees with that reported in the following paper, which
uses some data from the SEQC study, and does consider the spike-ins in
a complex background (i.e., spiked-in to a human sample at suggested
concentrations). They also looked at large data sets.
This paper (http://www.ncbi.nlm.nih.gov/pubmed/21816910) is more
optimistic, and may seem somewhat contradictory to my comments and the
paper above; however, a key difference is the sampling depth in this
one. A glance at supplemental table S2 shows the average number of
reads was 230M PER (human) SAMPLE! They also used paired-end reads.
I did find the spike-ins useful for computing an "empirical" false
discovery rate (using the ERCC Set B) between groups. With reasonable
sample sizes per group (n=8), the group mean fold changes were extremely
close to 1 for those probes, even though they were not used in the
normalization procedure per se.
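
As a rough sketch of that kind of check (not the analysis from the study itself), the Python below takes normalized values for control spike-ins that are expected to be unchanged between two groups, computes the group-mean log2 fold change for each, and reports the fraction that exceed a fold-change cutoff. The function name, the cutoff of 1.5, and the data are hypothetical. A real analysis would more likely count how many controls are called significant by the same test applied to the genes.

```python
import math
from statistics import mean


def null_spike_summary(group_a, group_b, fc_threshold=1.5):
    """Summarize fold changes for controls expected to be unchanged.

    group_a, group_b: dicts mapping a spike-in id to a list of
                      normalized expression values, one per sample.
    fc_threshold: hypothetical fold-change cutoff for a "call".
    Returns (log2 fold change per spike-in, fraction exceeding cutoff).
    """
    log2_fc = {
        spike: math.log2(mean(group_a[spike]) / mean(group_b[spike]))
        for spike in group_a
    }
    limit = math.log2(fc_threshold)
    # Any unchanged control beyond the cutoff counts as a false call.
    flagged = [s for s, lfc in log2_fc.items() if abs(lfc) > limit]
    return log2_fc, len(flagged) / len(log2_fc)


# Made-up normalized values, four samples per group.
group_a = {
    "ctrl_1": [100, 110, 90, 100],
    "ctrl_2": [50, 50, 50, 50],
    "ctrl_3": [20, 22, 18, 20],
    "ctrl_4": [10, 10, 10, 10],
}
group_b = {
    "ctrl_1": [105, 95, 100, 100],
    "ctrl_2": [100, 100, 100, 100],
    "ctrl_3": [21, 19, 20, 20],
    "ctrl_4": [10, 10, 10, 10],
}
log2_fc, false_call_rate = null_spike_summary(group_a, group_b)
print(log2_fc, false_call_rate)
```
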
I'd be happy to discuss more off the list, and point you to
publications where I used them as a measure of false discovery.
From: Agnes Paquet [mailto:firstname.lastname@example.org]
Sent: Thursday, February 06, 2014 9:05 AM
To: bioconductor at r-project.org
Subject: [BioC] Normalization of RNAseq data using ERCC?
We have just started using ERCC spike-in controls in our RNAseq
experiments. I have looked for recommended approaches on how to use
the controls for normalization, but I couldn't find much information.
From what I read, I am planning to use the spike-ins to estimate the
sizeFactors in our differential analysis pipeline. Is there a better
approach that we could use to normalize our data based on the spike-ins?
Can anyone recommend any paper covering that topic?
Thank you for your help,