Question: DESeq and Limma+Voom Normalization for Rna-Seq Data Using Ercc Spike-In
1
3.4 years ago by
komal.rathi70
United States
komal.rathi70 wrote:

Hello everyone,

I have some RNASeq data, that has been spiked-in with exogenous ERCC controls. I have the counts from htseq-count and now I want to normalize + test for differential expression. For this, I am using two methods: DESeq and limma+voom. Below is my code for both the tests. Can someone tell me if what I am doing is correct or not? I do think that I am slightly mistaken in the voom code.

##### Using DESeq
library(DESeq)
# get the size factors from the spiked-ins
# dat.spike.in is count data of only spike ins
cds <- newCountDataSet(countData = dat.spike.in, conditions = as.character(condition))
cds <- estimateSizeFactors(cds)
nf <- sizeFactors(cds)

# add the size factors to data without spike-ins
# dat.without.spike.in is count data where spike-ins are removed
cds <- newCountDataSet(countData = dat.without.spike.in, conditions = as.character(condition))
sizeFactors(cds) <- nf

# differential expression test
cds <- estimateDispersions( cds, method= 'pooled', sharingMode='maximum', fitType='local')
deseq.res <- nbinomTest( cds, "A", "B")

##### Using Limma + Voom
library(limma)
library(edgeR)
groups = as.factor(condition) # conditions
design <- model.matrix(~0+groups)
colnames(design) = levels(groups)
nf <- calcNormFactors(dat.spike.in) # calculation norm factors using spike ins only
voom.norm <- voom(dat.without.spike.in, design, lib.size = colSums(dat.spike.in) * nf) # add that to voom

Thanks!

P.S: Crossposted on Biostars to get inputs from both communities.

normalization limma deseq voom ercc • 3.5k views
modified 3.4 years ago by Gordon Smyth37k • written 3.4 years ago by komal.rathi70
Answer: DESeq and Limma+Voom Normalization for Rna-Seq Data Using Ercc Spike-In
7
3.4 years ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

I suspect that you are doing what others have done, but I don't recommend this method of normalizing by spike-ins.

Somewhat better would be:

N <- colSums(dat.without.spike.in)
nf <- calcNormFactors(dat.spike.in, lib.size=N)
voom.norm <- voom(dat.without.spike.in, design, lib.size = N * nf)

but I don't really recommend that either.

I prefer to normalize on all genes in the usual way rather than trying to normalize on a small number of spike-ins. The problem is that spike-ins are inherently variable, as is remarked on in this paper:

Have a look also at this paper:

They show that "spike-ins are not reliable enough to be used in standard global-scaling normalization procedures" -- which is what you are doing when you call calcNormFactors() or estimateSizeFactors() in your code.

If you really insist on using spike-ins for normalization, then the best bet would be the RUVseq package described in the above paper. The RUVseq vignette shows I think how it can be used with edgeR or limma.

Thank you for your detailed response. Not to sound naive but if spike-ins are not reliable for normalization, what is their purpose? What do people generally do with them?

ADD REPLYlink modified 3.4 years ago by Ryan C. Thompson7.2k • written 3.4 years ago by komal.rathi70
2

Well, spike-ins can be used to assess accuracy and technical performance, see

and

A number of companies have been advocating spike-ins for normalization, for example:

and no doubt a lot of biologists have followed this advice. We went through the same process with microarrays 15 years ago, whereby spike-ins were proposed for normalization but the genomic methods proposed by statistical bioinformaticians gradually took over. There may be special situations, e.g., where all or most genes are DE, in which normalization by spike-ins is a good idea.

Thank you, this really helps.

Answer: DESeq and Limma+Voom Normalization for Rna-Seq Data Using Ercc Spike-In
2
3.4 years ago by
Aaron Lun23k
Cambridge, United Kingdom
Aaron Lun23k wrote:

Yep, looks good for voom. In fact, you could probably just use the count sums for the spike-ins directly as the library sizes, without multiplying by nf. I would be surprised if the normalization factors differ substantially from unity, as there shouldn't be any DE in the spike-in counts between samples. Of course, the implicit assumption of this normalization strategy is that you've added the same amount of spike-in (per cell, or per quantity of cellular RNA) to each sample. Whether that's true or not is probably dependent on how good your operator is.