Question

Using edgeR and a spike-in to calculate absolute abundance

0

Entering edit mode

robert.chen • 0

@robertchen-14677

Last seen 3.1 years ago

I had a question about using edgeR to perform inference on samples that have been sequenced and that have a consistent amount of external spike-in added to each sample./

In my experiment, I have 10 samples split evenly across two comparison groups. (n = 5 in treatment A, n = 5 in treatment B). These samples were the same type of tissue and were prepared for RNA-Seq. Prior to sequencing, an external spike-in was added to each sample (the same amount per sample). The tissue mass was measured before processing.

I would like to perform a differential expression analysis between the two treatment groups. edgeR inherently normalizes using TMM, treating the data as compositional. However, is it possible to use the spike-in information to scale the TMM factor by an absolute abundance factor, so that we are working on 'absolute' counts? What I have tried so far is the following:

Calculate a scaling factor, which is the relative abundance of the spike-in.
Remove the spike-in counts from my samples.
Perform TMM normalization with edgeR.
Multiply the TMM normalization scaling factor by my spike-in scaling factor to get a absolute-abundance-scaled TMM factor.
Proceed per usual with edgeR pipeline.

See below for mock example:

###### 0. Calculating the spike-in factor
# counts_raw = mxn matrix where m = genes, n = samples
spike_in_index <- which(rownames(counts_raw) == 'spike_in')
spike_in_factor <- as.numeric(counts_raw[spike_in_index, ])/colSums(counts_raw)
counts_without_spike_in <- counts_raw[rownames(counts_raw) != 'spike_in', ]

###### 1. EdgeR Pipeline
### i. Create a DGEList
y <- DGEList(counts = counts_without_spike_in, samples = mapping_file)
### ii. Create a design
design <- model.matrix(~ 0 + Treatment, data = mapping_file)
### iii. Calculate normalization factors
y <-  calcNormFactors(y)
### iv. Multiply normalization factor by spike-in factor
y$samples$norm.factors <- y$samples$norm.factors*y$samples$spike_in_factor
### v. Estimate dispersions
y <- estimateDisp(y, design, trend.method = 'locfit')
### vi. Fit the glm
fit_y <- glmQLFit(y, design)
### vii. Specify contrasts
contrasts <- makeContrasts(
  TreatmentA_vs_TreatmentB = TreatmentB - TreatmentA,
  levels = design
)
### vii. Quasilielihood F-test
qlf <- glmQLFTest(fit_y, contrast = contrasts[, 'TreatmentA_vs_TreatmentB'])
res_qlf <- data.frame(topTags(qlf, n = nrow(y$counts)))

Would love anyone's feedback on this topic!

edgeR SpikeIn RNASeq • 3.7k views

ADD COMMENT • link updated 6 months ago by Miguel • 0 • written 3.2 years ago by robert.chen • 0

1

Entering edit mode

Am I understanding correctly that you have just one row of spike-in counts?

ADD REPLY • link 3.2 years ago Gordon Smyth 50k

0

Entering edit mode

Yup that's correct!

ADD REPLY • link 3.1 years ago robert.chen • 0

0

Entering edit mode

I have the same issue. As far as I understand, many of the approximations to compute differential abundances include normalization steps which remove the quantitative information from spike-in normalised read counts. MetagenomeSeq or DESeq2 seem to also include or assume normalized library sizes. I am not sure how to tune this manually in some of the cases and if the models behind assume compositional data. If that`s the case maybe the mathematical framework to analyse this data is rather limited. Any thoughts?

ADD REPLY • link 6 months ago Miguel • 0

score 1 · Answer 1 · 2021-03-01

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

You're on the right track but it's actually slightly simpler. No need to run calcNormFactors:

y <- DGEList(counts = counts_without_spike_in, samples = mapping_file)
norm.factors <- spike_in_factor / y$samples$lib.size
norm.factors <- norm.factors / prod(norm.factors)^(1/length(norm.factors))
y$samples$norm.factors <- norm.factors

Then just proceed with as usual in edgeR without further normalization, for example

y <- estimateDisp(y, design, trend.method = 'locfit')

etc.

ADD COMMENT • link 3.2 years ago • updated 2.4 years ago Gordon Smyth 50k

0

Entering edit mode

Okay awesome, thank you so much!!

ADD REPLY • link 3.1 years ago robert.chen • 0

0

Entering edit mode

I realize that I answered quickly without understanding. What's the reason for not needing to calculate normalization factors? I didn't realize that having a spike-in eliminated that need.

ADD REPLY • link 3.1 years ago robert.chen • 0

1

Entering edit mode

The purpose of spike-ins is to normalize the library sizes. The purpose of TMM is to normalize the library sizes. You obviously can't normalize the library sizes in two different ways simultaneously. You have to choose one or the other.

Further than that, the whole purpose of spike-in normalization is to handle situations that TMM can't. Usually spike-in normalization is used when global differential expression is expected, i.e., when more than half of all genes may be DE in one direction. On the other hand, if the assumptions of TMM were satisfied then one wouldn't even consider spike-ins. TMM normalization is enormously more accurate than spike-in normalization when its assumptions are satisified.

ADD REPLY • link 3.1 years ago Gordon Smyth 50k

0

Entering edit mode

Okay this is INCREDIBLY helpful, thank you Gordon!

ADD REPLY • link 3.1 years ago robert.chen • 0

1

Entering edit mode

Note I edited my answer today to fix a problem with the suggested code.

ADD REPLY • link 3.1 years ago Gordon Smyth 50k

0

Entering edit mode

I just saw it, thanks!

ADD REPLY • link 3.1 years ago robert.chen • 0

0

Entering edit mode

Hi Smyth, I was following your procedure to perform spike-in normalization in edgeR. If I understand correctly, the spike_in_factor = raw_spike_in_counts / total_library_size. Then, the spike_in_factor are further divided by non_spike_in_library_size. My question is why the raw_spike_in_counts need to divide by the total_library_size and non_spike_in_library_size.

Also, I saw in other posts, Aaron Lun recommended to pass the spike-in library to the edgeR's calcNormFactors() for calulating normalization factors. I tried both your and Aaron's suggestion, the normalization factors from two methods were quite different and I was confused of which one to use. What is the difference between your and Aaron's methods?

Thanks in advance!

ADD REPLY • link 2.4 years ago lida • 0

0

Entering edit mode

Hi Gordon, and everyone,

We are working with SNAP-ChIP assays that uses approx 20 spike-in sequences.

When we do the normalization, shall I use the same formula below , or I was wondering whether there are updates :

y <- DGEList(counts = counts_without_spike_in, samples = mapping_file)
norm.factors <- spike_in_factor / y$samples$lib.size
norm.factors <- norm.factors / prod(norm.factors)^(1/length(norm.factors))
y$samples$norm.factors <- norm.factors

ADD REPLY • link 17 months ago Bogdan ▴ 670