Using edgeR and a spike-in to calculate absolute abundance
1
0
Entering edit mode
@robertchen-14677
Last seen 10 months ago

I had a question about using edgeR to perform inference on samples that have been sequenced and that have a consistent amount of external spike-in added to each sample./

In my experiment, I have 10 samples split evenly across two comparison groups. (n = 5 in treatment A, n = 5 in treatment B). These samples were the same type of tissue and were prepared for RNA-Seq. Prior to sequencing, an external spike-in was added to each sample (the same amount per sample). The tissue mass was measured before processing.

I would like to perform a differential expression analysis between the two treatment groups. edgeR inherently normalizes using TMM, treating the data as compositional. However, is it possible to use the spike-in information to scale the TMM factor by an absolute abundance factor, so that we are working on 'absolute' counts? What I have tried so far is the following:

1. Calculate a scaling factor, which is the relative abundance of the spike-in.
2. Remove the spike-in counts from my samples.
3. Perform TMM normalization with edgeR.
4. Multiply the TMM normalization scaling factor by my spike-in scaling factor to get a absolute-abundance-scaled TMM factor.
5. Proceed per usual with edgeR pipeline.

See below for mock example:

###### 0. Calculating the spike-in factor
# counts_raw = mxn matrix where m = genes, n = samples
spike_in_index <- which(rownames(counts_raw) == 'spike_in')
spike_in_factor <- as.numeric(counts_raw[spike_in_index, ])/colSums(counts_raw)
counts_without_spike_in <- counts_raw[rownames(counts_raw) != 'spike_in', ]

###### 1. EdgeR Pipeline
### i. Create a DGEList
y <- DGEList(counts = counts_without_spike_in, samples = mapping_file)
### ii. Create a design
design <- model.matrix(~ 0 + Treatment, data = mapping_file)
### iii. Calculate normalization factors
y <-  calcNormFactors(y)
### iv. Multiply normalization factor by spike-in factor
y$samples$norm.factors <- y$samples$norm.factors*y$samples$spike_in_factor
### v. Estimate dispersions
y <- estimateDisp(y, design, trend.method = 'locfit')
### vi. Fit the glm
fit_y <- glmQLFit(y, design)
### vii. Specify contrasts
contrasts <- makeContrasts(
TreatmentA_vs_TreatmentB = TreatmentB - TreatmentA,
levels = design
)
### vii. Quasilielihood F-test
qlf <- glmQLFTest(fit_y, contrast = contrasts[, 'TreatmentA_vs_TreatmentB'])
res_qlf <- data.frame(topTags(qlf, n = nrow(y$counts))) Would love anyone's feedback on this topic! edgeR SpikeIn RNASeq • 884 views ADD COMMENT 1 Entering edit mode Am I understanding correctly that you have just one row of spike-in counts? ADD REPLY 0 Entering edit mode Yup that's correct! ADD REPLY 1 Entering edit mode @gordon-smyth Last seen 1 hour ago WEHI, Melbourne, Australia You're on the right track but it's actually slightly simpler. No need to run calcNormFactors: y <- DGEList(counts = counts_without_spike_in, samples = mapping_file) norm.factors <- spike_in_factor / y$samples$lib.size norm.factors <- norm.factors / prod(norm.factors)^(1/length(norm.factors)) y$samples\$norm.factors <- norm.factors

Then just proceed with as usual in edgeR without further normalization, for example

y <- estimateDisp(y, design, trend.method = 'locfit')

etc.

0
Entering edit mode

Okay awesome, thank you so much!!

0
Entering edit mode

I realize that I answered quickly without understanding. What's the reason for not needing to calculate normalization factors? I didn't realize that having a spike-in eliminated that need.

1
Entering edit mode

The purpose of spike-ins is to normalize the library sizes. The purpose of TMM is to normalize the library sizes. You obviously can't normalize the library sizes in two different ways simultaneously. You have to choose one or the other.

Further than that, the whole purpose of spike-in normalization is to handle situations that TMM can't. Usually spike-in normalization is used when global differential expression is expected, i.e., when more than half of all genes may be DE in one direction. On the other hand, if the assumptions of TMM were satisfied then one wouldn't even consider spike-ins. TMM normalization is enormously more accurate than spike-in normalization when its assumptions are satisified.

0
Entering edit mode

Okay this is INCREDIBLY helpful, thank you Gordon!

1
Entering edit mode

Note I edited my answer today to fix a problem with the suggested code.

0
Entering edit mode

I just saw it, thanks!

0
Entering edit mode

Hi Smyth, I was following your procedure to perform spike-in normalization in edgeR. If I understand correctly, the spike_in_factor = raw_spike_in_counts / total_library_size. Then, the spike_in_factor are further divided by non_spike_in_library_size. My question is why the raw_spike_in_counts need to divide by the total_library_size and non_spike_in_library_size.

Also, I saw in other posts, Aaron Lun recommended to pass the spike-in library to the edgeR's calcNormFactors() for calulating normalization factors. I tried both your and Aaron's suggestion, the normalization factors from two methods were quite different and I was confused of which one to use. What is the difference between your and Aaron's methods?