Using edgeR and a spike-in to calculate absolute abundance
1
0
Entering edit mode
@robertchen-14677
Last seen 4 months ago

I had a question about using edgeR to perform inference on samples that have been sequenced and that have a consistent amount of external spike-in added to each sample./

In my experiment, I have 10 samples split evenly across two comparison groups. (n = 5 in treatment A, n = 5 in treatment B). These samples were the same type of tissue and were prepared for RNA-Seq. Prior to sequencing, an external spike-in was added to each sample (the same amount per sample). The tissue mass was measured before processing.

I would like to perform a differential expression analysis between the two treatment groups. edgeR inherently normalizes using TMM, treating the data as compositional. However, is it possible to use the spike-in information to scale the TMM factor by an absolute abundance factor, so that we are working on 'absolute' counts? What I have tried so far is the following:

  1. Calculate a scaling factor, which is the relative abundance of the spike-in.
  2. Remove the spike-in counts from my samples.
  3. Perform TMM normalization with edgeR.
  4. Multiply the TMM normalization scaling factor by my spike-in scaling factor to get a absolute-abundance-scaled TMM factor.
  5. Proceed per usual with edgeR pipeline.

See below for mock example:

###### 0. Calculating the spike-in factor
# counts_raw = mxn matrix where m = genes, n = samples
spike_in_index <- which(rownames(counts_raw) == 'spike_in')
spike_in_factor <- as.numeric(counts_raw[spike_in_index, ])/colSums(counts_raw)
counts_without_spike_in <- counts_raw[rownames(counts_raw) != 'spike_in', ]

###### 1. EdgeR Pipeline
### i. Create a DGEList
y <- DGEList(counts = counts_without_spike_in, samples = mapping_file)
### ii. Create a design
design <- model.matrix(~ 0 + Treatment, data = mapping_file)
### iii. Calculate normalization factors
y <-  calcNormFactors(y)
### iv. Multiply normalization factor by spike-in factor
y$samples$norm.factors <- y$samples$norm.factors*y$samples$spike_in_factor
### v. Estimate dispersions
y <- estimateDisp(y, design, trend.method = 'locfit')
### vi. Fit the glm
fit_y <- glmQLFit(y, design)
### vii. Specify contrasts
contrasts <- makeContrasts(
  TreatmentA_vs_TreatmentB = TreatmentB - TreatmentA,
  levels = design
)
### vii. Quasilielihood F-test
qlf <- glmQLFTest(fit_y, contrast = contrasts[, 'TreatmentA_vs_TreatmentB'])
res_qlf <- data.frame(topTags(qlf, n = nrow(y$counts)))

Would love anyone's feedback on this topic!

edgeR SpikeIn RNASeq • 495 views
ADD COMMENT
1
Entering edit mode

Am I understanding correctly that you have just one row of spike-in counts?

ADD REPLY
0
Entering edit mode

Yup that's correct!

ADD REPLY
1
Entering edit mode
@gordon-smyth
Last seen 24 minutes ago
WEHI, Melbourne, Australia

You're on the right track but it's actually slighly simpler. No need to run calcNormFactors:

y <- DGEList(counts = counts_without_spike_in, samples = mapping_file)
norm.factors <- spike_in_factor / y$samples$lib.size
norm.factors <- norm.factors / prod(norm.factors)^(1/length(norm.factors))
y$samples$norm.factors <- norm.factors

Then just proceed with as usual in edgeR without further normalization, for example

y <- estimateDisp(y, design, trend.method = 'locfit')

etc.

ADD COMMENT
0
Entering edit mode

Okay awesome, thank you so much!!

ADD REPLY
0
Entering edit mode

I realize that I answered quickly without understanding. What's the reason for not needing to calculate normalization factors? I didn't realize that having a spike-in eliminated that need.

ADD REPLY
1
Entering edit mode

The purpose of spike-ins is to normalize the library sizes. The purpose of TMM is to normalize the library sizes. You obviously can't normalize the library sizes in two different ways simultaneously. You have to choose one or the other.

Further than that, the whole purpose of spike-in normalization is to handle situations that TMM can't. Usually spike-in normalization is used when global differential expression is expected, i.e., when more than half of all genes may be DE in one direction. On the other hand, if the assumptions of TMM were satisfied then one wouldn't even consider spike-ins. TMM normalization is enormously more accurate than spike-in normalization when its assumptions are satisified.

ADD REPLY
0
Entering edit mode

Okay this is INCREDIBLY helpful, thank you Gordon!

ADD REPLY
1
Entering edit mode

Note I edited my answer today to fix a problem with the suggested code.

ADD REPLY
0
Entering edit mode

I just saw it, thanks!

ADD REPLY

Login before adding your answer.

Traffic: 493 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6