I had a question about using edgeR to perform inference on samples that have been sequenced and that have a consistent amount of external spike-in added to each sample./
In my experiment, I have 10 samples split evenly across two comparison groups. (n = 5 in treatment A, n = 5 in treatment B). These samples were the same type of tissue and were prepared for RNA-Seq. Prior to sequencing, an external spike-in was added to each sample (the same amount per sample). The tissue mass was measured before processing.
I would like to perform a differential expression analysis between the two treatment groups. edgeR inherently normalizes using TMM, treating the data as compositional. However, is it possible to use the spike-in information to scale the TMM factor by an absolute abundance factor, so that we are working on 'absolute' counts? What I have tried so far is the following:
- Calculate a scaling factor, which is the relative abundance of the spike-in.
- Remove the spike-in counts from my samples.
- Perform TMM normalization with edgeR.
- Multiply the TMM normalization scaling factor by my spike-in scaling factor to get a absolute-abundance-scaled TMM factor.
- Proceed per usual with edgeR pipeline.
See below for mock example:
###### 0. Calculating the spike-in factor
# counts_raw = mxn matrix where m = genes, n = samples
spike_in_index <- which(rownames(counts_raw) == 'spike_in')
spike_in_factor <- as.numeric(counts_raw[spike_in_index, ])/colSums(counts_raw)
counts_without_spike_in <- counts_raw[rownames(counts_raw) != 'spike_in', ]
###### 1. EdgeR Pipeline
### i. Create a DGEList
y <- DGEList(counts = counts_without_spike_in, samples = mapping_file)
### ii. Create a design
design <- model.matrix(~ 0 + Treatment, data = mapping_file)
### iii. Calculate normalization factors
y <- calcNormFactors(y)
### iv. Multiply normalization factor by spike-in factor
y$samples$norm.factors <- y$samples$norm.factors*y$samples$spike_in_factor
### v. Estimate dispersions
y <- estimateDisp(y, design, trend.method = 'locfit')
### vi. Fit the glm
fit_y <- glmQLFit(y, design)
### vii. Specify contrasts
contrasts <- makeContrasts(
TreatmentA_vs_TreatmentB = TreatmentB - TreatmentA,
levels = design
)
### vii. Quasilielihood F-test
qlf <- glmQLFTest(fit_y, contrast = contrasts[, 'TreatmentA_vs_TreatmentB'])
res_qlf <- data.frame(topTags(qlf, n = nrow(y$counts)))
Would love anyone's feedback on this topic!
Am I understanding correctly that you have just one row of spike-in counts?
Yup that's correct!
I have the same issue. As far as I understand, many of the approximations to compute differential abundances include normalization steps which remove the quantitative information from spike-in normalised read counts. MetagenomeSeq or DESeq2 seem to also include or assume normalized library sizes. I am not sure how to tune this manually in some of the cases and if the models behind assume compositional data. If that`s the case maybe the mathematical framework to analyse this data is rather limited. Any thoughts?