Using edgeR and a spike-in to calculate absolute abundance
1
0
Entering edit mode
@robertchen-14677
Last seen 22 months ago

I had a question about using edgeR to perform inference on samples that have been sequenced and that have a consistent amount of external spike-in added to each sample./

In my experiment, I have 10 samples split evenly across two comparison groups. (n = 5 in treatment A, n = 5 in treatment B). These samples were the same type of tissue and were prepared for RNA-Seq. Prior to sequencing, an external spike-in was added to each sample (the same amount per sample). The tissue mass was measured before processing.

I would like to perform a differential expression analysis between the two treatment groups. edgeR inherently normalizes using TMM, treating the data as compositional. However, is it possible to use the spike-in information to scale the TMM factor by an absolute abundance factor, so that we are working on 'absolute' counts? What I have tried so far is the following:

1. Calculate a scaling factor, which is the relative abundance of the spike-in.
2. Remove the spike-in counts from my samples.
3. Perform TMM normalization with edgeR.
4. Multiply the TMM normalization scaling factor by my spike-in scaling factor to get a absolute-abundance-scaled TMM factor.
5. Proceed per usual with edgeR pipeline.

See below for mock example:

###### 0. Calculating the spike-in factor
# counts_raw = mxn matrix where m = genes, n = samples
spike_in_index <- which(rownames(counts_raw) == 'spike_in')
spike_in_factor <- as.numeric(counts_raw[spike_in_index, ])/colSums(counts_raw)
counts_without_spike_in <- counts_raw[rownames(counts_raw) != 'spike_in', ]

###### 1. EdgeR Pipeline
### i. Create a DGEList
y <- DGEList(counts = counts_without_spike_in, samples = mapping_file)
### ii. Create a design
design <- model.matrix(~ 0 + Treatment, data = mapping_file)
### iii. Calculate normalization factors
y <-  calcNormFactors(y)
### iv. Multiply normalization factor by spike-in factor
y$samples$norm.factors <- y$samples$norm.factors*y$samples$spike_in_factor
### v. Estimate dispersions
y <- estimateDisp(y, design, trend.method = 'locfit')
### vi. Fit the glm
fit_y <- glmQLFit(y, design)
### vii. Specify contrasts
contrasts <- makeContrasts(
TreatmentA_vs_TreatmentB = TreatmentB - TreatmentA,
levels = design
)
### vii. Quasilielihood F-test
qlf <- glmQLFTest(fit_y, contrast = contrasts[, 'TreatmentA_vs_TreatmentB'])
res_qlf <- data.frame(topTags(qlf, n = nrow(y$counts)))  Would love anyone's feedback on this topic! edgeR SpikeIn RNASeq • 1.9k views ADD COMMENT 1 Entering edit mode Am I understanding correctly that you have just one row of spike-in counts? ADD REPLY 0 Entering edit mode Yup that's correct! ADD REPLY 1 Entering edit mode @gordon-smyth Last seen 6 minutes ago WEHI, Melbourne, Australia You're on the right track but it's actually slightly simpler. No need to run calcNormFactors: y <- DGEList(counts = counts_without_spike_in, samples = mapping_file) norm.factors <- spike_in_factor / y$samples$lib.size norm.factors <- norm.factors / prod(norm.factors)^(1/length(norm.factors)) y$samples$norm.factors <- norm.factors  Then just proceed with as usual in edgeR without further normalization, for example y <- estimateDisp(y, design, trend.method = 'locfit')  etc. ADD COMMENT 0 Entering edit mode Okay awesome, thank you so much!! ADD REPLY 0 Entering edit mode I realize that I answered quickly without understanding. What's the reason for not needing to calculate normalization factors? I didn't realize that having a spike-in eliminated that need. ADD REPLY 1 Entering edit mode The purpose of spike-ins is to normalize the library sizes. The purpose of TMM is to normalize the library sizes. You obviously can't normalize the library sizes in two different ways simultaneously. You have to choose one or the other. Further than that, the whole purpose of spike-in normalization is to handle situations that TMM can't. Usually spike-in normalization is used when global differential expression is expected, i.e., when more than half of all genes may be DE in one direction. On the other hand, if the assumptions of TMM were satisfied then one wouldn't even consider spike-ins. TMM normalization is enormously more accurate than spike-in normalization when its assumptions are satisified. ADD REPLY 0 Entering edit mode Okay this is INCREDIBLY helpful, thank you Gordon! ADD REPLY 1 Entering edit mode Note I edited my answer today to fix a problem with the suggested code. ADD REPLY 0 Entering edit mode I just saw it, thanks! ADD REPLY 0 Entering edit mode Hi Smyth, I was following your procedure to perform spike-in normalization in edgeR. If I understand correctly, the spike_in_factor = raw_spike_in_counts / total_library_size. Then, the spike_in_factor are further divided by non_spike_in_library_size. My question is why the raw_spike_in_counts need to divide by the total_library_size and non_spike_in_library_size. Also, I saw in other posts, Aaron Lun recommended to pass the spike-in library to the edgeR's calcNormFactors() for calulating normalization factors. I tried both your and Aaron's suggestion, the normalization factors from two methods were quite different and I was confused of which one to use. What is the difference between your and Aaron's methods? Thanks in advance! ADD REPLY 0 Entering edit mode Hi Gordon, and everyone, We are working with SNAP-ChIP assays that uses approx 20 spike-in sequences. When we do the normalization, shall I use the same formula below , or I was wondering whether there are updates : y <- DGEList(counts = counts_without_spike_in, samples = mapping_file) norm.factors <- spike_in_factor / y$samples$lib.size norm.factors <- norm.factors / prod(norm.factors)^(1/length(norm.factors)) y$samples\$norm.factors <- norm.factors