Question: EdgeR applied to peptide screen: help!
0
7 months ago by
oakhamwolf0 wrote:

Dear all,

I'm trying to apply edgeR to the analysis of a peptide screen and I have a few questions I hope I can get help with.

Background

We are following a similar protocol to Matochko et al (PMID:24217917) as we are trying to identify "fast-grower" sequences (i.e. those that appear to proliferate in the absence of any intended selection pressure) in our libraries so that we can ultimately remove them. We have 5 separately amplified samples of the same naive peptide library as a starting point and each go through three rounds of "screening", which is essentially amplification and PEG ppt. without selection.  We have sequenced the 5 initially amplified libraries (A30_1-5) and the corresponding 5 x 3 rounds of the "screen" (NSP1_1-5, NSP2_1-5 and NSP3_1-5). The library sizes are large (~4M for each NSP3 dataset and 7M for each A30 dataset). Overall if nothing is misbehaving then when this data is analysed in an RNA-Seq type analysis framework like edgeR, we shouldn't see any differential enrichment between A30 vs NSP3 data. If we do then these are our "fast-growers". Note that the data is paired and A30_1 is paired with NSP3_1 etc.

The input to the analysis comprises all unique sequences across all 5 NSP3 datasets (akin to genes in a genome but instead of ~20K we have 6.5M) and their counts in each 5 x A30 dataset and 5 x NSP3 dataset. I have subsequently narrowed this dataset down to those sequences with a count >0 in at least 5 datasets, which takes us to about 500K sequences.

Questions

1. Our libraries are very complex and diverse and while every effort was made to sequence as much as we could we cannot sequence our whole libraries within realistic budget constraints. Therefore, not every unique NSP3 sequence is present in all NSP3 or A30 datasets (each of the 10 samples is about 30% 0s). Furthermore our overall counts are low (min=0, max=~500, mean=~1.5, sd=~3-5 depending on the dataset). Will this high zero, low count dataset be sensible to analyse in the edgeR framework?
2. As the input dataset is a filtered dataset and edgeR calculates library size from the count data I have intentionally forced the y$samples$lib.size to be the original library size rather than the one calculated from the counts. I've done this as for the CPM calculations and normalisation I wanted to ensure that the true library size was provided. I have read [EdgeR] question about the y$samples$lib.size and y$samples$norm.factors that this shouldn't be required but it seems to make a difference to the results in our case. Is this a sensible thing to do?
3. In terms of normalisation, I am struggling to work out which method would be best in our case. I know that TMM and others account for library size *and* library composition. However, in a screen such as ours it seems to be that the process of a sequence becoming enriched is rather different to looking at abundance. Enrichment seems to mirror compositional bias in that one sequence legitimately enriches at the expense of another. I am concerned that applying a normalisation method that removes this affect, will remove the very thing we want to model. I know that calcNormFactors() takes the method "none" but other than setting all the norm.factors to 1 I have not found out whether or not doing so just accounts for library size in the analysis and not composition or if it does something else. Please can someone let me know whether my thinking is right here and I was to use it, what "none" is actually doing?

My apologies for a lengthy post but I have been saving these questions up!

If anyone has any other helpful hints then I would be grateful to hear them as I have not found much in the literature about applying such a method in this context.

modified 7 months ago by Gordon Smyth37k • written 7 months ago by oakhamwolf0
Answer: EdgeR applied to peptide screen: help!
3
7 months ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

Question 1: in and of itself, the high number of zeroes is not a problem for edgeR. Count-based models naturally handle zeroes when the mean is low, and we've been using edgeR for single-cell data where the frequency of zeroes is often higher. The real concern is the variability between replicates - screens seem to be more variable than other types of 'omics experiments, presumably because any initial differences get amplified over time.

Question 2: Assuming you did this correctly, any difference should be slight and is caused by the calculation of other statistics where the normalization factor is not used (and thus cannot cancel out differences in the library size). For example:

library(edgeR)
y <- matrix(rnbinom(10000, mu=runif(1000, 1, 100), size=10), ncol=10)
design <- model.matrix(~gl(2, 5))

d <- DGEList(y)
d <- calcNormFactors(d)
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design)
res <- glmQLFTest(fit, coef=2)
topTags(res)

d2 <- d
d2$samples$lib.size <- sample(d2$samples$lib.size)
d2 <- calcNormFactors(d2)
d2 <- estimateDisp(d2, design)
fit2 <- glmQLFit(d2, design)
res2 <- glmQLFTest(fit2, coef=2)
topTags(res2)

I get pretty similar results between these two, despite the fact that the library sizes in d2 are scrambled and don't make any sense.

Question 3: Consider a sequence X that has no effect on growth. Let's say that there are other sequences that increase growth (i.e., your fast growers) and have greater coverage in your later time points. If you don't normalize for composition biases, the increased coverage of your fast growers at later time points will push down the coverage of X. Then, when you do the analysis, you will think that X decreases over time, and thus has a deleterious effect on growth - even though it has no true effect. Indeed, in the most extreme cases, extreme fast growers can push down the coverage of other fast growers, making them seem like they have no or negative effects on growth. Normalization with methods like TMM protects you from such misleading conclusions.

There are two points worth mentioning here, though. The first is that TMM normalization (and almost all other methods) assumes that most features are not DE between pairs of samples. Specifically, if more than 30% of your peptides are fast growers, then you're in trouble. The other thing to note is that, should you decide to set method="none" (probably a bad idea, but nonetheless), then only the library size contributes to normalization. This is because the scaling factor that is used internally by edgeR is the product of the library size and normalization factor, i.e., the effective library size.

It's worth looking at https://f1000research.com/articles/3-95/v2 if you haven't already.

Answer: EdgeR applied to peptide screen: help!
1
7 months ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

In the latest Bioconductor 3.8 Release, we introduced a new option "TMMwzp" for calcNormFactors() that will generally be better than "TMM" when the counts include a high proportion of zeros.