I'm trying to apply edgeR to the analysis of a peptide screen and I have a few questions I hope I can get help with.
We are following a similar protocol to Matochko et al (PMID:24217917) as we are trying to identify "fast-grower" sequences (i.e. those that appear to proliferate in the absence of any intended selection pressure) in our libraries so that we can ultimately remove them. We have 5 separately amplified samples of the same naive peptide library as a starting point and each go through three rounds of "screening", which is essentially amplification and PEG ppt. without selection. We have sequenced the 5 initially amplified libraries (A30_1-5) and the corresponding 5 x 3 rounds of the "screen" (NSP1_1-5, NSP2_1-5 and NSP3_1-5). The library sizes are large (~4M for each NSP3 dataset and 7M for each A30 dataset). Overall if nothing is misbehaving then when this data is analysed in an RNA-Seq type analysis framework like edgeR, we shouldn't see any differential enrichment between A30 vs NSP3 data. If we do then these are our "fast-growers". Note that the data is paired and A30_1 is paired with NSP3_1 etc.
The input to the analysis comprises all unique sequences across all 5 NSP3 datasets (akin to genes in a genome but instead of ~20K we have 6.5M) and their counts in each 5 x A30 dataset and 5 x NSP3 dataset. I have subsequently narrowed this dataset down to those sequences with a count >0 in at least 5 datasets, which takes us to about 500K sequences.
- Our libraries are very complex and diverse and while every effort was made to sequence as much as we could we cannot sequence our whole libraries within realistic budget constraints. Therefore, not every unique NSP3 sequence is present in all NSP3 or A30 datasets (each of the 10 samples is about 30% 0s). Furthermore our overall counts are low (min=0, max=~500, mean=~1.5, sd=~3-5 depending on the dataset). Will this high zero, low count dataset be sensible to analyse in the edgeR framework?
- As the input dataset is a filtered dataset and edgeR calculates library size from the count data I have intentionally forced the
y$samples$lib.sizeto be the original library size rather than the one calculated from the counts. I've done this as for the CPM calculations and normalisation I wanted to ensure that the true library size was provided. I have read [EdgeR] question about the y$samples$lib.size and y$samples$norm.factors that this shouldn't be required but it seems to make a difference to the results in our case. Is this a sensible thing to do?
- In terms of normalisation, I am struggling to work out which method would be best in our case. I know that TMM and others account for library size *and* library composition. However, in a screen such as ours it seems to be that the process of a sequence becoming enriched is rather different to looking at abundance. Enrichment seems to mirror compositional bias in that one sequence legitimately enriches at the expense of another. I am concerned that applying a normalisation method that removes this affect, will remove the very thing we want to model. I know that
calcNormFactors()takes the method "none" but other than setting all the
norm.factorsto 1 I have not found out whether or not doing so just accounts for library size in the analysis and not composition or if it does something else. Please can someone let me know whether my thinking is right here and I was to use it, what "none" is actually doing?
My apologies for a lengthy post but I have been saving these questions up!
If anyone has any other helpful hints then I would be grateful to hear them as I have not found much in the literature about applying such a method in this context.
Many thanks in advance.