Question: edgeR: Should there be a normalization step for CRISPR screens? How to deal with bottlenecked screens?
11 months ago by
knaxerova10
United States
knaxerova10 wrote:

Hi everyone,

I am wondering about the edgeR normalization step (calcNormFactors) for CRISPR screens. The edgeR user guide says that TMM normalization is recommended for RNA-Seq data because "the highly expressed genes can consume a substantial proportion of the total library size, causing the remaining genes to be under-sampled in that sample. Unless this RNA composition effect is adjusted for, the remaining genes may falsely appear to be down-regulated in that sample." -- the same applies for CRISPR enrichment screens, where in many cases individual guides can take over large fractions of the library (in comparison to the starting point). However, in most of the case studies I have seen for CRISPR screens, no normalization is applied. What is the reason for this?

A related question concerns enrichment screens with a substantial bottleneck. For example, what would be the best way to deal with FACS screens where a small number of cells has been sorted from a complex library? Most guides will be well-represented in the baseline sample, but will have zero counts in the sorted sample, where only a few guides enrich massively. What kind of normalization (if any) should/could be applied? What is the best way to set up the analysis in edgeR in such situations?

Any thoughts would be much appreciated!

Thank you.

Kamila


Which case studies are you referring to? These?

Yes. Specifically the approach shown in section 6. Thanks!

11 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

In lieu of someone more qualified, I'll have a stab at answering your questions.

For your first question: TMM normalization in calcNormFactors relies on the assumption that most genes (or guides, or whatever features you're giving it) are not DE between samples. This may not be true for small screens with low numbers of guides, especially if those guides were chosen to target genes that are relevant to the phenotype of interest - and thus are more likely to be DE. In such cases, it may be better to perform library size normalization, given that TMM normalization will not be accurate when its assumption is violated. This will correct for sequencing depth at least. However, the use of library size normalization requires you to accept that you are now testing for differential proportions rather than differential abundance of the guides.
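To make the "differential proportions" point concrete, here is a minimal sketch with made-up counts (the guide names and numbers are hypothetical, not taken from any real screen). It assumes four guides keep the same abundance while one guide enriches 16-fold, and then compares proportions, which is effectively what library size normalization does.

```python
# Toy guide counts (hypothetical numbers, chosen only for illustration).
# Guides g1..g4 are assumed unchanged in abundance; g5 enriches 16-fold.
baseline = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "g5": 100}
sorted_pool = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "g5": 1600}


def proportions(counts):
    """Scale each count by the library size (the sum of all counts)."""
    total = sum(counts.values())
    return {guide: count / total for guide, count in counts.items()}


p_base = proportions(baseline)
p_sort = proportions(sorted_pool)

# Ratio of proportions: what a comparison sees after scaling by library size only.
ratio = {guide: p_sort[guide] / p_base[guide] for guide in baseline}

for guide, value in ratio.items():
    print(guide, round(value, 3))
```

In this toy case the unchanged guides appear about 4-fold depleted and the enriched guide appears about 4-fold (not 16-fold) enriched, because the dominant guide has taken over most of the library. That is the composition effect the user guide describes, and it is why results under library size normalization have to be read as changes in proportion.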

For your second question: again, this depends on whether you can assume that the majority of guides are non-DE in the sorted cells. If not, you cannot use TMM normalization. Library size normalization also sounds problematic if a few guides are dominating the library, and even more so if those guides are DE. Presumably the sorted cells undergo separate library preparation and sequencing from the baseline samples, so you can't use information from the latter to help normalize the former. I also assume that you don't have dedicated "house-keeping" guides that are expected to be present and non-DE in the sorted cells and that you can use for normalization. The only solution I can think of is to sort more cells and sequence more deeply to obtain non-zero counts for more guides.

Of course, there is the other issue of the number of sorted cells. If the size of the sorted population decreases over the course of the screen, how should that be handled in the normalization procedure? You'd probably need to assume that most guides do not affect the viability of the sorted subset, which would allow you to assume that changes in the size of the sorted subset are uninteresting and can be normalized out. To do that, you would need enough guides with non-zero counts, hence the suggestion for increased sequencing.

11 months ago by
knaxerova10
United States
knaxerova10 wrote:

Thanks a lot Aaron! I totally agree with you: for small screens TMM may not be suitable, but I think it is a good choice for genome-wide screens. The case studies we discussed (which are also in the edgeR user guide) are genome-wide screens and they don't apply TMM, but I think I will go ahead and use it. I think it makes a lot of sense, particularly if you have a few very potently enriching guides in your screen.

Just to make sure I understand you correctly: what do you mean by "performing" library normalization? This is something edgeR does automatically and in every case, correct? So in the classic edgeR pipeline, there is no choice between library normalization and no library normalization?

Thanks also for grappling with my second question. Unfortunately, I don't have the option of sequencing more cells. I already have to sort for many hours to get the few cells that I've got. :) Actually, just using "regular edgeR" (with automatic library normalization) and gene set analysis with roast/camera gives pretty good results, I have to say. It is true that the abundances of guides that did make it into the sorted samples are overblown with library normalization, but it still works on a gene set level because it means something if all guides belonging to a gene have high abundances. The other thing one could do (and I am curious what you think of this) would be to do repeated random sampling from the counts of the starting population. Draw guides x times (with x corresponding to the number of sorted cells) and calculate how often one gets the observed enrichment by chance. This would circumvent the need for any direct comparison of libraries.

"Library size normalization" refers to the use of the library size only for scaling normalization of each sample. This is equivalent to running edgeR without using calcNormFactors, i.e., normalization factors set to 1. Remember that edgeR uses the effective library sizes (product of the library size and normalization factor for each sample) for normalization, so if the normalization factors are equal to 1, you're basically using the library size itself. In contrast, when using TMM normalization via calcNormFactors, the normalization factors will usually differ from unity after adjusting for composition biases. This results in effective library sizes that are different from the original library sizes, which is why I don't refer to this as library size normalization.
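As a rough sketch of the arithmetic described above (plain Python rather than edgeR itself), the effective library size is just the product of the library size and the normalization factor. The function name, library sizes, and normalization factors below are all hypothetical; in particular, the factors are invented for illustration and are not the output of calcNormFactors.

```python
def effective_library_sizes(lib_sizes, norm_factors=None):
    """Return library size * normalization factor for each sample.

    With no factors supplied, every factor is taken to be 1, which
    corresponds to plain library size normalization.
    """
    if norm_factors is None:
        norm_factors = [1.0] * len(lib_sizes)
    return [size * factor for size, factor in zip(lib_sizes, norm_factors)]


# Hypothetical total counts for a baseline sample and a sorted sample.
lib_sizes = [2_000_000, 1_500_000]

# Library size normalization: factors of 1, so nothing changes.
lib_only = effective_library_sizes(lib_sizes)

# Made-up factors standing in for a composition adjustment.
adjusted = effective_library_sizes(lib_sizes, [0.8, 1.25])

print(lib_only)
print(adjusted)
```

With factors of 1 the effective library sizes equal the raw library sizes. Once the factors differ from 1, the scaling applied to each sample no longer matches its raw library size, which is the distinction being drawn here.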

Regarding your sorted cell data: library size normalization seems like it may be the least-worst choice of the lot. However, as I said before, you will have to keep in mind that you are testing for differential proportions, which requires some care in how you interpret the results. The fact that all your high-abundance guides correspond to the same genes is irrelevant to the performance of the normalization procedure. All it means is that your different guides for the same gene are consistent with each other, but for all we know, they could be consistently wrong given that any errors in scaling normalization would affect all genes.

Regarding your sampling scheme: this won't account for biases in guide sequencing, or for biological variability in the screening process, which was the whole point of using edgeR to do this analysis in the first place. As your sampling operates across guides, this would be particularly compromised by correlations between guides that target the same gene, which would alter the null distribution for the guide count. I would stick with using edgeR, despite the flaws with library size normalization.

Thanks! Regarding the sorted cells, it sounds like what I have already been doing (edgeR with library normalization) is the best approach then, even if it's not ideal... I wonder whether there is a way to identify the potential "errors in scaling normalization" that you are referring to?

With the sampling scheme, I believe it could potentially be set up in a way that makes sense. The experiment is such that there is a complex library which is separated into two pools by FACS: one pool is very large, and the other is the small population I was referring to. This is done in replicate with separate infections, so every replicate of the small sorted cell population has its own built-in reference (the large population from which it was separated). One would basically be testing the null hypothesis that the small population is sampled at random from the large population. It seems like this experimental design would account for biases in guide sequencing or variability in the screening process.


For your first question, not in general. If we were able to measure systematic biases in scaling normalization, then by definition, we would be able to adjust our normalization procedure until those biases are zero, thus yielding an accurate normalization procedure that we could have used in the first place.

For your sampling scheme: you seem to be testing a different null hypothesis than the one edgeR is targeting. Were you using edgeR to compare the sorted cells to the baseline? Even so, the null hypotheses are different, as edgeR is making guide-specific inferences rather than a general statement about the randomness of the subset. In any case, my points about variability and correlations still apply. Your sampling scheme only considers technical variability/Poisson noise, and not variability between replicate runs of the screen or time points. Random sampling also assumes there are no dependencies between guides, which is not going to be true (e.g., guides that target genes that are not involved in your phenotype will exhibit dependencies without violating the null hypothesis).

Yes, the comparisons I am doing with edgeR are sorted cells vs. the pool from which they were separated. The main question I have is which guides are differentially enriched in the sorted pool (it's really impossible to look for dropouts). My current (if not ideal) approach is to create a DGEList object, filter low-abundance reads, estimate dispersions and run camera/roast. I also extract fold changes and scale them to a mean of zero. This approach seems to be working from a biologist's perspective (the known positive controls in the screen come up on top), but if there is anything that can be improved, I would love to know. Thank you!

PS: you convinced me with the random sampling.