Hi everyone,
I have been analyzing genetic screen data with edgeR for years -- my favorite toolbox! Now I was hoping to get people's opinions about a special case: analysis of bottlenecked data.
Let's say an assay takes a reagent library through some sort of severe bottleneck, such that the resulting data (on the raw count level) look something like this:
control1=c(100, 150, 140, 300, 1000, 300)
control2=c(120, 140, 150, 350, 1200, 320)
bottlenecked_treatment1=c(0, 0, 0, 0, 300, 10)
bottlenecked_treatment2=c(1, 0, 0, 0, 50, 0)
I have recently been analyzing data like this with the standard edgeR pipeline (with and without calcNormFactors), in conjunction with the camera function to score the behavior of entire groups of reagents, and it appears to be working great. However, are there any potential artifacts or issues I should watch out for? Any words of caution or advice?
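For concreteness, here is a rough sketch of the pipeline I mean (object names like counts, group and guide.sets are placeholders for my full count matrix, sample grouping and per-gene guide indices):

library(edgeR)

# 'counts' is a guides-by-samples matrix of raw counts;
# 'group' is a factor labelling each sample as control or bottlenecked treatment
y <- DGEList(counts, group = group)
y <- calcNormFactors(y)   # or skip this line to normalize by library size only
design <- model.matrix(~ group)

y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = 2)
topTags(res)

# score whole groups of reagents (all guides targeting one gene) with camera;
# 'guide.sets' is a named list of row indices, one element per gene
cam <- camera(y, index = guide.sets, design = design, contrast = 2)
head(cam)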
Many thanks in advance.
Thanks Aaron.
* Counts are gRNA read counts from a genome-wide screen.
* The features I listed were just for illustrative purposes. The actual dataset would have tens of thousands of features.
* The bottleneck is severe, such that counts for a majority of the genes (or perhaps more like 75%) become 0.
* I chose 300 and 50 in that particular position of the vectors to illustrate that genes with high read counts are less likely to drop to 0 in the bottlenecked treatment.
For these data, I have tried NOT using calcNormFactors (i.e., normalizing by library size only) and using calcNormFactors (TMM), and the results are actually very similar (the same genes score as enriched in the bottlenecked treatment in both cases). Only a few dozen genes (from the whole genome) score with a significant FDR. I also ran a parallel analysis with MAGeCK and got the same genes, so all these methods seem to agree; I was just wondering whether there is something I have not thought of.
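Roughly, the comparison looks like this (a sketch only, reusing the counts, group and design placeholders from my first post):

# library-size-only normalization: norm.factors stay at 1
y.lib <- DGEList(counts, group = group)

# TMM: calcNormFactors estimates factors from the non-zero counts
y.tmm <- calcNormFactors(y.lib)

run_qlf <- function(y) {
    y <- estimateDisp(y, design)
    fit <- glmQLFit(y, design)
    topTags(glmQLFTest(fit, coef = 2), n = Inf)$table
}

tab.lib <- run_qlf(y.lib)
tab.tmm <- run_qlf(y.tmm)

# overlap of guides reaching FDR < 0.05 under each normalization
sig.lib <- rownames(tab.lib)[tab.lib$FDR < 0.05]
sig.tmm <- rownames(tab.tmm)[tab.tmm$FDR < 0.05]
length(intersect(sig.lib, sig.tmm))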
Thanks. Please let me know if this is still too vague.
As described, the bottleneck will make it very difficult to normalize.
You cannot assume that most genes are non-DE, because then you would be trying to compute normalization factors from the zero counts, which is obviously nonsensical. In fact, calcNormFactors will actually ignore the zeroes, which means that it computes the normalization factors from the remaining 25% of non-zero counts. This effectively makes the assumption that most of those remaining genes are non-DE. Perhaps you could make some hand-wavy justification for that based on the known roles of those genes, but it involves imposing strong biological prior expectations on your data.

The other choice is to use library size normalization. In the presence of strong DE, this fundamentally changes the nature of the hypothesis being tested (you end up testing for differences in proportions rather than in absolute expression), as I mentioned before. Whether or not this is the lesser of two evils is debatable.
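To illustrate on the toy counts from your first post (a toy example only, using the control1/control2/bottlenecked_treatment vectors defined above):

library(edgeR)
toy <- cbind(control1, control2, bottlenecked_treatment1, bottlenecked_treatment2)

# features with a zero in either sample of a pair give non-finite log-ratios and
# are dropped, so the TMM factors end up being driven entirely by the handful of
# features that survive the bottleneck
calcNormFactors(DGEList(toy))$samples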
In any case, agreement between methods means little in terms of the correctness of the result, as the data are so extreme that all attempts at normalization are likely to be compromised in one way or another.
Well, fortunately I have a number of positive controls in the experiment, and they are correctly appearing at or near the very top of the list, so I have a measure of correctness in that regard. It therefore seems to me that edgeR is doing a good job handling these data overall. Many screens in which a strong selective pressure is applied result in bottlenecking, so this is hardly an unusual case (at least in our lab). I was just trying to see what people thought the "best-case solution" for a computationally suboptimal (but biologically inevitable) scenario was. It seems to me that library size normalization is fine, as long as it is clear that one is testing a difference in proportions ... which means, if I understand you correctly, that I am asking the question "is it significant that 80% of pool 1 is taken up by guide A, while only 2% of pool 2 is taken up by guide A?"
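In code, I suppose that question is just about the per-sample proportions, e.g. on the toy counts from above:

# proportion of each pool taken up by each guide
round(sweep(toy, 2, colSums(toy), "/"), 3)

# e.g. guide 5 makes up ~50% of each control pool but ~97-98% of each
# bottlenecked pool
round(toy[5, ] / colSums(toy), 3)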
For this type of experiment, negative controls are equally if not more valuable (e.g., non-targeting CRISPR guides). These would really tell you whether you're normalizing correctly. In particular, you could immediately distinguish between the two extreme possibilities: that the 75% of genes with zero counts are genuinely downregulated and the remaining 25% are non-DE; or that the 75% are non-DE and the 25% are strongly upregulated, to the point that the coverage of all other genes is suppressed (which may be possible in long time courses after knocking out anti-proliferative genes). Or, of course, anything in between. Positive controls are less helpful here, as they can't tell you much about the correctness of the log-fold changes further down the DE list.
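If you do have non-targeting guides, one possible way to use them (a sketch only; 'nontargeting' is a hypothetical logical vector marking those rows of the DGEList y) is to estimate the normalization factors from them alone and carry the factors over to the full object:

# compute TMM factors from the non-targeting guides only, keeping the
# library sizes of the full dataset so the factors remain compatible
ctrl <- calcNormFactors(y[nontargeting, , keep.lib.sizes = TRUE])
y$samples$norm.factors <- ctrl$samples$norm.factors

This shifts the normalization assumption from "most genes are non-DE" to "the non-targeting guides are non-DE", which may be easier to justify here.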
In any case, you should be clear that you are testing for a difference in proportions. This is not always intuitive and it may not even be the scientific question of interest. Indeed, a non-DE gene can happily exhibit significant changes in proportions due to DE in other genes. I've had a few exchanges like this:
Collaborator: So, this gene looks like it's upregulated upon treatment X.
Me: Well, the proportion of reads assigned to this gene increases in X.
Collaborator: So that means that the expression goes up?
Me: No. The proportion goes up.
Collaborator: But what about the expression?
Me: I can't tell you that.
Collaborator: (pause) You're pretty useless, aren't you?
Haha. And you’re like: not useless at all, only precise!
Totally understood and agreed about the proportions. One has to be comfortable with that scientific question. For the particular experiment I am analyzing, it is actually appropriate, as cells are separated into two pools by FACS that are then directly compared to each other. I am happy to just get a sense of the change in proportions; it's really all I can hope for in this setting. Thanks for your help, Aaron.