edgeR analysis of bottlenecked genetic screen data
knaxerova

Hi everyone, 

I have been analyzing genetic screen data with edgeR for years -- my favorite toolbox! Now I was hoping to get people's opinions about a special case: analysis of bottlenecked data. 

Let's say an assay takes a reagent library through some sort of severe bottleneck, such that the resulting data (on the raw count level) look something like this:

control1=c(100, 150, 140, 300, 1000, 300)

control2=c(120, 140, 150, 350, 1200, 320)

bottlenecked_treatment1=c(0, 0, 0, 0, 300, 10) 

bottlenecked_treatment2=c(1, 0, 0, 0, 50, 0) 

I have recently been analyzing data like this with the standard edgeR pipeline (with and without calcNormFactors), in conjunction with the camera() function to score the behavior of entire groups of reagents, and it appears to be working great. However, I am wondering whether there are any potential artifacts or issues I should watch out for. Any words of caution or advice?
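To be concrete, my analysis looks roughly like the sketch below (the exact calls may differ from run to run; here I show the quasi-likelihood route, and counts, group and guide.sets are placeholder names for my actual objects):

library(edgeR)

# counts: guide-by-sample matrix of raw counts; group: control vs treatment
y <- DGEList(counts, group = group)
y <- calcNormFactors(y)                 # omitted in the library-size-only runs

design <- model.matrix(~group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = 2)        # per-guide enrichment/depletion

# camera() on sets of guides, one set per gene
gene.res <- camera(y, index = guide.sets, design = design, contrast = 2)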

Many thanks in advance.

 

Aaron Lun

The question and context are too vague to give a useful answer. Sufficient detail would include:

  • What is the nature of the counts? I presume that counts represent the abundance of a particular mutant.
  • How many features do you really have? Your example suggests 6 features - surely this is not right.
  • What is the nature and severity of the bottleneck? Is it expected to cause DE in a majority of genes? Is it expected to decrease the complexity of the library (and thus increase the rate of PCR duplicates)?
  • Is the difference between 300 and 50 meant to convey something about the variability of the data as a result of the bottleneck, or is it just the result of picking some random numbers for the example?

The major issue is probably how you're normalizing. If you're not using calcNormFactors, what are you using? Keep in mind that TMM (and every other normalization method of that kind) fails when a majority of genes are DE. However, this doesn't mean that library size normalization is any better. In fact, if you have many DE genes, a downstream edgeR analysis based on library size normalization isn't doing differential expression testing (or, for screens, differential abundance testing) anymore; rather, it is testing for differences in proportions between conditions, which is usually not what was intended.
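To make the contrast concrete, here is a sketch of the two options, assuming a DGEList y built from your counts (placeholder name):

# TMM: relies on the assumption that most guides are not DE.
y.tmm <- calcNormFactors(y, method = "TMM")

# Library size only: leave the normalization factors at their default of 1,
# so downstream tests effectively compare the proportion of each library
# taken up by each guide.
y.lib <- y
y.lib$samples$norm.factors <- rep(1, ncol(y.lib))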

So I guess the best answer I can give you is: maybe?


Thanks Aaron.

* Counts represent gRNA abundances in a genome-wide screen.

* The features I listed were just for illustrative purposes. The actual dataset would have tens of thousands of features.

* The bottleneck is severe, such that counts for a majority of the genes (or perhaps more like 75%) become 0.

* I chose 300 and 50 in that particular position of the vector to illustrate that genes with high read counts are less likely to go to 0 in the bottlenecked treatment.

For these data I have tried NOT using calcNormFactors (i.e. normalizing by library size alone) and using calcNormFactors (TMM), and the results are actually very similar: the same genes score as enriched in the bottlenecked treatment in both cases. Only a few dozen genes (out of the whole genome) score with a significant FDR. I also ran a parallel analysis with MAGeCK and got the same genes, so all of these methods seem to agree; I was just wondering whether there is something I have not thought of.
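The comparison itself was nothing fancy; something along these lines, where cam.tmm and cam.lib are the camera() result tables from the two runs (placeholder names):

sig.tmm <- rownames(cam.tmm)[cam.tmm$FDR < 0.05]
sig.lib <- rownames(cam.lib)[cam.lib$FDR < 0.05]
length(intersect(sig.tmm, sig.lib))   # essentially the same few dozen genes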

Thanks. Please let me know if this is still too vague.


As described, the bottleneck will make it very difficult to normalize.

You cannot invoke the usual assumption that most genes are non-DE, because then you'd be trying to compute normalization factors from the zero counts, which is obviously nonsensical. In fact, calcNormFactors will ignore the zeroes, which means that it computes the normalization factors from the remaining 25% of non-zero counts. This effectively assumes that most of those remaining genes are non-DE. Perhaps you could make some hand-wavy justification for that based on the known roles of those genes, but this involves imposing some strong biological prior expectations on your data.
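As a quick sanity check on how much data that actually leaves, you could count the guides that are non-zero in every sample (a sketch, assuming the raw matrix is in counts):

# Guides with a zero count in a sample are dropped from that sample's
# trimmed M-value calculation, so only these remain fully usable.
usable <- rowSums(counts == 0) == 0
mean(usable)   # fraction of guides that can inform every pairwise comparison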

The other choice is to use library size normalization. In the presence of strong DE, this fundamentally changes the nature of the hypothesis being tested, as I mentioned before. Whether or not this is the lesser of two evils is debatable.

In any case, agreement between methods means little in terms of the correctness of the result, as the data are so extreme that all attempts at normalization are likely to be compromised in one way or another.


Well, fortunately I have a number of positive controls in the experiment, and they correctly appear at or near the very top of the list, so I have some measure of correctness in that regard. It therefore seems to me that edgeR is doing a good job of handling these data overall. Many screens in which a strong selective pressure is applied result in bottlenecking, so this is hardly an unusual case (at least in our lab). I was just trying to see what people thought the "best-case solution" for a computationally suboptimal (but biologically inevitable) scenario was. It seems to me that library size normalization is fine, as long as it is clear that one is testing a difference in proportions... which means, if I understand you correctly, that I am asking the question "is it significant that 80% of pool 1 is taken up by guide A, while only 2% of pool 2 is taken up by guide A"?
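As a toy version of that question (made-up numbers, and just a crude two-sample proportion test rather than anything edgeR actually fits):

# guide A: 800 of 1000 reads in pool 1 (80%) vs 20 of 1000 reads in pool 2 (2%)
prop.test(x = c(800, 20), n = c(1000, 1000))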


For this type of experiment, negative controls are equally if not more valuable (e.g., non-targeting CRISPR guides). These would really tell you whether you're normalizing correctly. In particular, you could immediately distinguish between the two extreme possibilities: that the 75% of genes are genuinely downregulated and the remaining 25% are non-DE; or that the 75% are non-DE and the 25% are so strongly upregulated that the coverage of all other genes is suppressed (which may be possible in long time courses after knocking out anti-proliferative genes). Or, of course, anything in between. Positive controls are less helpful here, as they can't tell you much about the correctness of the log-fold changes further down the DE list.
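If you do have non-targeting guides, one option (just a sketch, assuming a logical vector is.neg marking them in the DGEList y; you haven't described doing this, so treat it as a suggestion) is to estimate the normalization factors from the controls alone and carry them over to the full object:

# TMM on the non-targeting guides only, keeping the full library sizes,
# then apply the resulting factors to all guides.
neg <- calcNormFactors(y[is.neg, , keep.lib.sizes = TRUE])
y$samples$norm.factors <- neg$samples$norm.factors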

In any case, you should be clear that you are testing for a difference in proportions. This is not always intuitive and it may not even be the scientific question of interest. Indeed, a non-DE gene can happily exhibit significant changes in proportions due to DE in other genes. I've had a few exchanges like this:

Collaborator: So, this gene looks like it's upregulated upon treatment X.

Me: Well, the proportion of reads assigned to this gene increases in X.

Collaborator: So that means that the expression goes up?

Me: No. The proportion goes up.

Collaborator: But what about the expression?

Me: I can't tell you that.

Collaborator: (pause) You're pretty useless, aren't you?


Haha. And you’re like: not useless at all, only precise!

Totally understood and agreed about the proportions. One has to be comfortable with that scientific question. For the particular experiment I am analyzing, it is actually appropriate, as cells are separated by FACS into two pools that are then directly compared to each other. I am happy to just get a sense of the change in proportions; it's really all I can hope for in this setting. Thanks for your help, Aaron.

