Question

summarizing shRNA data per gene with edgeR

0

Entering edit mode

vladot • 0

@vladot-7304

Last seen 9.2 years ago

United States

I am using edgeR for analyzing shRNA data. So far I got the DE pvalue per shRNA. Now I want to summarize the results per gene rather than per shRNA. What is the "official" way to do this?

There are ad hoc things such as for example: pick significant shRNAs and then summarize per gene, taking care of conflicting differentials from probes.

I was hoping to do something more elegant, that might involve a linear model which per genes is something like ~probe+condition, this way I can read off the DE with the probe effect removed from the condition coefficient in a linear model fit per gene. edgeR includes a GLM feature which I was hoping will help with this, but I could not figure out how, because different genes have different number of shRNAs so a table with rows = "genes" and columns = "conditions repeated for every possible probe (up to the max number of shRNA probes in a gene)" ends up with a lot of NAs and edgeR dose not seem to be capable of dealing with this. This is also funny because edgeR thinks of columns as samples and ends normalizing the table... It is a mess.

Any ideas?

edgeR • 2.3k views

ADD COMMENT • link updated 9.2 years ago by Wolfgang Huber ★ 13k • written 9.2 years ago by vladot • 0

0

Entering edit mode

I'm not too familiar with shRNA experiments. Can you explain why there are multiple shRNAs per gene?

ADD REPLY • link 9.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

shRNA in this case act like the probes in the affy expression arrays: the more you have the better your ability to say that what you see (i.e. DE) is real and not just noise. shRNA probes are much noisier than affy probes because they bind off target a lot, despite all the effort in designing them to be specific. This is the main reason why you need many per gene.

ADD REPLY • link 9.2 years ago vladot • 0

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Aaron has described several sensible options for summarizing shRNAs per gene. Generally however we advise people not to average shRNAs if at all possible. This is because some shRNAs are much better than others, and averaging a good one with several bad ones isn't a gain. I would rather that you got a list of DE shRNAs, and then examine the genes they belong to. If you want the "official" line, this is it.

The edgeR glm features are not designed to fit models across rows, so they can't be used for this purpose.

ADD COMMENT • link 9.2 years ago Gordon Smyth 50k

0

Entering edit mode

I do this as well, but I wanted to have some way of ranking the genes. Makes things easier to communicate... shRNA are so noisy that these experiment are never the end of it anyway and genes get considered even if only one probe is DE, if there is additional support for them.

In general I find "cutoffs" suspicious (there is some unpleasant finality to it) and prefer to start from the top and work my down until money runs out :-) Of course it is good to know when the top is full of junk before setting on further validation...

ADD REPLY • link 9.2 years ago vladot • 0

1

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 17 days ago

EMBL European Molecular Biology Laborat…

Summarizing shRNA data per gene is a difficult problem with no good generic solution to my knowledge. There are two papers that made important contributions towards solving it:

Use very many shRNAs per gene ("Ultracomplex Libraries"), trust in the statistics of large numbers, and learn from the data which shRNAs are effective. See e.g. "A Systematic Mammalian Genetic Interaction Map Reveals Pathways Underlying Ricin Susceptibility" by Mike Bassik et al. http://dx.doi.org/10.1016/j.cell.2013.01.030
Tune your algorithm using a 'ground truth', see e.g. "Measuring error rates in genomic perturbation screens: gold standards for human functional genomics" by Traver Hart and Jason Moffat's lab, DOI: 10.15252/msb.20145216

ADD COMMENT • link 9.2 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

More probes is definitely a solution to the noise problem of shRNA and eventually we might do this for a subset.

ADD REPLY • link 9.2 years ago vladot • 0

score 3 · Accepted Answer · 2015-01-28

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 2 hours ago

The city by the bay

There are several options here. The simplest is to take the shRNA with the highest average abundance, and to use that as a representative of the entire gene. However, this doesn't really take advantage of the additional evidence provided by the other shRNAs of that gene.

The second option is to use Simes' method (check out the combineTests function in the csaw package). The null hypothesis here is that none of the shRNAs within a gene are DE. Simes' method will compute a combined p-value for each gene, based on the p-values for all shRNAs within that gene. This combined p-value represents the evidence against the aforementioned null, and incorporates information from all shRNAs in the gene. Genes with low combined p-values are of interest as they will contain at least one DE shRNA.

However, it may not be appropriate in your case, as strong DE for one shRNA can result in a low combined p-value for the entire gene. From what I understand, you are looking for genes that have consistent DE across several shRNAs. That brings me to the third option, which is to treat all shRNAs in a gene as a "gene set" (i.e., shRNAs are the "genes" in the set, and the gene containing the shRNAs is the "gene set"). You can then use methods like ROAST to do a self-contained gene set test. The set will be statistically significant only when the majority of the shRNAs are DE, which might be what you want. Check out ?roast and ?roast.DGEList for more information.

ADD COMMENT • link 9.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

This is an interesting use case for gene set tests that I hadn't considered before.

Also, Simes' method may not be appropriate here for another reason, which is that it does not penalize the case where different shRNAs of the same gene are changing in opposite directions.

How would the strategy of taking the highest-abundance shRNA for each gene as representative compare to simply adding all the shRNA counts to get gene-level counts and analyzing those?

ADD REPLY • link 9.2 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Relative performance would depend on how the counts are distributed between shRNAs of the same gene. For example, the gene-level counts could be dominated by one or two high abundance shRNAs. In such cases, you'd get results similar to those obtained when you only use the highest-abundance shRNA. More generally, summing counts across the entire gene will result in loss of shRNA-specific information. This might not be desirable if you need to check for consistent DE across different shRNAs of the same gene.

ADD REPLY • link 9.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Simes method can be applied to one-sided p-values, which would solve the problem of shRNAs changing in opposite directions.

ADD REPLY • link 9.2 years ago Gordon Smyth 50k

0

Entering edit mode

Nice! This, somewhat tangentially, introduced me to romer and camera as well, which will come handy. Always good to learn something new. Thank you, good sir!

ADD REPLY • link 9.2 years ago vladot • 0

score 2 · Accepted Answer · 2015-01-28

2

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

edgeR and other GLM- and LM-based differential expression packages are designed to fit the same design model to every gene, so you will not be able to use them directly in this way.

However, edgeR has a function called spliceVariants that fits a separate model for each gene to test the interaction between exon and group while estimating dispersions at the gene level. You could treat your shRNAs like "exons", but you don't want an interaction term, you want an additive mode. So you could write a similar function that fits your desired model to each gene to test for condition effects using shRNA ID as a blocking factor.

ADD COMMENT • link 9.2 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

If I remember correctly, the spliceVariants function assumes independence between exons (or shRNAs, in this case). Obviously, this isn't ideal as you'd expect some correlations between shRNAs of the same gene. Also, a purely additive model (i.e., ~shRNA + group) assumes that all shRNAs change in the same manner between groups. This probably won't be true, which will compromise the model fit and violate the distributional assumptions of the associated statistics. More sophisticated designs can overcome both of these issues, but I'm not sure it's worth the effort.

ADD REPLY • link 9.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

I see, so an additive model isn't even the right approach, since different shRNAs might have different responses.

ADD REPLY • link 9.2 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

I like that: the function "hacks" around the problem I had with different probes for different genes, by considering batches of genes with similar number of probes. Neat trick which I will use to implement my original idea. Perhaps the linear model (with or without interaction) might not be entirely appropriate for shRNA, but has its merits too. For one thing it is a little more transparent to me...

ADD REPLY • link 9.2 years ago vladot • 0

1

Entering edit mode

Glad you like the spliceVariants function. I don't think this approach will work for your problem though because positive statistical correlations between shRNAs will wreck the validity of any statistical test you may do. (The correlations don't cause a problem for spliceVariants, because it is doing tests between features within samples, but you are planning to average features and the correlations will accumulate rather than cancel out.)

ADD REPLY • link 9.2 years ago Gordon Smyth 50k