I am using edgeR to analyze shRNA data. So far I have a DE p-value per shRNA. Now I want to summarize the results per gene rather than per shRNA. What is the "official" way to do this?
There are ad hoc approaches, for example: pick the significant shRNAs and then summarize per gene, taking care of genes whose probes show conflicting directions of change.
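To make that ad hoc approach concrete, here is a minimal sketch in plain Python (not edgeR code). It assumes the per-shRNA results have already been exported as (gene, shRNA, logFC, p-value) tuples. The function name `summarize_per_gene`, the 0.05 cutoff, and the "conflict" label are my own illustrative choices, not anything edgeR provides.

```python
from collections import defaultdict

def summarize_per_gene(shrna_results, alpha=0.05):
    """Count significant up/down shRNAs per gene and flag conflicts.

    shrna_results: iterable of (gene, shrna_id, log_fc, p_value) tuples
    (hypothetical export of the per-shRNA DE table).
    alpha: significance cutoff (an assumed, arbitrary default).
    """
    by_gene = defaultdict(list)
    for gene, _shrna_id, log_fc, p_value in shrna_results:
        by_gene[gene].append((log_fc, p_value))

    summary = {}
    for gene, rows in by_gene.items():
        up = sum(1 for lfc, p in rows if p < alpha and lfc > 0)
        down = sum(1 for lfc, p in rows if p < alpha and lfc < 0)
        if up and down:
            call = "conflict"   # significant probes disagree in direction
        elif up:
            call = "up"
        elif down:
            call = "down"
        else:
            call = "ns"         # no significant shRNA for this gene
        summary[gene] = {"n_shrna": len(rows), "up": up, "down": down, "call": call}
    return summary

results = [
    ("GENE1", "sh1", 1.2, 0.001),
    ("GENE1", "sh2", 0.9, 0.020),
    ("GENE1", "sh3", 0.1, 0.700),
    ("GENE2", "sh4", 1.5, 0.010),
    ("GENE2", "sh5", -1.1, 0.030),
]
per_gene = summarize_per_gene(results)
```

Note that genes with different numbers of shRNAs are handled naturally here, but the counts throw away the effect sizes and the choice of cutoff is arbitrary, which is why I am looking for something better.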
I was hoping to do something more elegant, perhaps a linear model fitted per gene with a formula like ~probe+condition, so that I can read the DE off the condition coefficient with the probe effect removed. edgeR includes a GLM feature that I hoped would help with this, but I could not figure out how. Different genes have different numbers of shRNAs, so a table with rows = "genes" and columns = "conditions repeated for every possible probe (up to the max number of shRNA probes in a gene)" ends up with a lot of NAs, and edgeR does not seem to be able to deal with them. It is also odd because edgeR treats the columns as samples and ends up normalizing the table... It is a mess.
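To show what I mean by ~probe+condition per gene, here is a rough sketch in plain Python. It is not edgeR and not a negative binomial GLM: it assumes log-scale values, two conditions coded 0/1, and an ordinary least-squares fit. Keeping the data in long format (one row per gene, probe and sample) avoids the NA-padded wide table. The condition coefficient is obtained by centering within each probe, which gives the same estimate as fitting the probe factor explicitly. The function name and the toy numbers are made up for illustration.

```python
from collections import defaultdict

def condition_effect_per_gene(records):
    """Least-squares condition coefficient of y ~ probe + condition, per gene.

    records: iterable of (gene, probe, condition, log_value) with
    condition coded 0 or 1 (hypothetical long-format input).
    Returns {gene: coefficient, or None if it cannot be estimated}.
    """
    by_gene = defaultdict(lambda: defaultdict(list))
    for gene, probe, cond, y in records:
        by_gene[gene][probe].append((float(cond), float(y)))

    effects = {}
    for gene, probes in by_gene.items():
        sxy = 0.0
        sxx = 0.0
        for obs in probes.values():
            # Center condition and response within each probe; this
            # removes the probe effect from both sides of the regression.
            mean_x = sum(x for x, _ in obs) / len(obs)
            mean_y = sum(y for _, y in obs) / len(obs)
            sxy += sum((x - mean_x) * (y - mean_y) for x, y in obs)
            sxx += sum((x - mean_x) ** 2 for x, _ in obs)
        effects[gene] = sxy / sxx if sxx > 0 else None
    return effects

# Toy data: GENE1 has two probes, GENE2 has three; no NA padding needed.
records = [
    ("GENE1", "p1", 0, 1.0), ("GENE1", "p1", 1, 3.0),
    ("GENE1", "p2", 0, 6.0), ("GENE1", "p2", 1, 8.0),
    ("GENE2", "p3", 0, 2.0), ("GENE2", "p3", 1, 1.0),
    ("GENE2", "p4", 0, 4.0), ("GENE2", "p4", 1, 3.0),
    ("GENE2", "p5", 0, 9.0), ("GENE2", "p5", 1, 8.0),
]
effects = condition_effect_per_gene(records)
```

This only gives a point estimate per gene; it says nothing about significance and ignores the count nature of the data, so it is just an illustration of the model structure I have in mind.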
Any ideas?
I'm not too familiar with shRNA experiments. Can you explain why there are multiple shRNAs per gene?
The shRNAs in this case act like the probes on Affy expression arrays: the more you have, the better your ability to say that what you see (i.e. DE) is real and not just noise. shRNA probes are much noisier than Affy probes because they often bind off target, despite all the effort put into designing them to be specific. This is the main reason why you need many per gene.