I am working with an RNA-seq dataset where we're looking at the potential synergistic effects of two genes (by definition, the output in a synergistic system suggests that the two genes, when acting together as proteins on the same cell, will have a greater effect than the sum of their individual effects on a cell). All 4 mouse samples are stimulated with the proteins expressed by these genes:
Wild Type
Single KO of gene 1
Single KO of gene 2
Double KO
Since I'm interested in asking "What genes are differentially expressed as a result of the synergistic action of gene 1 and gene 2", I've come to the conclusion that I need to take my WT sample (which should represent gene expression from both the synergistic and individual effects of the proteins) and do a differential analysis with both SKO1 and SKO2. The DKO sample expression can also be added in order to account for the basal levels of gene expression without the two stimuli; the differential "equation" thus becomes:
(WT+DKO) - (SKO1 + SKO2) = Synergy
*I am not taking the average of SKO1 and SKO2 on purpose; to note synergistic effects, we must compare to an additive effect. I am specifically interested in the genes expressed due to synergy, not the individual additions of the stimuli.
I've thought of a few ways to attempt this (my analysis is in edgeR but happy to work with DESeq2/other packages as well), but I am worried that these methods are not technically sound. One method is to combine the raw counts of SKO1 and SKO2 as well as those of WT and DKO, then run dispersion/glmFit/makeContrasts. I've tried this pre- and post-normalization, and my outputs are similar (and makes biological sense as the one gene we know is synergistic is one of the top genes). However, even with logical outputs and after going through the edgeR manual/other posts, I'm not quite sure what differs in the analysis when I do combine my counts (especially because the “library sizes” of the combined counts are almost twice as big as the individual samples).
Is there any precedent for combining counts of non-replicate samples and looking at a differential output?
Would it make any sense to do this at another stage (ex. FASTQ file/reads/before generation of counts) if combining counts is just grossly wrong and flawed? If not, are there any other ways I could get at this question of synergy from differential analysis with this data?
Aaron,
Thanks for your detailed and helpful response; it seems my design matrix was not set up correctly to create a contrast by adding 2 different groups, but it’s resolved now. Looking at my results, it appears my initial “manual” work-around combination of counts (adding SKO1/SKO2 and DKO/WT columns together) results in a similar list of genes with similar log-fold changes to setting it up in makeContrasts.
I’m curious, however, as to why you suggest a contrast where I’m subtracting out the background from the additive effect - this gives me the genes that are theoretically synergistically “activated” as negative log fold changes based on the following logic:
WT=Synergy + Additive effect of SKO1/SKO2 under 1 Basal background
DKO=Basal background
(SKO1 + SKO2) – (DKO + WT) = (Additive+Basal+Basal) - (Basal+ Basal + Synergy + Additive) = –Synergy
So I would think that my initial proposal is still what I want (WT+DKO) – (SKO1+SKO2)? (not that I can’t just as easily note in your case that negative log fold changes = Synergistically “activated” genes while positive log fold changes = Synergistically “repressed” genes)
Thanks,
Tima
I would advise against using that workaround in general; adding counts assumes that you have the same number of samples in each group and all samples have similar library sizes, etc. You also reduce the number of residual d.f. in your model, thus reducing power.
Well, I interpret the "effect" as that of the treatment, which in this case is the knockout of each gene. From a strictly experimental perspective, this is most accurate; an experiment that really tests the effect of an active gene would over-express it. For example, in your case, it could be something like, "knock out both genes (or find a system in which neither were endogenously expressed) and then rescue the expression of one/both of them with plasmids or lentiviruses". The distinction between the two possible experiments makes it worth being precise, at least when you're writing up your methods - you can wax lyrical about synergistic gene activity in the discussion.