Question: Combining counts from non-replicate samples in RNA-Seq analysis for a synergistic system
0
5 months ago by
Tima Karginov0 wrote:

I am working with an RNA-seq dataset where we're looking at the potential synergistic effects of two genes (by definition, the output in a synergistic system suggests that the two genes, when acting together as proteins on the same cell, will have a greater effect than the sum of their individual effects on a cell). All 4 mouse samples are stimulated with the proteins expressed by these genes:

1. Wild Type

2. Single KO of gene 1

3. Single KO of gene 2

4. Double KO

Since I'm interested in asking "What genes are differentially expressed as a result of the synergistic action of gene 1 and gene 2", I've come to the conclusion that I need to take my WT sample (which should represent gene expression from both the synergistic and individual effects of the proteins) and do a differential analysis with both SKO1 and SKO2. The DKO sample expression can also be added in order to account for the basal levels of gene expression without the two stimuli; the differential "equation" thus becomes:

(WT+DKO) - (SKO1 + SKO2) = Synergy

*I am not taking the average of SKO1 and SKO2 on purpose; to note synergistic effects, we must compare to an additive effect. I am specifically interested in the genes expressed due to synergy, not the individual additions of the stimuli.

I've thought of a few ways to attempt this (my analysis is in edgeR but happy to work with DESeq2/other packages as well), but I am worried that these methods are not technically sound. One method is to combine the raw counts of SKO1 and SKO2 as well as those of WT and DKO, then run dispersion/glmFit/makeContrasts. I've tried this pre- and post-normalization, and my outputs are similar (and makes biological sense as the one gene we know is synergistic is one of the top genes). However, even with logical outputs and after going through the edgeR manual/other posts, I'm not quite sure what differs in the analysis when I do combine my counts (especially because the “library sizes” of the combined counts are almost twice as big as the individual samples).

1. Is there any precedent for combining counts of non-replicate samples and looking at a differential output?

2. Would it make any sense to do this at another stage (ex. FASTQ file/reads/before generation of counts) if combining counts is just grossly wrong and flawed? If not, are there any other ways I could get at this question of synergy from differential analysis with this data?

modified 5 months ago by Aaron Lun25k • written 5 months ago by Tima Karginov0
Answer: Combining counts from non-replicate samples in RNA-Seq analysis for a synergisti
4
5 months ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

This should be a standard test for a non-zero interaction term.

groups <- rep(c("WT", "SKO1", "SKO2", "DKO"), each=4)
design <- model.matrix(~0 + groups)
colnames(design) <- sub("groups", "", colnames(design)) # just to clean up

# ... Normalize, estimate dispersions, etc. ...
con <- makeContrasts((SKO1 - WT) - (DKO - SKO2), levels=design)


The contrast asks whether or not the effect of knocking out gene 1 in a WT background is the same as the effect of the same knockout in a gene 2-KO background. The log-fold change reported by edgeR corresponds to the interaction term between the two knockouts, and will be zero if the two knockouts have additive effects (i.e., their effect is independent of the background in which they occur, at least in the context of these two genes).

Incidentally, the same contrast can be formulated as:

con <- makeContrasts((SKO2 - WT) - (DKO - SKO1), levels=design)


... or less intuitively:

con <- makeContrasts((SKO1 + SKO2) - (DKO + WT), levels=design)


... which is simply the reverse of what you proposed in your post.

I should point out that the more conventional approach to formulating an interaction model is to do:

is.gene1 <- groups %in% c("DKO", "SKO1")
is.gene2 <- groups %in% c("DKO", "SKO2")
design <- model.matrix(~is.gene1 * is.gene2)


... such that the last term represents the interaction. While this is easy enough in this simple case, I tend to find that this formulation gets quite complicated to interpret once you have more factors in play.

Aaron,

Thanks for your detailed and helpful response; it seems my design matrix was not set up correctly to create a contrast by adding 2 different groups, but it’s resolved now. Looking at my results, it appears my initial “manual” work-around combination of counts (adding SKO1/SKO2 and DKO/WT columns together) results in a similar list of genes with similar log-fold changes to setting it up in makeContrasts.

I’m curious, however, as to why you suggest a contrast where I’m subtracting out the background from the additive effect - this gives me the genes that are theoretically synergistically “activated” as negative log fold changes based on the following logic:

WT=Synergy + Additive effect of SKO1/SKO2 under 1 Basal background

DKO=Basal background

(SKO1 + SKO2) – (DKO + WT) = (Additive+Basal+Basal) - (Basal+ Basal + Synergy + Additive) = –Synergy

So I would think that my initial proposal is still what I want (WT+DKO) – (SKO1+SKO2)? (not that I can’t just as easily note in your case that negative log fold changes = Synergistically “activated” genes while positive log fold changes = Synergistically “repressed” genes)

Thanks,

Tima

1

Looking at my results, it appears my initial “manual” work-around combination of counts (adding SKO1/SKO2 and DKO/WT columns together) results in a similar list of genes with similar log-fold changes to setting it up in makeContrasts.

I would advise against using that workaround in general; adding counts assumes that you have the same number of samples in each group and all samples have similar library sizes, etc. You also reduce the number of residual d.f. in your model, thus reducing power.

So I would think that my initial proposal is still what I want (WT+DKO) – (SKO1+SKO2)?

Well, I interpret the "effect" as that of the treatment, which in this case is the knockout of each gene. From a strictly experimental perspective, this is most accurate; an experiment that really tests the effect of an active gene would over-express it. For example, in your case, it could be something like, "knock out both genes (or find a system in which neither were endogenously expressed) and then rescue the expression of one/both of them with plasmids or lentiviruses". The distinction between the two possible experiments makes it worth being precise, at least when you're writing up your methods - you can wax lyrical about synergistic gene activity in the discussion.