I'm looking for some comments on different strategies to compare different sets of regulated genes. I've seen different approaches, and would like to get some insight into pros/cons of each.
In my specific application, I have a microarray analysis of gene expression 1 hour after learning, and then a later experiment of tissue collected 24 hours after learning. I'd like to compare the regulation observed at each time point:
- Which genes are similarly regulated at both time points?
- Which genes are distinctly regulated (only at one time point)?
- Overall, how similar is regulation across these different conditions?
I think this is a specific example that would generalize to comparing regulation across groups (say: comparing regulated genes across men vs. women, etc.)
Here are the strategies I'm exploring with what I can tell of the pros and cons:
1) Venn diagram of the two lists of regulated transcripts. I see this a lot, but it seems completely inadequate to me. It assumes that the difference between significant and non-significant is itself significant, which is... well, wrong. A gene could be missing from one list simply because of low power, not because of a meaningful difference in the degree of regulation.
2) Re-run the analysis over both experiments looking for overall regulated transcripts and interaction-significant transcripts. It seems the question of "genes regulated at both time points" requires an overall analysis to see if the transcript is significant when examined over both conditions. And it seems the "distinctly regulated genes" question is really about an interaction (in my case between learning-regulation and time-point). This feels right, but it seems there could be two weak points:
- I worry that 'overall regulated' may not be as stringent as 'regulated at each time point, regardless of the other'. That is, strong regulation at one time point might pull a gene up to overall regulated even if the evidence for regulation at the other time point is weak. The Venn diagram approach seems a bit more stringent in this case for identifying transcripts clearly regulated under both conditions.
- It also seems that the interaction test could end up being low-powered, leaving many transcripts in a gray zone (neither a significant interaction nor overall regulated). One way I've tried to address this is to run the overall analysis only on the set of genes already marked regulated at one or the other time point--thus, fewer comparisons and higher power. Not sure if this makes sense, though.
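To make the interaction idea in strategy 2 concrete, here is a minimal sketch for a single transcript. Everything in it is an assumption for illustration: the log2 expression values are made up, the function names are hypothetical, and the p-values use a normal approximation just to stay within the Python standard library (a real analysis would use a t distribution or a moderated statistic fitted across all transcripts). It tests the log fold change at each time point separately, then tests the difference between the two log fold changes directly.

```python
import math
from statistics import NormalDist, mean, variance


def lfc_and_se(ctrl, learned):
    """Log2 fold change (learned - control) and its standard error."""
    lfc = mean(learned) - mean(ctrl)
    se = math.sqrt(variance(ctrl) / len(ctrl) + variance(learned) / len(learned))
    return lfc, se


def two_sided_p(z):
    """Two-sided p-value from a normal approximation (an assumption, see above)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def compare_time_points(ctrl_1h, learn_1h, ctrl_24h, learn_24h):
    """Per-time-point tests plus the interaction (difference of the two LFCs)."""
    lfc1, se1 = lfc_and_se(ctrl_1h, learn_1h)
    lfc24, se24 = lfc_and_se(ctrl_24h, learn_24h)
    diff = lfc1 - lfc24
    # The two experiments are independent, so the variances add.
    se_diff = math.sqrt(se1 ** 2 + se24 ** 2)
    return {
        "lfc_1h": lfc1,
        "p_1h": two_sided_p(lfc1 / se1),
        "lfc_24h": lfc24,
        "p_24h": two_sided_p(lfc24 / se24),
        "interaction": diff,
        "p_interaction": two_sided_p(diff / se_diff),
    }


# Made-up log2 expression values for one hypothetical transcript.
# The 24 h arrays are noisier, but the fold change is nearly the same.
result = compare_time_points(
    ctrl_1h=[8.0, 8.1, 7.9, 8.0],
    learn_1h=[8.5, 8.6, 8.4, 8.5],
    ctrl_24h=[8.0, 8.6, 7.4, 8.0],
    learn_24h=[8.4, 9.0, 7.8, 8.4],
)

for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the transcript would land in the "1 hour only" region of a Venn diagram, because it passes a 0.05 threshold at 1 hour and not at 24 hours, yet the interaction test gives no evidence that the two fold changes differ. That is exactly the gray zone I'm worried about.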
3) Scatter plot? Both of the above strategies make qualitative judgements, but I'd also like to get an overall feel for the degree to which regulation is similar between time points. So I've also tried creating a scatterplot of the fold changes (FCs) from each time point. Overall, it shows almost no correlation (r = 0.03), even though overall expression levels were very consistent across experiments. This seems like reasonable evidence that regulation is fairly distinct across these time points. The weakness of this approach, though, might be that it includes a huge number of non-regulated transcripts, most of which have FCs that consist entirely of noise, and that could be washing out a real relationship that exists among the genes that are actually regulated. So I've tried color-coding my scatterplot by whether the gene was significant at one, the other, or both time points, so that any subtrends might be obvious--though here there is probably a restriction-of-range problem. I've copied in the scatterplot at the end of this post.
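The "washing out" worry can be checked with a small simulation. This is only a sketch under assumptions I've made up for illustration: 10,000 transcripts, of which 100 are truly regulated with an identical true log fold change at both time points, plus independent measurement noise at each time point. None of these numbers come from my data. It uses only the Python standard library.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)   # fixed seed so the run is repeatable

N_GENES = 10_000         # assumed number of transcripts on the array
N_REGULATED = 100        # assumed number of truly regulated transcripts
TRUE_SD = 0.5            # assumed spread of true log fold changes
NOISE_SD = 0.3           # assumed measurement noise per time point

lfc_1h, lfc_24h = [], []
for i in range(N_GENES):
    # Regulated transcripts share the SAME true effect at both time points;
    # every other transcript has a true effect of zero.
    true_lfc = rng.gauss(0.0, TRUE_SD) if i < N_REGULATED else 0.0
    lfc_1h.append(true_lfc + rng.gauss(0.0, NOISE_SD))
    lfc_24h.append(true_lfc + rng.gauss(0.0, NOISE_SD))

r_all = pearson(lfc_1h, lfc_24h)
r_regulated = pearson(lfc_1h[:N_REGULATED], lfc_24h[:N_REGULATED])

print(f"r over all transcripts:       {r_all:.3f}")
print(f"r over regulated transcripts: {r_regulated:.3f}")
```

Under these assumptions the correlation over all transcripts comes out far lower than the correlation within the truly regulated subset, even though regulation is perfectly shared by construction. One caveat: the simulation knows which transcripts are regulated, whereas in real data that subset has to be picked by significance, which brings back the restriction-of-range problem.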
I'm guessing there is no single "right" way to answer these questions, but I was hoping to collect some feedback and comments to help me make informed choices. Any input is more than welcome. Thanks.
The scatterplot on the left shows the overall expression levels across the two studies, which indicates good reliability of measurement. The one on the right shows the log fold changes (LFCs) for the two studies, coded by whether the transcript was regulated in one study (triangle), the other (square), both (x), or neither (circle). There are very few points well away from both axes, which is where a transcript strongly regulated in both conditions would be (whether consistently or inconsistently regulated). Overall, there is almost no correlation between the LFCs... so my interpretation is that these processes are largely distinct... but is this a reasonable claim?