I was wondering if anybody might have some advice on identifying groups of genes that occur together multiple times in a list. I am carrying out time series data analysis and I ran a co-expression analysis, which is really nice, giving us ideas of which genes might be changing in the same or opposite direction over time.
The way it works is that for each gene in a data set (either a DE list or even the entire transcriptome), every other gene gets a correlation coefficient based on how it changes relative to that gene. If the correlation is positive, the gene is flagged as potentially interesting because its pattern is similar to that of the reference gene. Every gene is tested against every other gene. The resulting table lists each 'reference gene' alongside the genes that are positively correlated with it, one pair per row.
So the table will look something like this.
reference    Associated genes
gene1        gene2
gene1        gene15
gene1        gene16
gene1        gene37
gene2        gene3
gene2        gene15
gene2        gene16
gene2        gene37
etc x20,000 genes
Gene 1 is repeated in the first column because it has multiple genes associated with it. For example, if gene 1 goes down over time, there may be 4 genes (as above) that also go down over time, so they are positively correlated with gene 1 and may indicate co-expression.
Gene 2 will be the same, and so on. Of course, gene 2 may also be positively associated with some of the genes associated with gene 1, so genes will be repeated throughout the second column if they are predicted to be associated with several reference genes.
What I would like to do is identify groups of genes in the second column that occur together multiple times across the whole data set. I have no specific genes in mind, just any group that occurs more than once. For example, do genes 2, 15, 16 and 37, which are associated with gene 1, also turn up as a group associated with another reference gene? In the table above, 3 of the genes associated with gene 1 (genes 15, 16 and 37) are also positively associated with gene 2. If those same 3 genes were also associated with another 2 reference genes elsewhere in the data, that would be interesting. The method should be blind to the number of genes in a group and only identify where there are patterns: it could just as well be 10 genes that keep popping up together throughout the data.
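To make the question concrete, here is a minimal sketch of the kind of result I am after, written in plain Python and using only the toy rows from the table above. The function name `shared_groups` and the `min_size` cut-off are my own inventions for illustration. It simply intersects the associated-gene sets of every pair of reference genes, which is probably too slow for 20,000 genes, so please treat it as a description of the desired output rather than a proposed method.

```python
from collections import defaultdict
from itertools import combinations

# Toy rows copied from the example table: (reference gene, associated gene)
pairs = [
    ("gene1", "gene2"), ("gene1", "gene15"), ("gene1", "gene16"), ("gene1", "gene37"),
    ("gene2", "gene3"), ("gene2", "gene15"), ("gene2", "gene16"), ("gene2", "gene37"),
]

def shared_groups(pairs, min_size=2):
    """Map each shared group of associated genes to the reference genes sharing it.

    min_size is a hypothetical cut-off: overlaps smaller than this are ignored.
    """
    # Collect the set of associated genes for every reference gene
    assoc = defaultdict(set)
    for ref, gene in pairs:
        assoc[ref].add(gene)

    # Intersect every pair of reference genes and keep overlaps that are large enough
    groups = defaultdict(set)
    for ref_a, ref_b in combinations(sorted(assoc), 2):
        common = frozenset(assoc[ref_a] & assoc[ref_b])
        if len(common) >= min_size:
            groups[common].update((ref_a, ref_b))
    return dict(groups)

for group, refs in shared_groups(pairs).items():
    print(sorted(group), "shared by", sorted(refs))
# prints: ['gene15', 'gene16', 'gene37'] shared by ['gene1', 'gene2']
```

With the real data I would hope to see groups shared by many reference genes, not just two.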
Is a similarity matrix something one might think about, and if so, is it possible to build one from columns of characters? Any advice on how to go about the above, as well as some suggestions of code to get started with, would be much appreciated! The resulting table I get is a data frame. The genes are Ensembl IDs.
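For what it is worth, this is roughly what I imagine a similarity matrix on character columns could look like: a Jaccard index (size of the overlap divided by size of the union) between the associated-gene sets of each pair of reference genes. This is only a sketch under my own assumptions. The name `jaccard_matrix` is made up, the rows are the toy example again, and I do not know whether Jaccard is the right measure here. Because it works on sets of strings, Ensembl IDs should behave the same way as the toy names.

```python
from collections import defaultdict

# Toy rows copied from the example table: (reference gene, associated gene)
pairs = [
    ("gene1", "gene2"), ("gene1", "gene15"), ("gene1", "gene16"), ("gene1", "gene37"),
    ("gene2", "gene3"), ("gene2", "gene15"), ("gene2", "gene16"), ("gene2", "gene37"),
]

def jaccard_matrix(pairs):
    """Return the reference genes and a nested dict of pairwise Jaccard indices."""
    # Collect the set of associated genes for every reference gene
    assoc = defaultdict(set)
    for ref, gene in pairs:
        assoc[ref].add(gene)

    refs = sorted(assoc)
    # Jaccard index = |A & B| / |A | B|; every set has at least one gene,
    # so the union is never empty
    matrix = {
        a: {b: len(assoc[a] & assoc[b]) / len(assoc[a] | assoc[b]) for b in refs}
        for a in refs
    }
    return refs, matrix

refs, matrix = jaccard_matrix(pairs)
print(matrix["gene1"]["gene2"])  # 3 shared genes out of 5 distinct genes
```

Reference genes with high similarity to each other could then be clustered, which might be one route to the recurring groups, but I would welcome better ideas.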