I have a question about the appropriate manner in which to do a gene set test.
The aim is to functionally annotate a list of non-coding genes, using guilt-by-association with known coding genes, based on gene expression correlation.
Briefly: I have a list of non-coding genes (both known ENSEMBL-annotated, and de novo). I have an analysis package that calculates the correlation of the GX values (voom on RNAseq counts) of all annotated protein-coding genes vs the query non-coding gene list, then selects clusters based on two thresholds (cluster size - that is, #coding genes per non-coding gene; and correlation threshold). Each cluster is 1 non-coding vs correlated coding genes. These clusters are then bootstrapped to remove unstable associations, giving a final set of robust clusters. Typically each bootstrapped cluster has 100s of coding genes (> the cluster size threshold) that are significantly correlated with a single non-coding gene.
The idea would be to do a gene set test or GSEA using e.g. the mSigDB gene sets on the coding genes present within each cluster. This would hopefully provide clues as to the function of each ncRNA.
However, I've run into a problem: the various GSEA packages in Bioconductor do not make it clear whether they're entirely appropriate for this kind of GSEA that I would like to do. As I see it, there would be two ways to do test for the enrichment of mSigDB on my single-ncRNA-versus-coding-genes clusters:
1) the "cluster's coding genes versus sets" approach: I could subset the gene expression matrix, given the list of coding genes within each ncRNA's cluster: then I would have 1 GX matrix per cluster, consisting of columns = samples and rows = coding genes associated with a single ncRNA. I could then test for enrichment of mSigDB sets in each of these cluster's coding genes. However, in this case, GSEA would not be the correct method to use because of the subsetting of genes from the total data set; is this correct? And instead I could just test for enrichment using a hypergeometric test, or similar?
2) the "all genes' correlations" approach: the other approach I have considered would be using the correlations of *all* coding genes against each non-coding gene. So instead of a GX matrix as input for the gene set test, I would have a correlation matrix with rows = coding genes and columns = non-coding RNAs.
In this case I could then, for each ncRNA, rank all the coding genes by their correlation; then use a function such as geneSetTest() from limma to then test for enrichment of the mSigDB sets within these correlation-ranked genes.
In both cases, the key point is that the co-expression relationships between the non-coding RNA of interest, and the coding genes that will be used to functionally annotate the ncRNA, is somehow preserved. My reading suggests that using the second approach based on correlations would provide the result I'm looking for, although any previous experience and/or advice and suggestions would be very welcome!
Best wishes and thanks for any assistance in advance,