I previously asked a question about correlation analysis between host genes and pathogen genes (i.e., co-expression between host and pathogen genes). One suggested approach was to use a GLM to compare one host gene against all pathogen genes at a time, and to identify significantly correlated genes based on FDR values.
Given the large number of comparisons being made, I was wondering whether high correlations could surface even with a random ordering of the data.
If you're worried about the effect of multiple tests, the standard protection is to apply a sensible correction for multiple testing. I would suggest simply pooling the p-values from the analyses for all host genes, and then applying the Benjamini-Hochberg method to the pool to control the FDR across all tested host-pathogen pairs. This tends to be okay even when the correlation values are themselves correlated between gene pairs, which happens here because many pairs are calculated from the same values for a given gene. If your genes are correlated, correlations between correlations should just lead to lots of lower p-values, which reduces the severity of the correction due to the rank-based nature of the method and the enforced monotonicity of the adjusted p-values.
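To make the concern about random data concrete, here is a minimal sketch in base R. It is only an illustration under stated assumptions: the expression values are simulated noise, the matrix sizes are arbitrary, and `cor.test` stands in for the per-gene GLM fit discussed in this thread. It pools every pairwise p-value and applies Benjamini-Hochberg once across the pool.

```r
set.seed(1)
n_samples <- 12
n_host <- 20
n_path <- 30

# Hypothetical, purely random expression values: no true host-pathogen association.
host <- matrix(rnorm(n_host * n_samples), nrow = n_host)
path <- matrix(rnorm(n_path * n_samples), nrow = n_path)

# One test per host-pathogen pair (cor.test is a stand-in for the GLM-based test).
pvals <- matrix(NA_real_, nrow = n_host, ncol = n_path)
for (i in seq_len(n_host)) {
  for (j in seq_len(n_path)) {
    pvals[i, j] <- cor.test(host[i, ], path[j, ])$p.value
  }
}

# Pool every p-value and apply Benjamini-Hochberg once across all pairs.
fdr <- p.adjust(as.vector(pvals), method = "BH")

n_raw <- sum(pvals < 0.05)  # pairs passing an uncorrected threshold
n_fdr <- sum(fdr < 0.05)    # pairs surviving the pooled correction
```

Comparing `n_raw` with `n_fdr` shows how many chance hits the uncorrected threshold lets through versus how many remain after correction on the pooled set.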
So if I take a GLM approach (with glmFit and glmLRT), it would give me raw p-values and FDR values for each of the associated genes. If I understand correctly, the multiple testing correction can be done by taking the raw p-values from all comparisons and adjusting them together (with p.adjust)?
Yes, that's right. You'll be controlling the FDR across all host:pathogen gene pairs. The only thing you have to be careful of is just to do your book-keeping properly, i.e., make sure each adjusted p-value gets back to its correct pair.
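One way to keep the book-keeping straight is to carry the gene names alongside each p-value before adjusting. The sketch below assumes the raw p-values have already been collected into a matrix with host genes as rows and pathogen genes as columns; the gene names and the p-values themselves are made up for illustration.

```r
# Hypothetical raw p-values: rows are host genes, columns are pathogen genes.
set.seed(42)
pvals <- matrix(runif(6), nrow = 2,
                dimnames = list(c("hostA", "hostB"),
                                c("path1", "path2", "path3")))

# Flatten to one row per pair, keeping both gene names next to each p-value.
pairs <- data.frame(
  host     = rownames(pvals)[as.vector(row(pvals))],
  pathogen = colnames(pvals)[as.vector(col(pvals))],
  p.value  = as.vector(pvals),
  stringsAsFactors = FALSE
)

# Adjust over the whole pool; p.adjust() returns values in input order,
# so each adjusted value stays on the row of its own pair.
pairs$FDR <- p.adjust(pairs$p.value, method = "BH")

# Optional: put the adjusted values back into the original matrix layout.
fdr_mat <- matrix(pairs$FDR, nrow = nrow(pvals), dimnames = dimnames(pvals))
```

Because `p.adjust` preserves input order, no re-sorting is needed; the adjusted value for any pair can be looked up either in `pairs` or by name in `fdr_mat`.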
You should also consider neighborhood selection, such as what is available in the package "huge". Here, you regress each gene in turn upon *all* the other genes. This may reveal the difference between direct and indirect effects (i.e., genes A and B are marginally correlated, but conditionally independent given their shared regulator C). The problem is well-posed if you add a penalty, such as the lasso. In some circumstances, you can show that such a procedure will identify the minimal relevant set of predictors in your system. Of course, you have now turned a hypothesis testing problem into a tuning parameter selection problem. However, there are ways to control the false discovery rate using permutation-like procedures such as stability selection.
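The direct-versus-indirect distinction can be illustrated with a small simulation in base R. Note the assumptions: this is not the lasso-penalized neighborhood selection itself, only ordinary least squares with a single, known regulator, and the variables A, B and C are simulated rather than real genes.

```r
set.seed(7)
n <- 200
C <- rnorm(n)                # hypothetical shared regulator
A <- C + rnorm(n, sd = 0.5)  # A depends only on C
B <- C + rnorm(n, sd = 0.5)  # B depends only on C

# Marginal correlation: A and B look associated because both follow C.
marginal <- cor(A, B)

# Conditional association: correlate what is left of A and B after
# regressing each one on C.
conditional <- cor(resid(lm(A ~ C)), resid(lm(B ~ C)))
```

Here `marginal` is large while `conditional` is close to zero, which is the pattern a regression on all other genes is meant to expose. With thousands of genes and few samples, the penalty becomes necessary because the unpenalized regression is no longer well-posed.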
Thanks Aaron,