Correlation analysis taking into account multiple correlations
jhj89 ▴ 10
@jhj89-9623

Hi,

I previously asked a question about correlation analysis between host genes and pathogen genes (co-expression between host and pathogen genes). One of the suggested approaches was to use a GLM to compare one host gene against all the pathogen genes at a time, and to identify significantly correlated genes based on FDR values.

Given that a large number of comparisons are made, producing a large number of correlations, I was wondering whether high correlations could surface even with randomly ordered data.
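To make the concern concrete, here is a small simulation sketch in plain Python. All names and sizes (10 samples, 20 "host" and 100 "pathogen" genes, Gaussian noise) are made up for illustration and have nothing to do with real count data or edgeR. It only shows that, with few samples and many pairs, the largest correlation among unrelated variables can look impressive.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def max_abs_null_correlation(n_samples=10, n_host=20, n_pathogen=100, seed=1):
    """Largest |r| over all host-pathogen pairs when every gene is pure noise."""
    rng = random.Random(seed)
    # Hypothetical "expression" values: independent standard normal draws.
    host = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_host)]
    pathogen = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_pathogen)]
    return max(abs(pearson(h, p)) for h in host for p in pathogen)

max_r = max_abs_null_correlation()
print(f"largest |r| among 2000 unrelated pairs: {max_r:.2f}")
```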

This post (http://stats.stackexchange.com/questions/5750/look-and-you-shall-find-a-correlation) has answers that show possible ways to identify "true" correlations, such as a permutation test. Can I do something like this with edgeR as well? Thanks!

edger rna-seq correlation • 1.4k views
Aaron Lun ★ 28k
@alun

If you're worried about the effect of multiple tests, the standard protection is to apply a sensible multiple testing correction. I would suggest simply pooling the p-values from the analyses for all host genes, and then applying the Benjamini-Hochberg method to the pool to control the FDR across all tested host-pathogen pairs. This tends to be okay even when the correlations are themselves correlated between gene pairs, which happens because they are calculated from the same values for each gene. If your genes are correlated, correlations between correlations should just lead to lots of low p-values, which reduces the severity of the correction due to the rank-based nature of the method and the enforced monotonicity of the adjusted p-values.
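As a rough sketch of the pooling idea, the following plain-Python function applies the Benjamini-Hochberg step-up adjustment to a single pooled vector of p-values. The host names and p-values are invented for illustration. In R, the equivalent call would be `p.adjust(p, method="BH")` on the pooled vector.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and taking a
    # running minimum so the adjusted values stay monotone in the raw ones.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values, one list per host gene (one entry per pathogen gene).
per_host = {
    "hostA": [0.001, 0.04, 0.60],
    "hostB": [0.002, 0.03, 0.90],
}

# Pool across all host genes first, then adjust once.
pooled = [p for ps in per_host.values() for p in ps]
adj = bh_adjust(pooled)
print(adj)
```

The adjustment depends only on each p-value's rank within the pool, so adjusting each host gene separately would not give the same guarantee across all pairs.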

Thanks Aaron,

So if I take a GLM approach (with glmFit and glmLRT), it would give me raw p-values and FDR values for each of the associated genes. If I understand correctly, the multiple testing correction can be done by taking the raw p-values from all comparisons and adjusting them (with p.adjust)?

Yes, that's right. You'll be controlling the FDR across all host:pathogen gene pairs. The only thing you have to be careful of is just to do your book-keeping properly, i.e., make sure each adjusted p-value gets back to its correct pair.
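One way to keep the book-keeping safe is to carry the pair labels along with the p-values through the whole adjustment, so there is never a separate vector to re-match. The sketch below assumes a hypothetical nested dictionary `{host_gene: {pathogen_gene: raw_p}}`; the gene names and values are invented.

```python
def adjust_pairs(results):
    """Map {host: {pathogen: raw_p}} to {(host, pathogen): BH-adjusted p}."""
    # Each record keeps its labels next to its p-value.
    records = [
        (p, host, pathogen)
        for host, by_pathogen in results.items()
        for pathogen, p in by_pathogen.items()
    ]
    records.sort(key=lambda r: r[0], reverse=True)  # largest p-value first
    m = len(records)
    adjusted = {}
    running_min = 1.0
    for k, (p, host, pathogen) in enumerate(records):
        rank = m - k  # rank of this p-value in ascending order
        running_min = min(running_min, p * m / rank)
        adjusted[(host, pathogen)] = running_min
    return adjusted

# Hypothetical raw p-values from per-host analyses.
results = {
    "hostA": {"pathX": 0.001, "pathY": 0.04, "pathZ": 0.60},
    "hostB": {"pathX": 0.002, "pathY": 0.03, "pathZ": 0.90},
}
fdr = adjust_pairs(results)
for pair, value in sorted(fdr.items(), key=lambda kv: kv[1]):
    print(pair, round(value, 3))
```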

@andrew_mcdavid-11488

You should also consider neighborhood selection, such as what is available in the package "huge". Here, you regress each gene in turn upon *all* the other genes. This may reveal the difference between direct and indirect effects (i.e., genes A and B are marginally correlated, but conditionally independent given their shared regulator C). The problem is well-posed if you add a penalty, such as the lasso. In some circumstances, you can show that such a procedure will identify the minimal relevant set of predictors in your system. Of course, you have now turned a hypothesis testing problem into a tuning parameter selection problem. However, there are ways to control the false discovery rate using permutation-like procedures such as stability selection.
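The marginal versus conditional distinction can be illustrated with a toy simulation. This is not neighborhood selection and does not use the lasso or any package; it is only the simplest three-variable case, using the standard first-order partial correlation formula. The variables `a`, `b`, `c` and the noise level are assumptions chosen for the example.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

rng = random.Random(0)
n = 5000
# Hypothetical shared regulator C driving both A and B, plus independent noise.
c = [rng.gauss(0, 1) for _ in range(n)]
a = [ci + rng.gauss(0, 0.5) for ci in c]
b = [ci + rng.gauss(0, 0.5) for ci in c]

marginal = pearson(a, b)             # strong: A and B both follow C
conditional = partial_corr(a, b, c)  # near zero: no direct A-B link
print(f"marginal r = {marginal:.2f}, partial r given C = {conditional:.2f}")
```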
