Question

Correlation analysis taking into account multiple correlations

0

jhj89 ▴ 10

@jhj89-9623

Hi,

I previously asked a question about correlation analysis between host genes and pathogen genes ("Co-expression between host and pathogen genes"). One of the suggested approaches was to use a GLM to compare one host gene against all pathogen genes at a time, and to identify significantly correlated genes based on FDR values.

Given that a large number of comparisons is made, involving a large number of correlations, I am wondering whether high correlations could surface by chance, even with randomly ordered data.

This post (http://stats.stackexchange.com/questions/5750/look-and-you-shall-find-a-correlation) has answers describing ways to identify "true" correlations, such as a permutation test. Can I do something like this with edgeR as well? Thanks!
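For readers unfamiliar with the permutation test mentioned above, here is a minimal sketch for a single host-pathogen pair. It uses a plain Pearson correlation on made-up values, not the edgeR GLM test, and the names `pearson`, `permutation_pvalue`, `host` and `pathogen` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, n_perm=999, seed=0):
    """Two-sided permutation p-value for the correlation of x and y.

    Shuffling y breaks any real pairing with x, so the shuffled
    correlations show what "random ordering" alone can produce.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction: counts the observed ordering as one permutation.
    return (hits + 1) / (n_perm + 1)


# Hypothetical expression values for one host gene and one pathogen gene
# across ten samples (assumed to be on a log scale).
host = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1, 8.9, 10.0]
pathogen = [2.0, 4.1, 6.2, 7.9, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2]

p = permutation_pvalue(host, pathogen)
print(f"permutation p-value: {p:.4f}")
```

A permutation p-value for one pair still needs correcting across all pairs tested, so this does not remove the multiple testing problem on its own.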

edger rna-seq correlation • 1.4k views

ADD COMMENT • link updated 7.6 years ago by Andrew_McDavid ▴ 270 • written 7.6 years ago by jhj89 ▴ 10

1

Andrew_McDavid ▴ 270

@andrew_mcdavid-11488

United States

You should also consider neighborhood selection, such as that available in the "huge" package. Here, you regress each gene in turn upon *all* the other genes. This may reveal the difference between direct and indirect effects (i.e., genes A and B are marginally correlated, but conditionally independent given their shared regulator C). The problem is well-posed if you add a penalty, such as the lasso. In some circumstances, you can show that such a procedure will identify the minimal relevant set of predictors in your system. Of course, you have now turned a hypothesis testing problem into a tuning parameter selection problem. However, there are ways to control the false discovery rate using permutation-like procedures such as stability selection.
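To make the direct-versus-indirect point concrete, here is a rough sketch on simulated data. It is a hand-rolled lasso fitted by coordinate descent, not the API of the package named above, and the data, the penalty value and all function names are assumptions for illustration. Genes A and B are both driven by a regulator C, and A is regressed on B and C together.

```python
import random


def standardize(values):
    """Centre to mean 0 and scale to variance 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]


def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty, clipping at zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(columns, response, penalty, n_iter=200):
    """Minimise (1/2n)*||y - Xb||^2 + penalty*||b||_1.

    Assumes every column and the response are already standardized.
    """
    n = len(response)
    coefs = [0.0] * len(columns)
    residual = list(response)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Covariance of column j with the residual that excludes
            # column j's own current contribution.
            rho = sum(c * (r + coefs[j] * c) for c, r in zip(col, residual)) / n
            new_coef = soft_threshold(rho, penalty)
            delta = new_coef - coefs[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
            coefs[j] = new_coef
    return coefs


def correlation(x, y):
    """Correlation of two already-standardized sequences."""
    return sum(a * b for a, b in zip(x, y)) / len(x)


# Simulated toy data: C is a shared regulator; A and B each depend on C only.
rng = random.Random(1)
n_samples = 300
gene_c = [rng.gauss(0, 1) for _ in range(n_samples)]
gene_a = [c + rng.gauss(0, 0.5) for c in gene_c]
gene_b = [c + rng.gauss(0, 0.5) for c in gene_c]
a, b, c = (standardize(g) for g in (gene_a, gene_b, gene_c))

marginal_ab = correlation(a, b)
coef_b, coef_c = lasso_coordinate_descent([b, c], a, penalty=0.1)

print(f"marginal correlation of A and B: {marginal_ab:.2f}")
print(f"lasso coefficient for B given C: {coef_b:.2f}")
print(f"lasso coefficient for C given B: {coef_c:.2f}")
```

In this setup A and B are strongly correlated marginally, yet once C is in the model the coefficient for B is shrunk to zero or close to it. The penalty here is fixed by hand. Choosing it properly, for example via stability selection, is the tuning problem described above and is not shown.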

ADD COMMENT • link 7.6 years ago Andrew_McDavid ▴ 270

score 3 · Accepted Answer · 2016-09-15

Aaron Lun ★ 28k

@alun

The city by the bay

If you're worried about the effect of multiple tests, the standard protection is to apply a sensible multiple testing correction. I would suggest simply pooling the p-values from the analyses for all host genes, and then applying the Benjamini-Hochberg method to the pool to control the FDR across all tested host-pathogen pairs. This tends to be okay, even when the correlation values are themselves correlated between gene pairs because they are calculated from the same values for each gene. If your genes are correlated, correlations between correlations should just lead to lots of low p-values, which reduces the severity of the correction due to the rank-based nature of the method and the enforced monotonicity of the adjusted p-values.
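As a small illustration of the pooling idea, here is a sketch with a hand-written Benjamini-Hochberg adjustment in Python. The p-values and gene names are made up, and `benjamini_hochberg` is a hypothetical helper; in R the same adjustment is available as `p.adjust(..., method="BH")`.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is
    # capped by the ones above it (the enforced monotonicity).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values: one list per host gene, one entry per pathogen gene.
per_host = {
    "hostA": [0.001, 0.04],
    "hostB": [0.03, 0.5],
}

# Adjusting within each host gene only controls the FDR per host gene.
within_host = {host: benjamini_hochberg(p) for host, p in per_host.items()}

# Pooling first controls the FDR across all host-pathogen pairs.
pooled = [p for host in per_host for p in per_host[host]]
pooled_adjusted = benjamini_hochberg(pooled)

print("within-host:", within_host)
print("pooled:     ", pooled_adjusted)
```

With these numbers, the second pair for hostA has an adjusted p-value of 0.04 when adjusted within the host gene, but about 0.053 after pooling, so the two strategies can disagree at a 5% threshold.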

ADD COMMENT • link 7.6 years ago Aaron Lun ★ 28k

0

Thanks Aaron,

So if I take a GLM approach (with glmFit and glmLRT), it would give me raw p-values and FDR values for each of the associated genes. If I understand correctly, the multiple testing correction can be done by taking the raw p-values from all comparisons and adjusting them together (with p.adjust)?

ADD REPLY • link 7.6 years ago jhj89 ▴ 10

0

Yes, that's right. You'll be controlling the FDR across all host:pathogen gene pairs. The only thing you have to be careful of is just to do your book-keeping properly, i.e., make sure each adjusted p-value gets back to its correct pair.
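One way to keep the book-keeping straight is to carry the pair labels alongside the p-values through the adjustment. This is only a sketch: the pair names and p-values are hypothetical, and `bh_adjust` is a hand-written stand-in for the Benjamini-Hochberg adjustment.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values keyed by (host gene, pathogen gene).
raw = {
    ("hostA", "pathX"): 0.001,
    ("hostA", "pathY"): 0.04,
    ("hostB", "pathX"): 0.03,
    ("hostB", "pathY"): 0.5,
}

# Fix one ordering of the pairs and use it for both the input and output,
# so every adjusted value is mapped back to the pair it came from.
pairs = list(raw)
adjusted = bh_adjust([raw[pair] for pair in pairs])
fdr = dict(zip(pairs, adjusted))

significant = sorted(pair for pair in pairs if fdr[pair] < 0.05)
print(significant)
```

Keying the results by pair, rather than relying on the position in a vector, makes it harder to mismatch a value and its pair when results from many host genes are combined.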

ADD REPLY • link 7.6 years ago Aaron Lun ★ 28k