**600**wrote:

Hi,

From the `limmaUsersGuide() `

section "9.7 Multi-level Experiments" I learned about using `duplicateCorrelation() `

and passing that information to `lmFit()`

to handle observations from the same individual (using `block`

and `correlation`

arguments). Similarly, from "9.6.2 Many time points" I learned about how to multiply splines with a group variable and extract the F-statistic showing the differences between groups across time using `topTable(coef)`

. Those sections don't use `voom()`

as they were likely written before `voom()`

existed.

In any case, I'm analyzing a data set with 2 regions and individuals spanning many age groups. Some individuals yielded data for both regions, some did not. To explore the correlation by individual, I ran limma-voom with and without using `duplicateCorrelation()`

. The p-values and adjusted-p-values are smaller for the model that takes the individual correlation into account, but are fairly similar. However, a subset of the F-statistics are very different.

First of all, I was surprised by the range of some of the F-statistics (over 3 in log10). Second, this plot doesn't agree with how similar the p-values look like. How come the 2 models yield similar p-values for genes with widely different F-statistics? (the degrees of freedom should be the same)

I then explored the coefficients of interest (the interaction between the 2nd region and the time linear splines) and they look pretty different (see last pages of the pdf in the gist).

Assuming that I used the functions correctly, do you expect the F-statistics to be so different for a few of genes? What about the coefficients? I know that `lmFit()`

changes between lm.series and gls.series for the 2 different ways I used it, but I expected the coefficients to look mostly similar given that the consensus correlation is about 0.24 (much lower than the default of 0.75). Or could the lm.series vs gls.series explain the difference here? ***At this point I realize that I didn't use topTable() correctly***

Best,

Leonardo

Code used (shows parts of the design matrix and session info):

**19k**• written 6 months ago by Leonardo Collado Torres •

**600**

Hi,

While I did figure out my main problem, I decided it would still be worth posting just as an example of how completing all the Bioconductor posting guide steps can help users find their own problem.

In my case, when I was writing the last part of my question, I realized that maybe I wasn't comparing the correct genes between the 2 models. It turns out that I was missing

`sort.by = 'none'`

in my`topTable()`

calls. The fixed plots are available at .Now most of the plots makes sense. The p-values (and adjusted-p-values) are no longer so similar as they once were: from the topTable() docs I thought that topTable() sorted the genes by F in this case, not p-value (they are p-value sorted, which explains the previous plots). I'm still surprised by the range of the F-statistics though.

In general though, I'm still unsure how to determine whether to go ahead with the

`lmFit()`

results from using or not using`duplicateCorrelation()`

. How would you choose which model to use? I expected the results from using the correlation to lead to less DE calls. Using it might be too time consuming for expression features more numerous than genes (about 24k here).`## In general: DE signal among models`

`> addmargins(table('Using corr' = top$adj.P.Val < 0.05, 'Naive' = top2$adj.P.Val < 0.05))`

Naive

Using corr FALSE TRUE Sum

FALSE 2882 361 3243

TRUE 919 20490 21409

Sum 3801 20851 24652

Best,

Leonardo

600