**20**wrote:

Dear list,

I have a large collection of RNA-seq data (300 samples from different cells types, e.g. 2 samples from Basophils, 3 samples from cultured Mast cells etc; the number of replicates per cell type varies from 2 to 5). I would like to analyze the counts or the TPM data and find correlated RefSeq genes. I have estimated Kendall correlations of the TPM data. For comparison reasons, I would also like to model the association between the gene counts data by the GLM model of negative binomial distribution. In my data, I expect that non-parametric correlations and glm model will give quite different results for several gene pairs. For example,

gene1<-rnbinom(300,mu=50,size=5) gene2<-rnbinom(300,mu=50,size=5) plot(gene1,gene2) cor.test(gene1,gene2,method="kendall") summary(glm.nb(gene2 ~ gene1))

I understand that the above glm model is over-simplified (for example it does not take into account the replicates information) but at this

point I am not worried about that yet. My question is whether it is possible to incorporate the offset (library size * normalization

factor) to the glm model e.g. the way that edgeR works. Looking at edgeR source code at "mglmLS.R" function (a similar expression also exist at "mglmLevenberg.R"): mu <- exp(beta %*% t(X) + offset)

In my case, would it be correct if I estimated the offset by

offset <- getOffset()

and then I estimated the gene / gene associations by glm using

something like mu <- exp(beta %*% t(X-offset) + offset) ? if yes, is

there any reason for this model to follow the edgeR dispersion

estimation method or simply glm.nb would do what I want?

Thank you in advance,

Makis

**39k**• written 6.4 years ago by Makis Motakis •

**20**