Gene-gene correlation analysis with RNA-seq

0

Entering edit mode

Makis Motakis ▴ 30

@makis-motakis-5507

Last seen 8.7 years ago

Japan

Dear list,

I have a large collection of RNA-seq data (300 samples from different cells types, e.g. 2 samples from Basophils, 3 samples from cultured Mast cells etc; the number of replicates per cell type varies from 2 to 5). I would like to analyze the counts or the TPM data and find correlated RefSeq genes. I have estimated Kendall correlations of the TPM data. For comparison reasons, I would also like to model the association between the gene counts data by the GLM model of negative binomial distribution. In my data, I expect that non-parametric correlations and glm model will give quite different results for several gene pairs. For example,

gene1<-rnbinom(300,mu=50,size=5)
gene2<-rnbinom(300,mu=50,size=5)
plot(gene1,gene2)
cor.test(gene1,gene2,method="kendall")
summary(glm.nb(gene2 ~ gene1))

I understand that the above glm model is over-simplified (for example it does not take into account the replicates information) but at this
point I am not worried about that yet. My question is whether it is possible to incorporate the offset (library size * normalization
factor) to the glm model e.g. the way that edgeR works. Looking at edgeR source code at "mglmLS.R" function (a similar expression also exist at "mglmLevenberg.R"): mu <- exp(beta %*% t(X) + offset)

In my case, would it be correct if I estimated the offset by

offset <- getOffset()

and then I estimated the gene / gene associations by glm using
something like mu <- exp(beta %*% t(X-offset) + offset) ? if yes, is
there any reason for this model to follow the edgeR dispersion
estimation method or simply glm.nb would do what I want?

Thank you in advance,
Makis

edgeR • 8.5k views

ADD COMMENT • link updated 11 weeks ago by Dimitris • 0 • written 10.8 years ago by Makis Motakis ▴ 30

2

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Correlating all other genes with a hub gene

Dear Makis,

We often do an analysis similar to what you have described when we have a particular hub geneX of interest, and we want to find other genes correlated with geneX. With 300 samples, it might be more efficient to do this with voom rather than edgeR, just for speed. The voom analysis would go something like this.

v <- voom(counts)

Get the logCPM for GeneX:

i <- which(rownames(v)=="GeneX")
GeneX <- v$E[i,]

Get expression for all other genes:

v2 <- v[-i,]
design <- cbind(1,GeneX)
fit <- lmFit(v2,design)
fit <- eBayes(fit)
topTable(fit,coef=2)

This will correlate all the other genes with GeneX, and will rank them from most to least correlated. This is surely more efficient than
correlating individual pairs of genes using glm.nb!

This is for one GeneX, but you could repeat the analysis quickly for any other gene of interest.

The edgeR analysis would be very similar, but I am not sure how fast the dispersion estimation step will be with 300 samples. It would go something like:

   y <- DGEList(counts=counts)
   logCPM <- cpm(y,log=TRUE,prior.count)
   GeneX <- logCPM[i,]
   design <- cbind(1,GeneX)
   y2 <- y[-i,]
   y2 <- estimateDisp(y2,design)
   fit <- glmFit(y2,design)
   lrt <- glmLRT(fit)
   topTags(lrt)

Best wishes
Gordon

ADD COMMENT • link 10.8 years ago • updated 6.2 years ago Gordon Smyth 50k

0

Entering edit mode

Hi.

I have used this code to correlate the expression of 4 genes (each separately) to the the rest of genes. I've found that in some cases the correlations goes above 1 (logFC column) ¿How is this possible?

Just as control I've seen that the correlation is always 1 when comparing any of the 4 genes agains themself, meaning that genes with correlation above 1 are other genes (not the initial 4).

The only difference in my code is that my counts are RSEM not raw counts, but in theory should work as I've read somewhere else.

Thanks!

ADD REPLY • link 3.6 years ago RottenPotato • 0

0

Entering edit mode

How to visualize this result (lrt table with lfc pvalues etc ) to be able to see the expression of GeneX correlated to the other differentially expressed genes in R?

ADD REPLY • link 11 weeks ago Dimitris • 0

1

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 12 days ago

EMBL European Molecular Biology Laborat…

Dear Makis with 300 samples, it should not matter much which correlation coefficient (Kendall, Pearson, Spearman, some other definition) you use, unless you really want to depend on the behaviour of the data at the extreme tails and have the measure be dominated by few (outlier?) samples. However, it will be necessary to work on normalised counts [others are more qualified to tell you how to do that in edgeR; in DESeq2, look at 'counts', 'rlogTransformation' or 'varianceStabilizingTransformation.] I am not sure I got the point you were trying to make with the below code example - can you expand on that? Also, note that the quantity you compute from summary(glm.nb(gene2 ~ gene1)) would not be symmetric under exchange gene1 and gene2, which might hinder its interpretation as a measure of correlation. Best wishes Wolfgang On Jul 1, 2013, at 6:48 am, Makis Motakis <emotakis at="" hotmail.com=""> wrote: > > > > Dear list, > > I have a large collection of RNA-seq data (300 samples from different cells types, e.g. 2 samples from Basophils, 3 samples from cultured Mast cells etc; the number of replicates per cell type varies from 2 to 5). I would like to analyze the counts or the TPM data and find correlated RefSeq genes. I have estimated Kendall correlations of the TPM data. For comparison reasons, I would also like to model the association between the gene counts data by the GLM model of negative binomial distribution. In my data, I expect that non-parametric correlations and glm model will give quite different results for several gene pairs. For example, > > gene1<-rnbinom(300,mu=50,size=5) > gene2<-rnbinom(300,mu=50,size=5) > plot(gene1,gene2) > cor.test(gene1,gene2,method="kendall") > summary(glm.nb(gene2 ~ gene1)) > > I understand that the above glm model is over-simplified (for example it does not take into account the replicates information) but at this point I am not worried about that yet. My question is whether it is possible to incorporate the offset (library size * normalization factor) to the glm model e.g. the way that edgeR works. Looking at edgeR source code at "mglmLS.R" function (a similar expression also exist at "mglmLevenberg.R"): mu <- exp(beta %*% t(X) + offset) > > In my case, would it be correct if I estimated the offset by > > offset <- getOffset() > > and then I estimated the gene / gene associations by glm using something like mu <- exp(beta %*% t(X-offset) + offset) ? if yes, is there any reason for this model to follow the edgeR dispersion estimation method or simply glm.nb would do what I want? > > Thank you in advance, > Makis > > > > > > > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 10.8 years ago Wolfgang Huber ★ 13k

Login before adding your answer.