Question: Gene-gene correlation analysis with RNA-seq
0
gravatar for Makis Motakis
6.4 years ago by
Japan
Makis Motakis20 wrote:

Dear list,

I have a large collection of RNA-seq data (300 samples from different cells types, e.g. 2 samples from Basophils, 3 samples from cultured Mast cells etc; the number of replicates per cell type varies from 2 to 5). I would like to analyze the counts or the TPM data and find correlated RefSeq genes. I have estimated Kendall correlations of the TPM data. For comparison reasons, I would also like to model the association between the gene counts data by the GLM model of negative binomial distribution. In my data, I expect that non-parametric correlations and glm model will give quite different results for several gene pairs. For example,

gene1<-rnbinom(300,mu=50,size=5)
gene2<-rnbinom(300,mu=50,size=5)
plot(gene1,gene2)
cor.test(gene1,gene2,method="kendall")
summary(glm.nb(gene2 ~ gene1))

I understand that the above glm model is over-simplified (for example it does not take into account the replicates information) but at this
point I am not worried about that yet. My question is whether it is possible to incorporate the offset (library size *  normalization
factor) to the glm model e.g. the way that edgeR works. Looking at edgeR source code at "mglmLS.R"  function  (a similar expression also exist at "mglmLevenberg.R"): mu <- exp(beta %*% t(X) + offset)

In my case, would it be correct if I estimated the offset by

offset <- getOffset()

and then I estimated the gene / gene associations by glm using
something like mu <- exp(beta %*% t(X-offset) + offset) ? if yes, is
there any reason for this model to follow the edgeR dispersion
estimation method or simply glm.nb would do what I want?

Thank you in advance,
Makis

edger • 5.4k views
ADD COMMENTlink modified 22 months ago by Gordon Smyth39k • written 6.4 years ago by Makis Motakis20
Answer: [bioc] correlation analysis of RNA-seq
2
gravatar for Gordon Smyth
6.4 years ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

Correlating all other genes with a hub gene

Dear Makis,

We often do an analysis similar to what you have described when we have a particular hub geneX of interest, and we want to find other genes correlated with geneX.  With 300 samples, it might be more efficient to do this with voom rather than edgeR, just for speed.  The voom analysis would go something like this.

v <- voom(counts)

Get the logCPM for GeneX:

i <- which(rownames(v)=="GeneX")
GeneX <- v$E[i,]

Get expression for all other genes:

v2 <- v[-i,]
design <- cbind(1,GeneX)
fit <- lmFit(v2,design)
fit <- eBayes(fit)
topTable(fit,coef=2)

This will correlate all the other genes with GeneX, and will rank them from most to least correlated.  This is surely more efficient than
correlating individual pairs of genes using glm.nb!

This is for one GeneX, but you could repeat the analysis quickly for any other gene of interest.

The edgeR analysis would be very similar, but I am not sure how fast the dispersion estimation step will be with 300 samples.  It would go something like:

   y <- DGEList(counts=counts)
   logCPM <- cpm(y,log=TRUE,prior.count)
   GeneX <- logCPM[i,]
   design <- cbind(1,GeneX)
   y2 <- y[-i,]
   y2 <- estimateDisp(y2,design)
   fit <- glmFit(y2,design)
   lrt <- glmLRT(fit)
   topTags(lrt)

Best wishes
Gordon

 

ADD COMMENTlink modified 22 months ago • written 6.4 years ago by Gordon Smyth39k
Answer: [bioc] correlation analysis of RNA-seq
1
gravatar for Wolfgang Huber
6.4 years ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber13k wrote:
Dear Makis with 300 samples, it should not matter much which correlation coefficient (Kendall, Pearson, Spearman, some other definition) you use, unless you really want to depend on the behaviour of the data at the extreme tails and have the measure be dominated by few (outlier?) samples. However, it will be necessary to work on normalised counts [others are more qualified to tell you how to do that in edgeR; in DESeq2, look at 'counts', 'rlogTransformation' or 'varianceStabilizingTransformation.] I am not sure I got the point you were trying to make with the below code example - can you expand on that? Also, note that the quantity you compute from summary(glm.nb(gene2 ~ gene1)) would not be symmetric under exchange gene1 and gene2, which might hinder its interpretation as a measure of correlation. Best wishes Wolfgang On Jul 1, 2013, at 6:48 am, Makis Motakis <emotakis at="" hotmail.com=""> wrote: > > > > Dear list, > > I have a large collection of RNA-seq data (300 samples from different cells types, e.g. 2 samples from Basophils, 3 samples from cultured Mast cells etc; the number of replicates per cell type varies from 2 to 5). I would like to analyze the counts or the TPM data and find correlated RefSeq genes. I have estimated Kendall correlations of the TPM data. For comparison reasons, I would also like to model the association between the gene counts data by the GLM model of negative binomial distribution. In my data, I expect that non-parametric correlations and glm model will give quite different results for several gene pairs. For example, > > gene1<-rnbinom(300,mu=50,size=5) > gene2<-rnbinom(300,mu=50,size=5) > plot(gene1,gene2) > cor.test(gene1,gene2,method="kendall") > summary(glm.nb(gene2 ~ gene1)) > > I understand that the above glm model is over-simplified (for example it does not take into account the replicates information) but at this point I am not worried about that yet. My question is whether it is possible to incorporate the offset (library size * normalization factor) to the glm model e.g. the way that edgeR works. Looking at edgeR source code at "mglmLS.R" function (a similar expression also exist at "mglmLevenberg.R"): mu <- exp(beta %*% t(X) + offset) > > In my case, would it be correct if I estimated the offset by > > offset <- getOffset() > > and then I estimated the gene / gene associations by glm using something like mu <- exp(beta %*% t(X-offset) + offset) ? if yes, is there any reason for this model to follow the edgeR dispersion estimation method or simply glm.nb would do what I want? > > Thank you in advance, > Makis > > > > > > > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENTlink written 6.4 years ago by Wolfgang Huber13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 272 users visited in the last hour