genereating correlation matrix from gene expression data
2
1
Entering edit mode
pankaj borah ▴ 120
@pankaj-borah-3804
Last seen 9.7 years ago
Hi all, i am a new user of R and Bioconductor package. i deal with gene expression data. i was wondering if there is a way to generate a correlation matrix (all to all square symmetric matrix) forĀ  a set of genes and their expression values. is there any available function? Pankaj Barah Dept of Biology NTNU, Trondheim The INTERNET now has a personality. YOURS! See your Yahoo! Homepage. [[alternative HTML version deleted]]
• 9.8k views
ADD COMMENT
2
Entering edit mode
@steve-lianoglou-2771
Last seen 14 months ago
United States
Hi Pankaj, On Nov 18, 2009, at 1:30 PM, pankaj borah wrote: > Hi all, > > i am a new user of R and Bioconductor package. i deal with gene > expression data. i was wondering if there is a way to generate a > correlation matrix (all to all square symmetric matrix) for a set > of genes and their expression values. is there any available function? Try ?cor to get the pairwise correlation between columns of your matrix, eg: R> x <- matrix(rnorm(20), 4, 5) R> x [,1] [,2] [,3] [,4] [,5] [1,] 0.5818808 0.58781709 2.5511157 -1.5332180 0.6905731 [2,] -0.7640311 0.25960974 0.7246655 -1.5539226 -0.5459625 [3,] -1.7141619 -0.25808091 -0.0868366 -0.6547804 0.4629494 [4,] -2.2906217 0.04932864 0.5694895 0.7736206 0.4078665 R> cor(x) [,1] [,2] [,3] [,4] [,5] [1,] 1.00000000 0.851982381 0.8715055 -0.8404848 0.074770380 [2,] 0.85198238 1.000000000 0.9429553 -0.5338964 0.004627258 [3,] 0.87150545 0.942955253 1.0000000 -0.4719936 0.325584972 [4,] -0.84048477 -0.533896414 -0.4719936 1.0000000 0.309414781 [5,] 0.07477038 0.004627258 0.3255850 0.3094148 1.000000000 If your genes are in rows, you'll just need to pass in the transpose of your matrix. HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT
1
Entering edit mode
Pankaj, in addition to what Steve provided below, if you are dealing with expression data, it might be useful for you the function qpPCC from the bioconductor package 'qpgraph'. it performs the same calculation as the 'cor' function illustrated in Steve's message but in addition it accepts as input parameter not only a matrix or a data frame but also an ExpressionSet object (not a big deal) and, more interestingly, it calculates for you the P-values of a two sided test for the null hypothesis of zero Pearson correlation. this latter feature might be useful since as far as i know if one wants to test every pair of genes and get the P-values one needs to call 'cor.test' for each pair of genes within a two nested for loops or some 'apply' call and this function 'qpPCC' does it using matrix calculations and thus is faster for that purpose. of course, if you are interested in determining significance through the P-values you will have to deal with the multiple testing problem. library(qpgraph) set.seed(123) x <- matrix(rnorm(20), 4, 5) round(x, digits=3) [,1] [,2] [,3] [,4] [,5] [1,] -0.560 0.129 -0.687 0.401 0.498 [2,] -0.230 1.715 -0.446 0.111 -1.967 [3,] 1.559 0.461 1.224 -0.556 0.701 [4,] 0.071 -1.265 0.360 1.787 -0.473 lapply(qpPCC(x), round, digits=3) $R 1 2 3 4 5 1 1.000 -0.016 0.958 -0.490 0.437 2 -0.016 1.000 -0.271 -0.753 -0.462 3 0.958 -0.271 1.000 -0.218 0.431 4 -0.490 -0.753 -0.218 1.000 -0.198 5 0.437 -0.462 0.431 -0.198 1.000 $P 1 2 3 4 5 1 0.000 0.984 0.042 0.510 0.563 2 0.984 0.000 0.729 0.247 0.538 3 0.042 0.729 0.000 0.782 0.569 4 0.510 0.247 0.782 0.000 0.802 5 0.563 0.538 0.569 0.802 0.000 cor.test(x[,1], x[,2]) Pearson's product-moment correlation data: x[, 1] and x[, 2] t = -0.0231, df = 2, p-value = 0.9837 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.962 0.960 sample estimates: cor -0.0163 as a final shameless remark :) if you are interested in trying to sort out genuine from spurious correlations you might want to calculate the so-called "partial correlations" (see http://en.wikipedia.org/wiki/Partial_correlation) which in fact one cannot calculate with gene expression data because typically you will many more genes than samples. instead, as a sort of surrogate for partial correlations, you might find useful to calculate the so-called non-rejection rates through the functions 'qpNrr' or 'qpAvgNrr' from the 'qpgraph' package (look at the corresponding help pages if you're interested). best, robert. On Wed, 2009-11-18 at 14:05 -0500, Steve Lianoglou wrote: > Hi Pankaj, > > On Nov 18, 2009, at 1:30 PM, pankaj borah wrote: > > > Hi all, > > > > i am a new user of R and Bioconductor package. i deal with gene > > expression data. i was wondering if there is a way to generate a > > correlation matrix (all to all square symmetric matrix) for a set > > of genes and their expression values. is there any available function? > > Try ?cor to get the pairwise correlation between columns of your > matrix, eg: > > R> x <- matrix(rnorm(20), 4, 5) > R> x > [,1] [,2] [,3] [,4] [,5] > [1,] 0.5818808 0.58781709 2.5511157 -1.5332180 0.6905731 > [2,] -0.7640311 0.25960974 0.7246655 -1.5539226 -0.5459625 > [3,] -1.7141619 -0.25808091 -0.0868366 -0.6547804 0.4629494 > [4,] -2.2906217 0.04932864 0.5694895 0.7736206 0.4078665 > > R> cor(x) > [,1] [,2] [,3] [,4] [,5] > [1,] 1.00000000 0.851982381 0.8715055 -0.8404848 0.074770380 > [2,] 0.85198238 1.000000000 0.9429553 -0.5338964 0.004627258 > [3,] 0.87150545 0.942955253 1.0000000 -0.4719936 0.325584972 > [4,] -0.84048477 -0.533896414 -0.4719936 1.0000000 0.309414781 > [5,] 0.07477038 0.004627258 0.3255850 0.3094148 1.000000000 > > If your genes are in rows, you'll just need to pass in the transpose > of your matrix. > > HTH, > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
1
Entering edit mode
Robert Castelo ★ 3.3k
@rcastelo
Last seen 1 day ago
Barcelona/Universitat Pompeu Fabra
hi Pankaj, since you mentioned you're a new user of R and Bioconductor let me tell you that you should take a look to the posting guide at http://www.bioconductor.org/docs/postingGuide.html if you have not done so and try to follow the guidelines which, among other things, require to respond to everyone in the list in order to ensure that the response is archived so that the emails work as a knowledge base for all of us and future users. regarding you question below, tab-separated .txt format is fine to import the data into R and for that purpose you don't need any specific function of the 'qpgraph' package but the general-purpose reading function 'read.table'. depending on whether you have row and column headers you might need to search for appropriate values of its arguments which you can do by looking at its help page with: help(read.table) but probably a call like this one should work for you: expData <- read.table("filename.txt", header=TRUE, sep="\t") a simple QC for the previous operation is to call the function 'head' which will allow you to check whether you have read what you expect: head(expData) then you can do the calls specified below, like: myPCCs <- qpPCC(expData) regarding the particular functions 'set.seed', 'matrix', 'rnorm' and 'round' i put them there in order to give you an example that you can reproduce in your computer. that is, by copying and pasting the example below (or the example that Steve gave you) into your R session, you should get the same results in your computer and that should help you in understanding the solution. look at the help pages of these functions in order to understand what they do and you should be able then to figure out why are they given in that particular order. let me know if after looking to the help pages you still have problems in understanding the example. cheers, robert. > Hi Robert, > > Thanks for your reply. I have my data as follows : > > At2g14610.1 5.66901 5.13719 5.7854 5.59656 > At2g14610_41.45 5.82303 4.88138 5.7313 5.23123 > At3g26830.1 4.76497 4.5498 4.96243 4.85919 > At2g24850_37.1 4.08244 4.22297 5.39891 4.9186 > At4g10500.1 4.3768 4.15227 4.76488 5.05476 > At1g15520.1 4.42941 3.92527 4.60306 5.17746 > At3g28510.1 4.67999 3.61874 4.75607 4.94384 > At1g57630.1 4.44181 4.20935 4.28739 4.50546 > At3g22600.1 4.64839 3.89768 4.48499 4.3105 > At2g24850.1 3.72857 4.09308 4.85082 4.57857 > At3g57260.1 4.40517 3.80811 4.52704 4.4601 > > Here 20 genes in rows and 4 expression values in columns. I have this > file in xls as well as tab separated .txt format. How can import this > matrix to 'qpgraph'? and what these function mean: > > set.seed(123) > x <- matrix(rnorm(20), 4, 5) > > round(x, digits=3) > > > Thanks and regards, > > Pankaj Barah > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- On Thu, 19/11/09, Robert Castelo <robert.castelo at="" upf.edu=""> wrote: > > From: Robert Castelo <robert.castelo at="" upf.edu=""> > Subject: Re: [BioC] genereating correlation matrix from gene > expression data > To: "pankaj borah" <pankajborah2k3 at="" yahoo.co.in=""> > Cc: bioconductor at stat.math.ethz.ch > Date: Thursday, 19 November, 2009, 2:33 PM > > Pankaj, > > in addition to what Steve provided below, if you are dealing > with > expression data, it might be useful for you the function qpPCC > from the > bioconductor package 'qpgraph'. it performs the same > calculation as the > 'cor' function illustrated in Steve's message but in addition > it accepts > as input parameter not only a matrix or a data frame but also > an > ExpressionSet object (not a big deal) and, more interestingly, > it > calculates for you the P-values of a two sided test for the > null > hypothesis of zero Pearson correlation. this latter feature > might be > useful since as far as i know if one wants to test every pair > of genes > and get the P-values one needs to call 'cor.test' for each > pair of genes > within a two nested for loops or some 'apply' call and this > function > 'qpPCC' does it using matrix calculations and thus is faster > for that > purpose. of course, if you are interested in determining > significance > through the P-values you will have to deal with the multiple > testing > problem. > > library(qpgraph) > > set.seed(123) > x <- matrix(rnorm(20), 4, 5) > > round(x, digits=3) > [,1] [,2] [,3] [,4] [,5] > [1,] -0.560 0.129 -0.687 0.401 0.498 > [2,] -0.230 1.715 -0.446 0.111 -1.967 > [3,] 1.559 0.461 1.224 -0.556 0.701 > [4,] 0.071 -1.265 0.360 1.787 -0.473 > > lapply(qpPCC(x), round, digits=3) > $R > 1 2 3 4 5 > 1 1.000 -0.016 0.958 -0.490 0.437 > 2 -0.016 1.000 -0.271 -0.753 -0.462 > 3 0.958 -0.271 1.000 -0.218 0.431 > 4 -0.490 -0.753 -0.218 1.000 -0.198 > 5 0.437 -0.462 0.431 -0.198 1.000 > > $P > 1 2 3 4 5 > 1 0.000 0.984 0.042 0.510 0.563 > 2 0.984 0.000 0.729 0.247 0.538 > 3 0.042 0.729 0.000 0.782 0.569 > 4 0.510 0.247 0.782 0.000 0.802 > 5 0.563 0.538 0.569 0.802 0.000 > > cor.test(x[,1], x[,2]) > > Pearson's product-moment correlation > > data: x[, 1] and x[, 2] > t = -0.0231, df = 2, p-value = 0.9837 > alternative hypothesis: true correlation is not equal to 0 > 95 percent confidence interval: > -0.962 0.960 > sample estimates: > cor > -0.0163 > > as a final shameless remark :) if you are interested in trying > to sort > out genuine from spurious correlations you might want to > calculate the > so-called "partial correlations" (see > http://en.wikipedia.org/wiki/Partial_correlation) which in > fact one > cannot calculate with gene expression data because typically > you will > many more genes than samples. instead, as a sort of surrogate > for > partial correlations, you might find useful to calculate the > so-called > non-rejection rates through the functions 'qpNrr' or > 'qpAvgNrr' from the > 'qpgraph' package (look at the corresponding help pages if > you're > interested). > > best, > robert. > > On Wed, 2009-11-18 at 14:05 -0500, Steve Lianoglou wrote: > > Hi Pankaj, > > > > On Nov 18, 2009, at 1:30 PM, pankaj borah wrote: > > > > > Hi all, > > > > > > i am a new user of R and Bioconductor package. i deal with > gene > > > expression data. i was wondering if there is a way to > generate a > > > correlation matrix (all to all square symmetric matrix) > for a set > > > of genes and their expression values. is there any > available function? > > > > Try ?cor to get the pairwise correlation between columns of > your > > matrix, eg: > > > > R> x <- matrix(rnorm(20), 4, 5) > > R> x > > [,1] [,2] [,3] [,4] > [,5] > > [1,] 0.5818808 0.58781709 2.5511157 -1.5332180 0.6905731 > > [2,] -0.7640311 0.25960974 0.7246655 -1.5539226 -0.5459625 > > [3,] -1.7141619 -0.25808091 -0.0868366 -0.6547804 0.4629494 > > [4,] -2.2906217 0.04932864 0.5694895 0.7736206 0.4078665 > > > > R> cor(x) > > [,1] [,2] [,3] [,4] > [,5] > > [1,] 1.00000000 0.851982381 0.8715055 -0.8404848 > 0.074770380 > > [2,] 0.85198238 1.000000000 0.9429553 -0.5338964 > 0.004627258 > > [3,] 0.87150545 0.942955253 1.0000000 -0.4719936 > 0.325584972 > > [4,] -0.84048477 -0.533896414 -0.4719936 1.0000000 > 0.309414781 > > [5,] 0.07477038 0.004627258 0.3255850 0.3094148 > 1.000000000 > > > > If your genes are in rows, you'll just need to pass in the > transpose > > of your matrix. > > > > HTH, > > > > -steve > > > > -- > > Steve Lianoglou > > Graduate Student: Computational Systems Biology > > | Memorial Sloan-Kettering Cancer Center > > | Weill Medical College of Cornell University > > Contact Info: http://cbio.mskcc.org/~lianos/contact > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > ______________________________________________________________________ > The INTERNET now has a personality. YOURS! See your Yahoo! Homepage.
ADD COMMENT

Login before adding your answer.

Traffic: 418 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6