Question

stat/math question on Category vignette


Kimpel, Mark W ▴ 890

@kimpel-mark-w-727

Last seen 11.2 years ago

I am working my way through the Category vignette and have a question as to how the t statistics for categories are computed from the incidence matrix and individual probeset t-statistics. The code that does this can be found on the bottom of page 3 (development version vignette) and is as follows:

There are 135 pathways (categories)...
A = AmER2 %*% tobs$statistic
A = tA/sqrt(rs2)
ames(tA) = row.names(AmER2)

I know this is matrix multiplication, but I don't know the mathematical or statistical basis for the computation. I am interested in turning the t statistic values in tA into p values, so I need to know the df for each resultant t. Is that the rs2?

This is no doubt a simple question for the statisticians in the group, but not for me! :)

Thanks for your help,
Mark

--
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 663-0513 Home (no voice mail please)

Pathways Category • 1.3k views

updated 18.2 years ago by Seth Falcon ★ 7.4k • written 18.2 years ago by Kimpel, Mark W ▴ 890

score 0 · Answer 1 · 2007-08-23

Hi Mark,

Mark W Kimpel <mkimpel at iupui.edu> writes:
> I am working my way through the Category vignette and have a question as
> to how the t statistics for categories are computed from the incidence
> matrix and individual probeset t-statistics. The code that does this can
> be found on the bottom of page 3 (development version vignette) and is
> as follows:
>
> There are 135 pathways (categories)...
> A = AmER2 %*% tobs$statistic
> A = tA/sqrt(rs2)
> ames(tA) = row.names(AmER2)
>
> I know this is matrix multiplication, but don't know the mathematical or
> statistical basis for the computation. I am interested in turning the t
> statistic values in tA into p values, so I need to know the df. for each
> resultant t. Is that the rs2?

Each row of the matrix represents a gene set (a category) and each column a gene. Each cell in the matrix is 0 or 1 depending on whether the given gene is in the given gene set. The vector tobs$statistic has the t-stat for each gene. The matrix multiplication is a convenient way to obtain the sum of the t-stats for each gene set.

Does that help?

+ seth

--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
BioC: http://bioconductor.org/
Blog: http://userprimary.net/user/
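To make the summation concrete, here is a minimal sketch in base R on made-up data. The names `Amat` and `tstat` are hypothetical stand-ins for the vignette's AmER2 and tobs$statistic, and the numbers are invented for illustration only.

```r
# Toy incidence matrix: 2 gene sets (rows) x 4 genes (columns).
# A 1 means the gene belongs to the set. All values are made up.
Amat <- matrix(c(1, 1, 0, 0,
                 0, 1, 1, 1),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("setA", "setB"), paste0("g", 1:4)))

# Hypothetical per-gene t-statistics, in the same order as the columns of Amat.
tstat <- c(g1 = 2.0, g2 = -1.0, g3 = 0.5, g4 = 2.5)

# Matrix product: the 0/1 entries in each row select and add that set's t-statistics.
set_sums <- drop(Amat %*% tstat)

# The same sums computed explicitly, one gene set at a time.
by_hand <- apply(Amat, 1, function(x) sum(tstat[as.logical(x)]))

# Both routes should agree.
stopifnot(isTRUE(all.equal(set_sums, by_hand)))
print(set_sums)
```

The matrix product is simply the vectorised form of the row-by-row loop, which matters once there are hundreds of gene sets and thousands of probesets.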

score 0 · Answer 2 · 2007-08-23

Hi Mark,

Mark W Kimpel wrote:
> I am working my way through the Category vignette and have a question as
> to how the t statistics for categories are computed from the incidence
> matrix and individual probeset t-statistics. The code that does this can
> be found on the bottom of page 3 (development version vignette) and is
> as follows:
>
> There are 135 pathways (categories)...
> A = AmER2 %*% tobs$statistic
> A = tA/sqrt(rs2)
> ames(tA) = row.names(AmER2)

Actually you have a typo here. It should read

    tA = AmER2 %*% tobs$statistic
    tA = tA/sqrt(rs2)

As for the computation being done here, it is actually very simple. AmER2 is a matrix of dimension [npathways x nprobesets], where npathways is the number of pathways you are interrogating, and nprobesets is the number of probesets that remain after you do all the filtering steps that preceded this part. Each row of AmER2 consists of zeros and ones: a zero if the corresponding probeset doesn't map to that particular pathway, and a one if it does.

By computing AmER2 %*% tobs$statistic, we are (in one shot) doing the same as

    apply(AmER2, 1, function(x) sum(tobs$statistic[as.logical(x)]))

In other words, we are just summing, for each row, the t-statistics of the probesets that are in a particular pathway. Since a different number of statistics is summed for each pathway, we then divide by sqrt(rs2), which is just the square root of the number of t-statistics summed. We do this to normalize the sums.

> I know this is matrix multiplication, but don't know the mathematical or
> statistical basis for the computation. I am interested in turning the t
> statistic values in tA into p values, so I need to know the df. for each
> resultant t. Is that the rs2?

So to answer this question, the values in tA aren't t-statistics. They are sums of t-statistics. If you look at the top of the page you are quoting, you can see that if we make some assumptions, these values are approximately multivariate normal, so you don't need to know the df. If you don't want to assume multivariate normal, you can permute to get the p-value as is done on page 6.

Best,
Jim

> This is no doubt a simple question for the statisticians in the group,
> but not for me! :) Thanks for your help,
>
> Mark

--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
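As a rough illustration of the normalization and the normal approximation described above, here is a small base R sketch on invented data. `Amat`, `tstat`, and `set_size` are hypothetical names; `set_size` plays the role that rs2 has in this answer's description (the number of t-statistics summed per pathway). The p-values assume each scaled sum is approximately standard normal under the null, which in turn assumes the per-probeset statistics are roughly independent with unit variance.

```r
# Toy incidence matrix: 2 pathways (rows) x 4 probesets (columns). Values are made up.
Amat <- matrix(c(1, 1, 0, 0,
                 0, 1, 1, 1),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("setA", "setB"), paste0("g", 1:4)))

# Hypothetical per-probeset t-statistics, ordered like the columns of Amat.
tstat <- c(2.0, -1.0, 0.5, 2.5)

# Number of probesets in each pathway (assumed here to be what rs2 holds).
set_size <- rowSums(Amat)

# Sum of t-statistics per pathway, scaled by the square root of the pathway size.
z <- drop(Amat %*% tstat) / sqrt(set_size)

# Two-sided p-values from the standard normal approximation; no df is involved.
pval <- 2 * pnorm(-abs(z))

print(data.frame(set_size = set_size, z = z, pval = pval))
```

If the independence assumption looks doubtful for your data, the permutation approach mentioned above is the alternative; it needs the original expression matrix and sample labels, which this sketch does not include.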