I am working my way through the Category vignette and have a question as
to how the t statistics for categories are computed from the incidence
matrix and individual probeset t-statistics. The code that does this can
be found on the bottom of page 3 (development version vignette) and is
as follows:
There are 135 pathways (categories)...
A = AmER2 %*% tobs$statistic
A = tA/sqrt(rs2)
ames(tA) = row.names(AmER2)
I know this is matrix multiplication, but don't know the mathematical or
statistical basis for the computation. I am interested in turning the t
statistic values in tA into p values, so I need to know the df. for each
resultant t. Is that the rs2?
This is no doubt a simple question for the statisticians in the group,
but not for me! :) Thanks for your help,
Mark
--
---
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 663-0513 Home (no voice mail please)
Hi Mark,
Mark W Kimpel <mkimpel at iupui.edu> writes:
> I am working my way through the Category vignette and have a question as
> to how the t statistics for categories are computed from the incidence
> matrix and individual probeset t-statistics. The code that does this can
> be found on the bottom of page 3 (development version vignette) and is
> as follows:
>
> There are 135 pathways (categories)...
> A = AmER2 %*% tobs$statistic
> A = tA/sqrt(rs2)
> ames(tA) = row.names(AmER2)
>
> I know this is matrix multiplication, but don't know the mathematical or
> statistical basis for the computation. I am interested in turning the t
> statistic values in tA into p values, so I need to know the df. for each
> resultant t. Is that the rs2?
Each row of the matrix represents a gene set (a category) and each
column a gene. Each cell in the matrix is 0/1 depending on whether
the given gene is in the given gene set.
The vector tobs$statistic has the t-stat for each gene. The matrix
multiplication is a convenient way to obtain the sum of the t-stats
for each gene set.
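To make that concrete, here is a small toy sketch. The names A and tstat and the numbers are made up for illustration; they are not taken from the vignette.

```r
# Toy data; A and tstat are hypothetical stand-ins for AmER2 and tobs$statistic.
A <- matrix(c(1, 1, 0, 0,
              0, 1, 1, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("setA", "setB"), paste0("g", 1:4)))
tstat <- c(g1 = 2.0, g2 = -1.0, g3 = 0.5, g4 = 3.0)

# Each entry of the product is the sum of the t-stats of the genes in that set:
# setA uses g1 and g2, setB uses g2, g3 and g4.
sums <- drop(A %*% tstat)
sums
```

Here setA gets 2.0 + (-1.0) and setB gets (-1.0) + 0.5 + 3.0, because the zeros in each row drop the genes that are not in the set.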
Does that help?
+ seth
--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
BioC: http://bioconductor.org/
Blog: http://userprimary.net/user/
Hi Mark,
Mark W Kimpel wrote:
> I am working my way through the Category vignette and have a question as
> to how the t statistics for categories are computed from the incidence
> matrix and individual probeset t-statistics. The code that does this can
> be found on the bottom of page 3 (development version vignette) and is
> as follows:
>
> There are 135 pathways (categories)...
> A = AmER2 %*% tobs$statistic
> A = tA/sqrt(rs2)
> ames(tA) = row.names(AmER2)
Actually you have a typo here. It should read
tA = AmER2 %*% tobs$statistic
tA = tA/sqrt(rs2)
As for the computation being done here, it is actually very simple.
AmER2 is a matrix of dimension [npathways x nprobesets], where npathways
is the number of pathways you are interrogating, and nprobesets is the
number of probesets that remain after you do all the filtering steps
that preceded this part.
Each row of AmER2 consists of zeros and ones; a zero if the
corresponding probeset doesn't map to that particular pathway, and a one
if it does. By computing AmER2 %*% tobs$statistic, we are (in one shot)
doing the same as
apply(AmER2, 1, function(x) sum(tobs$statistic[as.logical(x)]))
In other words, we are just summing for each row the t-statistics of the
probesets that are in a particular pathway. Since a different number of
statistics is summed for each pathway, we then divide by
sqrt(rs2), which is just the square root of the number of t-statistics
summed. We do this to normalize the sums.
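The equivalence and the normalization can be checked on simulated data. This is only a sketch: Am, tstat and rs are hypothetical stand-ins for AmER2, tobs$statistic and rs2, and the data are random. The reason for sqrt rather than a plain count is that, if the t-statistics were independent with variance close to 1, a sum of n of them would have variance close to n, so dividing by sqrt(n) brings it back to roughly unit variance.

```r
set.seed(1)
# Hypothetical stand-ins for AmER2 and tobs$statistic from the vignette.
nprobes <- 50
Am <- matrix(rbinom(5 * nprobes, 1, 0.3), nrow = 5,
             dimnames = list(paste0("path", 1:5), NULL))
tstat <- rnorm(nprobes)

# The matrix product and the explicit row-wise sum give the same answer.
viaProduct <- drop(Am %*% tstat)
viaApply <- apply(Am, 1, function(x) sum(tstat[as.logical(x)]))
all.equal(viaProduct, viaApply)

# rs plays the role of rs2: the number of probesets in each pathway.
rs <- rowSums(Am)
tA <- viaProduct / sqrt(rs)
tA
```

Note that a pathway with no probesets would give a division by zero here, so this assumes empty rows have already been filtered out.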
>
> I know this is matrix multiplication, but don't know the mathematical or
> statistical basis for the computation. I am interested in turning the t
> statistic values in tA into p values, so I need to know the df. for each
> resultant t. Is that the rs2?
So to answer this question, the values in tA aren't t-statistics. They
are sums of t-statistics. If you look at the top of the page you are
quoting, you can see that if we make some assumptions, these values are
approximately multivariate normal, so you don't need to know the df.
If you don't want to assume multivariate normal, you can permute to get
the p-value as is done on page 6.
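Both routes can be sketched in base R. This is not the vignette's code: the data are simulated, every name (expr, group, Am, rowT, pathStat) is hypothetical, and shuffling the sample labels is just one common way to set up the permutation.

```r
set.seed(42)
# Simulated stand-ins (all names hypothetical): an expression matrix, two-group
# sample labels, and a 0/1 incidence matrix of pathways by probesets.
nprobes <- 40
nsamples <- 12
expr <- matrix(rnorm(nprobes * nsamples), nrow = nprobes)
group <- factor(rep(c("a", "b"), each = nsamples / 2))
Am <- matrix(rbinom(3 * nprobes, 1, 0.4), nrow = 3)

# Equal-variance two-sample t-statistic for every row of x.
rowT <- function(x, g) {
  apply(x, 1, function(r) {
    t.test(r[g == "a"], r[g == "b"], var.equal = TRUE)$statistic
  })
}

# Normalized per-pathway sum of t-statistics, as in tA.
pathStat <- function(x, g) drop(Am %*% rowT(x, g)) / sqrt(rowSums(Am))

obs <- pathStat(expr, group)

# Route 1: normal approximation, two-sided p-value from N(0, 1).
pNormal <- 2 * pnorm(-abs(obs))

# Route 2: permutation, shuffling sample labels and recomputing each time.
nperm <- 100
perm <- replicate(nperm, pathStat(expr, sample(group)))
pPerm <- rowMeans(abs(perm) >= abs(obs))
```

With so few permutations the permutation p-values are coarse (multiples of 1/nperm), so a real analysis would use many more.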
Best,
Jim
>
> This is no doubt a simple question for the statisticians in the group,
> but not for me! :) Thanks for your help,
>
> Mark
>
--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623