Entering edit mode
Ken Termiso
▴
250
@ken-termiso-1087
Last seen 10.3 years ago
Hello,
I'd like to threshold-free stats (i.e. Kolmogorov-Smirnov) to ID
overrepresented gene ontology terms..
I've got a script to do all the parsing of various GO terms, so that
part is done, and I'd really prefer to use a nuts and bolts approach
from the bottom up than to use something like TopGO..partly for my own
educational purposes..
What I would like to do is this:
1) Nonspecific filtering to remove ~50-70% of genes that are 'absent'
(Affy chips), don't change much, etc..
2) Use rank products to get two-tailed p-values for remaining genes.
Rank products (papers in FEBS letters and Bioinformatics.. pubmed for
'Breitling r[au]' and you'll find 'em)
3) Use ks.test() to compare genes from each ontology category to the
rest of the genes that are not represented in that ontology..
specifically, i'd input two vectors to ks.test() - one would be the
p-values from rank products from each ontology, the other would be the
remaining rank products p-values from all other genes that are not in
the ontology.. e.g.
x <- c(0.04, 0.003, 0.19, 1.0, 0.57, 0.0001, 0, 0.89, 0.77, 0.0245) #
GO category of interest p-values
y <- rnorm(6000, 0, 1) # Remaining p-values for all other retained
genes after nonspec filtering (I can't type 3,000 rank products'
p-values here, so I use rnorm and keep positives)
y <- subset(y, y>0) # only keep positive simulated p-values; gives ~
3,000
ks.test(x,y)
The second and third lines of code are just to create a normal
distribution of p-values here for illustration purposes; in reality I
would use p-values generated by rank products..not exactly normal but
probably the best that can be approximated.
I would think that using the p-values generated by rank products
algorithm would deal with multiple testing, as rank products p-values
are simultaneously adjusted for FDR by using permutations of sample
labels to estimate distribution. This also eliminates correlated gene
dependencies..
I'd also expect that using two-tailed p-values eliminates the issue of
having to separate up-regulated and down-regulated genes? This issue
is discussed breifly in several places, and probably most lucidly in
'How to use GOstats testing gene lists for GO term association' by
Seth Falcon and Robert Gentleman [google 'GOstatsHyperG.pdf'] and
there the authors speculate that using the two-tailed p-values is
suboptimal. The reason the authors imply (unless I am mistaken) is
because it is most desirable to find genes that go in the same
direction. I dispute this point specifically in the context of gene
ontology-related testing - quite often in genetics you can have a gain
of function mutation in a pathway (say in mapk pathway). Say when a
pathway component that normally inhibits another component gets a null
mutation. Then the downstream component that was previously inhibited
by the mutated gene is now upregulated, i.e. say you have
A -> B -x-> C
As a simple pathway where gene A normally induces gene B, and gene B
normally inhibits gene C. If gene B is downregulated (say by null
mutation), gene C is upregulated because B normally inhibits C, and
that inhibition is lost by the null mutation in B. So in reality there
are two genes affected in this simple pathway, but their directions
are opposite (A is unchanged; B goes down; C goes up). Yet the pathway
itself is profoundly changed. My point is that by keeping
directionality and NOT using two-tailed p-values, you automatically
limit detection of the scenario I just illustrated here.
Any comments, criticisms, advice would be much much obliged..
Thanks in advance,
-Ken
_________________________________________________________________
[[trailing spam removed]]