Question: Subset of genes can find a term but whole set cannot
0
2.3 years ago by
ankita.mandal285290 wrote:

I have a set of genes and I want to find out disease ontology term from that set. I have used DOSE and biomaRt R package. If I put a subset of genes, I can get a significant term but if I put the whole set of genes, that term is missing. Is it possible? Or, I have made any mistake?

biomart dose • 550 views
modified 2.3 years ago by James W. MacDonald51k • written 2.3 years ago by ankita.mandal285290

You need to show some self-contained code that others can run. Otherwise, it's not clear exactly what you are doing, nor what the problem is.

My code:

library(DOSE)
library(biomaRt)
ensembl=useEnsembl(biomart="ensembl",dataset="hsapiens_gene_ensembl",version=89)
write.table(entrez_gene,file="entrezgene.txt",sep="\t")
entrez_gene1<-entrez_gene[,1]
y<-as.data.frame(x)
write.table(y,file="disease_ontology_output.txt",sep="\t"

If I use "First_50_Selected_Gene_name.txt", I can get a significant term but if I use "First_100_Selected_Gene_name.txt", that term is missing. My question is, "Is it possible that subset can have a term but whole set can not ?"

Answer: Subset of genes can find a term but whole set cannot
1
2.3 years ago by
United States
James W. MacDonald51k wrote:

By 'self-contained' I actually meant 'code that somebody other than you could run'. Perhaps I wasn't explicit enough.

Anyway, note that enrichDO is just fitting a Fisher's exact test, which is based on a 2x2 table, where the cells of the table are

Genes in term Genes not in term
Gene is significant
Gene not significant

And if you only use half of your genes, you may affect the count of genes in all four of those cells. Say you have 7 'influenza' genes in your set, and you select all 7 regardless of whether or not you use 50 or 100 genes.

Genes in term Genes not in term
Gene is significant 7 43
Gene not significant Inf.genes - 7 Univ - 50

But then you add in the other 50 genes

Genes in term Genes not in term
Gene is significant 7 93
Gene not significant Inf.genes - 7

Univ - 100

In that situation, three of the cells used to compute the statistic have now changed, which, depending on the number of influenza genes in the universe, and the number in the universe, may affect the statistic sufficiently for it to no longer be significant.

We can make a stupid example to test this, using the example data for enrichDO:

> example(enrichDO)
<snip>
> z <- summary(yy)

fun <- function(gene, sigresult, B){
rslt <- vector(length = B)
for(i in seq(B)){
gene2 <- gene[sample(1:525, 212)] # Randomly select half the genes
tmp <- summary(enrichDO(gene2))
rslt[i] <- sum(!row.names(tmp) %in% row.names(sigresult))
}
rslt
}

> y <- fun(gene, z, 100)

> y
[1]   0  37   7 104   0   0  15   3 108  70  20 158  72   7  31  59   1   9
[19]  72  88  47 156  71  34 109  90  74  91  47  87  56  29  38 110   3  48
[37]  33 102  29   0  95   0   5  80  10  74  82   3  82  32  36 181  19   2
[55]   0  51  69  50   6  54  75   6 112  64  13  77  30   6   1  32  24  17
[73]  34  14 112  33  68  83  26  49  36  23 105  12  53  12  59  83  26   8
[91]   7   0 142  74   1  17  16  93  19  17

So we get from 0 to edit - 181 terms that appear when we use a random subset of half the original 525 genes used in the example, as compared to the terms we got in the first place.