Subset of genes can find a term but whole set cannot
1
0
Entering edit mode
@ankitamandal28529-13188
Last seen 7.6 years ago

I have a set of genes and I want to find out disease ontology term from that set. I have used DOSE and biomaRt R package. If I put a subset of genes, I can get a significant term but if I put the whole set of genes, that term is missing. Is it possible? Or, I have made any mistake?

biomart dose • 1.9k views
ADD COMMENT
0
Entering edit mode

You need to show some self-contained code that others can run. Otherwise, it's not clear exactly what you are doing, nor what the problem is.

ADD REPLY
0
Entering edit mode

My code:

 

library(DOSE)
library(biomaRt)
ensembl=useEnsembl(biomart="ensembl",dataset="hsapiens_gene_ensembl",version=89)
geneId<-readLines("First_100_Selected_Gene_name.txt")   try(entrez_gene<-getBM(attributes=c('entrezgene'),filters='hgnc_symbol',values=geneId,mart=ensembl))
write.table(entrez_gene,file="entrezgene.txt",sep="\t")
entrez_gene1<-entrez_gene[,1]
try(x<-enrichDO(gene=entrez_gene1,ont="DO",pvalueCutoff=0.05,pAdjustMethod="BH",minGSSize=5,readable=FALSE))
y<-as.data.frame(x)
write.table(y,file="disease_ontology_output.txt",sep="\t"

 

If I use "First_50_Selected_Gene_name.txt", I can get a significant term but if I use "First_100_Selected_Gene_name.txt", that term is missing. My question is, "Is it possible that subset can have a term but whole set can not ?"

ADD REPLY
1
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

By 'self-contained' I actually meant 'code that somebody other than you could run'. Perhaps I wasn't explicit enough.

Anyway, note that enrichDO is just fitting a Fisher's exact test, which is based on a 2x2 table, where the cells of the table are

  Genes in term Genes not in term
Gene is significant    
Gene not significant    

And if you only use half of your genes, you may affect the count of genes in all four of those cells. Say you have 7 'influenza' genes in your set, and you select all 7 regardless of whether or not you use 50 or 100 genes.

  Genes in term Genes not in term
Gene is significant 7 43
Gene not significant Inf.genes - 7 Univ - 50

But then you add in the other 50 genes

  Genes in term Genes not in term
Gene is significant 7 93
Gene not significant Inf.genes - 7

Univ - 100

 In that situation, three of the cells used to compute the statistic have now changed, which, depending on the number of influenza genes in the universe, and the number in the universe, may affect the statistic sufficiently for it to no longer be significant.

We can make a stupid example to test this, using the example data for enrichDO:

> example(enrichDO)
<snip>
> z <- summary(yy)

fun <- function(gene, sigresult, B){
    rslt <- vector(length = B)
    for(i in seq(B)){
        gene2 <- gene[sample(1:525, 212)] # Randomly select half the genes
        tmp <- summary(enrichDO(gene2))
        rslt[i] <- sum(!row.names(tmp) %in% row.names(sigresult))
    }
    rslt
}

> y <- fun(gene, z, 100)

> y
 [1]   0  37   7 104   0   0  15   3 108  70  20 158  72   7  31  59   1   9
 [19]  72  88  47 156  71  34 109  90  74  91  47  87  56  29  38 110   3  48
 [37]  33 102  29   0  95   0   5  80  10  74  82   3  82  32  36 181  19   2
 [55]   0  51  69  50   6  54  75   6 112  64  13  77  30   6   1  32  24  17
 [73]  34  14 112  33  68  83  26  49  36  23 105  12  53  12  59  83  26   8
 [91]   7   0 142  74   1  17  16  93  19  17

So we get from 0 to edit - 181 terms that appear when we use a random subset of half the original 525 genes used in the example, as compared to the terms we got in the first place.

ADD COMMENT

Login before adding your answer.

Traffic: 664 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6