Question: featureCounts in-built annotation doesn't match BioMart annotation?
12 months ago by
Jack0
Jack0 wrote:

Hi all

I first used the featureCounts in-built annotation (Explainnation of featureCounts in-built annotation?) and got the results.

There are 20454 genes in the results list:

a <- read.table(results_txt, header=TRUE)

dim(a)
[1] 20454    18

[Image: snapshot of the results]

In the featureCounts in-built annotation results, each gene has several values in the "Chr", "Start", "End" and "Strand" columns (Explainnation of featureCounts in-built annotation?), but what I want is a single position for each gene. So I used BioMart to get the annotation. The result contains only 17659 corresponding genes. This is not what I expected: I expected the two to match each other exactly.

b <- getBM(attributes=c("entrezgene","ensembl_gene_id","external_gene_name","chromosome_name","strand", "start_position","end_position"),
filters=c('entrezgene'),
values=a$ENTREZID,
mart=ensembl)

> dim(b)
[1] 17659     7

[Image: snapshot of the result]

Can anyone help me explain the difference?

modified 12 months ago by James W. MacDonald (48k) • written 12 months ago by Jack0
12 months ago by
United States
James W. MacDonald (48k) wrote:

You don't show enough code to say for sure. But do note that (as the help page for featureCounts states) the annot.inbuilt data come from the NCBI RefSeq annotations for a given genome build. And the biomaRt package uses data from EBI/Ensembl. NCBI and EBI are two different groups, who use different methods to say what is and isn't a gene, and where that gene comes from (and how many transcripts it has, and how many exons, etc).

So when you ask EBI to map a bunch of Entrez Gene IDs (i.e., NCBI Gene IDs) to their genomic positions, you are in essence saying 'Hey, take these IDs from this different group, map them to your own IDs as best you can, and then tell me where all the corresponding gene locations are'. It shouldn't be surprising that you lose some genes, where NCBI says it's a gene, and EBI says 'NAH'. Or maybe both groups agree that there is a gene, but they call it different things, so EBI won't map the Entrez Gene ID, because they say the gene is different.
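If you want to see exactly which IDs went missing, something along these lines should do it. This is only a sketch on made-up toy data: `a` and `b` here are small stand-ins for your featureCounts table and your getBM() result, and I am assuming the column names from your code (`ENTREZID` and `entrezgene`).

```r
# Toy stand-ins for the real objects: 'a' mimics the featureCounts table and
# 'b' mimics the getBM() result. All IDs and values are made up.
a <- data.frame(ENTREZID = c(1, 2, 3, 4, 5))
b <- data.frame(
  entrezgene       = c(1, 2, 2, 4),
  ensembl_gene_id  = c("ENSG_A", "ENSG_B1", "ENSG_B2", "ENSG_D"),
  stringsAsFactors = FALSE
)

# Entrez Gene IDs that Ensembl did not map at all
unmapped <- setdiff(a$ENTREZID, b$entrezgene)

# Entrez Gene IDs that came back with more than one Ensembl gene
n_hits <- table(b$entrezgene)
multi  <- names(n_hits)[n_hits > 1]

# nrow(b) counts Ensembl matches, so count distinct Entrez IDs separately
n_mapped <- length(unique(b$entrezgene))

print(unmapped)
print(multi)
print(n_mapped)
```

Note that one Entrez Gene ID can come back with several Ensembl gene IDs, so the number of rows in `b` is not necessarily the number of your genes that were mapped.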

And this ignores any cross-build mappings you might be doing (biomaRt is by default querying the GRCh38 genome, and if you are using the 'hg19' inbuilt annotations, then those are not even the same genome build). And it also ignores the fact that some genes are found on multiple chromosomes or haplotypes, and for those edge cases what was done for the inbuilt annotations may not be even remotely similar to what biomaRt does.
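If all you are after is one position per gene, you may not need biomaRt at all, since the positions are already in the featureCounts annotation columns. Below is a rough sketch on made-up data, assuming those columns hold semicolon-separated per-exon values (which is what your description suggests). The `ann` data frame, the `split_field` helper and the `MultiChr` flag are my own illustrative names, and taking the minimum start and maximum end is just one possible definition of a gene's position.

```r
# Toy rows in the shape of the featureCounts annotation columns
# (semicolon-separated, one entry per exon). All values are made up.
ann <- data.frame(
  GeneID = c("101", "102", "103"),
  Chr    = c("chr1;chr1;chr1", "chr2;chr2", "chrX;chrY"),
  Start  = c("100;300;500", "900;1200", "50;50"),
  End    = c("200;400;600", "1000;1500", "150;150"),
  Strand = c("+;+;+", "-;-", "+;+"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one annotation column into per-gene vectors
split_field <- function(x) strsplit(x, ";", fixed = TRUE)

chr_list    <- split_field(ann$Chr)
start_list  <- lapply(split_field(ann$Start), as.integer)
end_list    <- lapply(split_field(ann$End), as.integer)
strand_list <- split_field(ann$Strand)

# Number of distinct chromosomes each gene's exons fall on
n_chr <- vapply(chr_list, function(x) length(unique(x)), integer(1))

# One row per gene: first chromosome/strand, smallest start, largest end
gene_pos <- data.frame(
  GeneID   = ann$GeneID,
  Chr      = vapply(chr_list, `[`, character(1), 1),
  Start    = vapply(start_list, min, integer(1)),
  End      = vapply(end_list, max, integer(1)),
  Strand   = vapply(strand_list, `[`, character(1), 1),
  MultiChr = n_chr > 1,
  stringsAsFactors = FALSE
)

# A single range is not meaningful for genes annotated on several chromosomes
gene_pos[gene_pos$MultiChr, c("Chr", "Start", "End", "Strand")] <- NA

print(gene_pos)
```

Genes annotated on more than one chromosome are flagged and blanked out here rather than forced into a single range; how to treat those is a judgment call you would have to make for your own analysis.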

Anyway, long story short, you shouldn't expect to get 100% correspondence between NCBI and EBI/Ensembl, because, well, see above.