featureCounts in-built annotation doesn't match BioMart annotation?
Jack • 0
@jack-14069

Hi all

I first used the featureCounts in-built annotation (see "Explainnation of featureCounts in-built annotation?") and got the results.

There are 20454 genes in the results list:

a <- read.table(results_txt, header=TRUE)

dim(a)
[1] 20454    18

snapshot of the results

In the featureCounts in-built annotation results, each gene has several values in the "Chr", "Start", "End" and "Strand" columns (see "Explainnation of featureCounts in-built annotation?"), but what I want is a single position for each gene. So I used BioMart to get the annotation. The result contains only 17659 corresponding genes, which is not what I expected. I expected the two sets to match each other exactly.

b <- getBM(attributes=c("entrezgene","ensembl_gene_id","external_gene_name","chromosome_name","strand", "start_position","end_position"),
           filters=c('entrezgene'),
           values=a$ENTREZID,
           mart=ensembl)

dim(b)
[1] 17659     7

snapshot of the result

Can anyone help me explain the difference?

@james-w-macdonald-5106

You don't show enough code to say for sure. But do note that (as the help page for featureCounts states) the annot.inbuilt data come from the NCBI RefSeq annotations for a given genome build, whereas the biomaRt package uses data from EBI/Ensembl. NCBI and EBI are two different groups that use different methods to decide what is and isn't a gene, where that gene is located, how many transcripts it has, how many exons, and so on.

So when you ask EBI to map a bunch of Entrez Gene IDs (i.e., NCBI Gene IDs) to their genomic positions, you are in essence saying 'Hey, take these IDs from this different group, map them to your own IDs as best you can, and then tell me where all the corresponding gene locations are'. It shouldn't be surprising that you lose some genes where NCBI says it's a gene and EBI says 'NAH'. Or maybe both groups agree that there is a gene but define it differently, so EBI won't map the Entrez Gene ID, because by their definition it is a different gene.
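To see which IDs were lost in the mapping, something along these lines may help. This is only a sketch on made-up toy tables that stand in for `a` and `b` from the question; the column names `ENTREZID` and `entrezgene` are taken from the question, and everything else (the values, and the names `unmapped`, `multi` and `n_mapped`) is hypothetical.

```r
# Toy stand-ins for the two tables in the question (all values are made up).
a <- data.frame(ENTREZID = c(1, 2, 3, 4, 5))
b <- data.frame(
  entrezgene      = c(1, 2, 2, 4),
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  stringsAsFactors = FALSE
)

# Entrez IDs that were queried but never came back from the mapping.
unmapped <- setdiff(a$ENTREZID, b$entrezgene)

# Entrez IDs that came back attached to more than one Ensembl gene.
multi <- unique(b$entrezgene[duplicated(b$entrezgene)])

# Number of distinct Entrez IDs that mapped at all. nrow(b) alone can
# overstate this, because one Entrez ID may map to several Ensembl genes.
n_mapped <- length(unique(b$entrezgene))
```

Running the same three lines on the real `a` and `b` should show how many genes are truly missing, as opposed to how many rows came back.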

And this ignores any cross-build mappings you might be doing (by default biomaRt queries the current Ensembl release, which for human is the GRCh38 genome, and if you are using the 'hg19' inbuilt annotations, then those are not even the same genome build). And it also ignores the fact that some genes are found on multiple chromosomes or haplotypes, and for those edge cases what was done for the inbuilt annotations may not be even remotely similar to what biomaRt does.
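If all you need is one position per gene, one way to sidestep the cross-database and cross-build issues is to collapse the per-gene values that featureCounts already reports. The sketch below assumes those columns hold semicolon-separated values, one per exon, as the question describes; the toy table `ann` and the helper `split_field` are hypothetical, and taking the smallest start and largest end as "the" gene range is a simplification.

```r
# Toy annotation in the assumed featureCounts layout (values are made up).
ann <- data.frame(
  Geneid = c("g1", "g2"),
  Chr    = c("chr1;chr1;chr1", "chr2;chr2"),
  Start  = c("100;300;500", "900;700"),
  End    = c("200;400;600", "950;800"),
  Strand = c("+;+;+", "-;-"),
  stringsAsFactors = FALSE
)

# Split a semicolon-separated column into a list with one vector per gene.
split_field <- function(x) strsplit(x, ";", fixed = TRUE)

# Collapse repeated values; more than one value survives only for edge cases.
collapse_unique <- function(v) paste(unique(v), collapse = ";")

gene_pos <- data.frame(
  Geneid = ann$Geneid,
  Chr    = vapply(split_field(ann$Chr), collapse_unique, character(1)),
  Start  = vapply(split_field(ann$Start), function(v) min(as.integer(v)), integer(1)),
  End    = vapply(split_field(ann$End), function(v) max(as.integer(v)), integer(1)),
  Strand = vapply(split_field(ann$Strand), collapse_unique, character(1)),
  stringsAsFactors = FALSE
)

# Genes whose exons sit on more than one chromosome or strand keep a ";" in
# Chr or Strand; the Start/End range is not meaningful for those rows.
ambiguous <- grepl(";", gene_pos$Chr, fixed = TRUE) |
  grepl(";", gene_pos$Strand, fixed = TRUE)
```

The `ambiguous` flag is there precisely for the multi-chromosome edge cases mentioned above, which are better inspected by hand than collapsed automatically.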

Anyway, long story short, you shouldn't expect to get 100% correspondence between NCBI and EBI/Ensembl, because, well, see above.
