I am using getBM to get start/end positions for my genes, but the Ensembl BioMart HGNC names cover only 19,869 of my 29,581 genes. The gene names in my dataset are HUGO names, too. It seems that around 10,000 of my genes are not among the 36,713 genes in Ensembl BioMart. I checked for a potential naming difference but couldn't find anything obvious. Does anyone have an idea why this happens? Thanks!
Note that your tab-delimited file seems to contain a number of gene IDs that don't belong to the current Gencode release (Release 27) for Human. You can easily check this as follows:
1) Download the GFF3 file containing the "Comprehensive gene annotation" for ALL regions from
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("gencode.v27.chr_patch_hapl_scaff.annotation.gff3.gz")
tx_by_gene <- transcriptsBy(txdb, by="gene")
table(df$gene_id %in% names(tx_by_gene))
# FALSE  TRUE
#    13     2
So 13 of the 15 genes in your list above are not valid Human Gencode IDs! Could it be that they are from another organism and/or from an old (and obsolete) version of Gencode?
This calls the accuracy of the file into question. Knowing more about how and when the file was generated, and how the Gencode IDs in it were mapped to HUGO symbols, might help shed some light on your question.
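One way to tell apart IDs that come from a different Gencode version and IDs that are missing altogether is to compare them with and without the version suffix (the ".N" at the end of a Gencode gene ID). The following is only a rough sketch in base R: the IDs in file_ids and current_ids are made up for illustration, and strip_version is a hypothetical helper, not part of any package. In practice file_ids would be df$gene_id and current_ids would be names(tx_by_gene).

```r
# Made-up IDs standing in for df$gene_id (assumption, for illustration only)
file_ids <- c("ENSG00000000003.10", "ENSG00000000005.5", "ENSG00000999999.1")
# Made-up IDs standing in for names(tx_by_gene) (assumption)
current_ids <- c("ENSG00000000003.14", "ENSG00000000005.5")

# Hypothetical helper: drop a trailing ".<digits>" version suffix
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Exact match: same stable ID and same version
exact_match <- file_ids %in% current_ids
# Stable match: same stable ID, version ignored
stable_match <- strip_version(file_ids) %in% strip_version(current_ids)

# Classify each ID from the file
status <- ifelse(exact_match, "current",
                 ifelse(stable_match, "other_version", "not_found"))
table(status)
```

If most of the unmatched IDs end up as "other_version", the file was probably built from a different Gencode release. If they end up as "not_found", the IDs were retired or come from somewhere else entirely.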