You shouldn't use AnnotationDbi::select with an EnsDb object. There is a specific method for EnsDb objects that you are avoiding.
> showMethods(select)
Function: select (package AnnotationDbi)
> selectMethod(select, c(x = "EnsDb"))
Method Definition:
function (x, keys, columns, keytype, ...)
if (missing(keys))
keys <- NULL
if (missing(columns))
columns <- NULL
if (missing(keytype))
keytype <- "GENEID"
return(.select(x = x, keys = keys, columns = columns, keytype = keytype,
<bytecode: 0x000000004a875eb8>
<environment: namespace:ensembldb>
target "EnsDb"
defined "EnsDb"
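As a rough sketch of what using the EnsDb method looks like: attach the EnsDb package (which also attaches ensembldb) and call select() without the AnnotationDbi:: prefix, so that dispatch picks the method shown above. The two gene IDs are taken from the output further down in this answer, and the column names are my assumption of what you want back. If dplyr is attached after ensembldb, its select() will mask this one, so watch the load order.

```r
# Sketch only: assumes EnsDb.Hsapiens.v86 is installed from Bioconductor.
library(EnsDb.Hsapiens.v86)  # attaching this also attaches ensembldb

edb <- EnsDb.Hsapiens.v86

# Two example Ensembl gene IDs (illustrative; substitute your own vector)
ids <- c("ENSG00000186092", "ENSG00000187634")

# Dispatches to the EnsDb method; keytype defaults to "GENEID" anyway
res <- select(edb, keys = ids, keytype = "GENEID",
              columns = c("GENEID", "GENENAME", "GENEBIOTYPE"))
res
```

The GENEBIOTYPE column can then be used to keep only the rows equal to "protein_coding".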
I don't grok tidyverse, so I am not sure what all your code is meant to do, but it's simple enough to do using regular R code. But do note that Ensembl v86 is really old and might not be what you want (see the AnnotationHub package and its vignette). In addition, the EnsDb packages are meant in general to provide genomic locations rather than mapping ID to symbol, so biomaRt is arguably a better resource.
> gns <- genes(EnsDb.Hsapiens.v86)
> gns2 <- gns[gns$gene_biotype %in% "protein_coding"]
> gns2
GRanges object with 22285 ranges and 6 metadata columns:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000186092 1 69091-70008 + | ENSG00000186092
ENSG00000279928 1 182393-184158 + | ENSG00000279928
ENSG00000279457 1 184923-200322 - | ENSG00000279457
ENSG00000278566 1 450740-451678 - | ENSG00000278566
ENSG00000273547 1 685716-686654 - | ENSG00000273547
... ... ... ... . ...
ENSG00000205916 Y 24833843-24907040 + | ENSG00000205916
ENSG00000185894 Y 25030901-25062548 - | ENSG00000185894
ENSG00000279115 Y 25307702-25308107 + | ENSG00000279115
ENSG00000280301 Y 25463994-25473714 + | ENSG00000280301
ENSG00000172288 Y 25622162-25624902 + | ENSG00000172288
gene_name gene_biotype seq_coord_system symbol
<character> <character> <character> <character>
ENSG00000186092 OR4F5 protein_coding chromosome OR4F5
ENSG00000279928 FO538757.2 protein_coding chromosome FO538757.2
ENSG00000279457 FO538757.1 protein_coding chromosome FO538757.1
ENSG00000278566 OR4F29 protein_coding chromosome OR4F29
ENSG00000273547 OR4F16 protein_coding chromosome OR4F16
... ... ... ... ...
ENSG00000205916 DAZ4 protein_coding chromosome DAZ4
ENSG00000185894 BPY2C protein_coding chromosome BPY2C
ENSG00000279115 AC006386.1 protein_coding chromosome AC006386.1
ENSG00000280301 AC006328.1 protein_coding chromosome AC006328.1
ENSG00000172288 CDY1 protein_coding chromosome CDY1
ENSG00000186092 79501
ENSG00000279928 107984078,102725121
ENSG00000279457 <NA>
ENSG00000278566 729759,81399,26683
ENSG00000273547 81399,26683
... ...
ENSG00000205916 57135,1617
ENSG00000185894 442868
ENSG00000279115 <NA>
ENSG00000280301 <NA>
ENSG00000172288 9085
seqinfo: 357 sequences from GRCh38 genome
> head(mcols(gns2)[,1:2])
DataFrame with 6 rows and 2 columns
gene_id gene_name
<character> <character>
ENSG00000186092 ENSG00000186092 OR4F5
ENSG00000279928 ENSG00000279928 FO538757.2
ENSG00000279457 ENSG00000279457 FO538757.1
ENSG00000278566 ENSG00000278566 OR4F29
ENSG00000273547 ENSG00000273547 OR4F16
ENSG00000187634 ENSG00000187634 SAMD11
But it's probably better to use biomaRt, because it's intended for this sort of annotation lookup.
> library(biomaRt)
> mart <- useMart("ensembl","hsapiens_gene_ensembl")
> bmgns <- getBM(c("ensembl_gene_id","hgnc_symbol", "gene_biotype"), mart = mart)
> head(bmgns)
ensembl_gene_id hgnc_symbol gene_biotype
1 ENSG00000210049 MT-TF Mt_tRNA
2 ENSG00000211459 MT-RNR1 Mt_rRNA
3 ENSG00000210077 MT-TV Mt_tRNA
4 ENSG00000210082 MT-RNR2 Mt_rRNA
5 ENSG00000209082 MT-TL1 Mt_tRNA
6 ENSG00000198888 MT-ND1 protein_coding
> bmgns2 <- subset(bmgns, gene_biotype %in% "protein_coding")
> head(bmgns2)
ensembl_gene_id hgnc_symbol gene_biotype
6 ENSG00000198888 MT-ND1 protein_coding
10 ENSG00000198763 MT-ND2 protein_coding
16 ENSG00000198804 MT-CO1 protein_coding
19 ENSG00000198712 MT-CO2 protein_coding
21 ENSG00000228253 MT-ATP8 protein_coding
22 ENSG00000198899 MT-ATP6 protein_coding
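To get from there to your CSV, something like the following should work. It is only a sketch: the `coding` table is a two-row toy stand-in for bmgns2, and `my_genes` with its `symbol` column is a hypothetical stand-in for whatever read.csv() gives you, so adjust the names to match your file. Since "filter out" could mean either keeping or dropping the coding genes, both subsets are shown.

```r
# Toy stand-in for the protein-coding table (bmgns2 above)
coding <- data.frame(
  ensembl_gene_id = c("ENSG00000198888", "ENSG00000198763"),
  hgnc_symbol     = c("MT-ND1", "MT-ND2"),
  gene_biotype    = "protein_coding",
  stringsAsFactors = FALSE
)

# Hypothetical gene list; in practice something like
# my_genes <- read.csv("your_file.csv", stringsAsFactors = FALSE)
my_genes <- data.frame(
  symbol = c("MT-ND1", "MT-TF", "MT-ND2", "MT-RNR1"),
  stringsAsFactors = FALSE
)

# TRUE where the symbol appears in the protein-coding table
is_coding <- my_genes$symbol %in% coding$hgnc_symbol

coding_only    <- my_genes[is_coding, , drop = FALSE]   # keep coding genes
noncoding_only <- my_genes[!is_coding, , drop = FALSE]  # drop coding genes
```

Matching on the Ensembl gene ID instead of the symbol is safer if you still have that column, because symbols are not always unique or present.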
I just converted ENSG IDs to gene symbols with the EnsDb.Hsapiens.v86 package. Now, after saving the CSV file, I want to use the column of official gene symbols to filter out the coding genes. The following data frame is the CSV file. This is the gene list I want to filter. I would appreciate any help.