Question

How to filter protein-coding genes with EnsDb.Hsapiens.v86

beslinail • 0

I want to keep only the protein-coding genes in my redf.csv file. The gene identifiers in redf.csv are in the geneID or symbol column.

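library(EnsDb.Hsapiens.v86)  # also attaches ensembldb, which provides keys() for EnsDb
library(tidyverse)           # dplyr, tibble::rownames_to_column, readr::write_csv

# -- convert ENSG to gene symbol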

ens2sym <- AnnotationDbi::select(EnsDb.Hsapiens.v86, keys = keys(EnsDb.Hsapiens.v86),
                                 columns = c("SYMBOL"))
resdf <- resdf %>%
  rownames_to_column() %>%
  mutate(GENEID = gsub(rowname, pattern = "\\..+", replacement = "")) %>%
  dplyr::select(-rowname) %>%
  inner_join(y = ens2sym, by = "GENEID")
View(resdf)

resdf %>%
  dplyr::filter(padj < .05 & log2FoldChange > 2) %>%
  write_csv(file = "T4_vs_T1.overexp.csv")
View(resdf)

3.0 years ago · beslinail

score 0 · Answer 1 · 2022-03-14

You shouldn't use AnnotationDbi::select with an EnsDb object. There is a specific method for EnsDb objects that you are avoiding.

> showMethods(select)
Function: select (package AnnotationDbi)
x="ChipDb"
x="EnsDb"
x="GODb"
x="Mart"
x="OrgDb"
x="OrthologyDb"
x="ReactomeDb"
x="TxDb"

> selectMethod(select, c(x = "EnsDb"))
Method Definition:

function (x, keys, columns, keytype, ...) 
{
    if (missing(keys)) 
        keys <- NULL
    if (missing(columns)) 
        columns <- NULL
    if (missing(keytype)) 
        keytype <- "GENEID"
    return(.select(x = x, keys = keys, columns = columns, keytype = keytype, 
        ...))
}
<bytecode: 0x000000004a875eb8>
<environment: namespace:ensembldb>

Signatures:
        x      
target  "EnsDb"
defined "EnsDb"

I don't grok tidyverse, so I am not sure what all your code is meant to do, but it's simple enough to do using regular R code. But do note that Ensembl v86 is really old and might not be what you want (see the AnnotationHub package and its vignette). In addition, the EnsDb packages are meant in general to provide genomic locations rather than mapping ID to symbol, so biomaRt is arguably a better resource.

> gns <- genes(EnsDb.Hsapiens.v86)
> gns2 <- gns[gns$gene_biotype %in% "protein_coding"]
> gns2
GRanges object with 22285 ranges and 6 metadata columns:
                  seqnames            ranges strand |         gene_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSG00000186092        1       69091-70008      + | ENSG00000186092
  ENSG00000279928        1     182393-184158      + | ENSG00000279928
  ENSG00000279457        1     184923-200322      - | ENSG00000279457
  ENSG00000278566        1     450740-451678      - | ENSG00000278566
  ENSG00000273547        1     685716-686654      - | ENSG00000273547
              ...      ...               ...    ... .             ...
  ENSG00000205916        Y 24833843-24907040      + | ENSG00000205916
  ENSG00000185894        Y 25030901-25062548      - | ENSG00000185894
  ENSG00000279115        Y 25307702-25308107      + | ENSG00000279115
  ENSG00000280301        Y 25463994-25473714      + | ENSG00000280301
  ENSG00000172288        Y 25622162-25624902      + | ENSG00000172288
                    gene_name   gene_biotype seq_coord_system      symbol
                  <character>    <character>      <character> <character>
  ENSG00000186092       OR4F5 protein_coding       chromosome       OR4F5
  ENSG00000279928  FO538757.2 protein_coding       chromosome  FO538757.2
  ENSG00000279457  FO538757.1 protein_coding       chromosome  FO538757.1
  ENSG00000278566      OR4F29 protein_coding       chromosome      OR4F29
  ENSG00000273547      OR4F16 protein_coding       chromosome      OR4F16
              ...         ...            ...              ...         ...
  ENSG00000205916        DAZ4 protein_coding       chromosome        DAZ4
  ENSG00000185894       BPY2C protein_coding       chromosome       BPY2C
  ENSG00000279115  AC006386.1 protein_coding       chromosome  AC006386.1
  ENSG00000280301  AC006328.1 protein_coding       chromosome  AC006328.1
  ENSG00000172288        CDY1 protein_coding       chromosome        CDY1
                             entrezid
                               <list>
  ENSG00000186092               79501
  ENSG00000279928 107984078,102725121
  ENSG00000279457                <NA>
  ENSG00000278566  729759,81399,26683
  ENSG00000273547         81399,26683
              ...                 ...
  ENSG00000205916          57135,1617
  ENSG00000185894              442868
  ENSG00000279115                <NA>
  ENSG00000280301                <NA>
  ENSG00000172288                9085
  -------
  seqinfo: 357 sequences from GRCh38 genome
> head(mcols(gns2)[,1:2])
DataFrame with 6 rows and 2 columns
                        gene_id   gene_name
                    <character> <character>
ENSG00000186092 ENSG00000186092       OR4F5
ENSG00000279928 ENSG00000279928  FO538757.2
ENSG00000279457 ENSG00000279457  FO538757.1
ENSG00000278566 ENSG00000278566      OR4F29
ENSG00000273547 ENSG00000273547      OR4F16
ENSG00000187634 ENSG00000187634      SAMD11
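
To apply this to a results table like the one in the question, one option is to strip the version suffix from the Ensembl IDs and keep only the rows whose ID is in the protein-coding set. Here is a minimal sketch in base R. The objects `pc_ids` and `res` are small stand-ins so that it runs without the annotation package. In practice `pc_ids` would be `gns2$gene_id`, and `res` would be your own table with versioned Ensembl IDs as row names (an assumption based on the question's code). The version numbers and statistics below are made up.

```r
# Stand-in for gns2$gene_id (protein-coding Ensembl gene IDs)
pc_ids <- c("ENSG00000186092", "ENSG00000187634")

# Stand-in results table; row names are versioned Ensembl IDs (hypothetical values)
res <- data.frame(
  log2FoldChange = c(2.5, -1.2, 3.1),
  padj = c(0.01, 0.20, 0.03),
  row.names = c("ENSG00000186092.4", "ENSG00000210049.1", "ENSG00000187634.11")
)

# Drop the version suffix (".4", ".11", ...) so the IDs match the annotation
res$GENEID <- sub("\\..+$", "", rownames(res))

# Keep only the rows whose gene is protein-coding
res_pc <- res[res$GENEID %in% pc_ids, ]
res_pc
```

The `%in%` test only needs the IDs, so no join is required unless you also want the gene names added as a column.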

But probably better to use biomaRt, because it's intended for this sort of annotation lookup.

> library(biomaRt)
> mart <- useMart("ensembl","hsapiens_gene_ensembl")
> bmgns <- getBM(c("ensembl_gene_id","hgnc_symbol", "gene_biotype"), mart = mart)
> head(bmgns)
  ensembl_gene_id hgnc_symbol   gene_biotype
1 ENSG00000210049       MT-TF        Mt_tRNA
2 ENSG00000211459     MT-RNR1        Mt_rRNA
3 ENSG00000210077       MT-TV        Mt_tRNA
4 ENSG00000210082     MT-RNR2        Mt_rRNA
5 ENSG00000209082      MT-TL1        Mt_tRNA
6 ENSG00000198888      MT-ND1 protein_coding
> bmgns2 <- subset(bmgns, gene_biotype %in% "protein_coding")
> head(bmgns2)
   ensembl_gene_id hgnc_symbol   gene_biotype
6  ENSG00000198888      MT-ND1 protein_coding
10 ENSG00000198763      MT-ND2 protein_coding
16 ENSG00000198804      MT-CO1 protein_coding
19 ENSG00000198712      MT-CO2 protein_coding
21 ENSG00000228253     MT-ATP8 protein_coding
22 ENSG00000198899     MT-ATP6 protein_coding
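
If you also want the symbol and biotype attached to your results, an inner join against the protein-coding table does the filtering and the annotation in one step. This is a rough sketch with stand-in data. The `bmgns2` rows are copied from the output above, while `res` and its `geneID` column are hypothetical and only mirror the column name mentioned in the question.

```r
# Stand-in for bmgns2 (rows copied from the biomaRt output above)
bmgns2 <- data.frame(
  ensembl_gene_id = c("ENSG00000198888", "ENSG00000198763"),
  hgnc_symbol = c("MT-ND1", "MT-ND2"),
  gene_biotype = "protein_coding"
)

# Stand-in results table with a geneID column (hypothetical values)
res <- data.frame(
  geneID = c("ENSG00000198888", "ENSG00000210049", "ENSG00000198763"),
  log2FoldChange = c(2.4, 0.3, -1.8),
  padj = c(0.001, 0.60, 0.04)
)

# merge() defaults to an inner join: rows without a protein-coding match
# are dropped, and the symbol and biotype columns are added
res_pc <- merge(res, bmgns2, by.x = "geneID", by.y = "ensembl_gene_id")
res_pc
```

If your table only has symbols, the same call with `by.x = "symbol", by.y = "hgnc_symbol"` should work, but symbols are not guaranteed to be unique, so joining on the Ensembl ID is the safer choice when you have it.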