How to filter the coding protein genes by EnsDb.Hsapiens.v86
beslinail • 0
@d4a633b9
Last seen 2.3 years ago
Turkey

I want to keep only the protein-coding genes in my redf.csv file. The genes in that file are listed in the GENEID and SYMBOL columns.

library(EnsDb.Hsapiens.v86)
library(tidyverse)

# -- convert ENSG to gene symbol

ens2sym <- AnnotationDbi::select(EnsDb.Hsapiens.v86, keys = keys(EnsDb.Hsapiens.v86),
                                 columns = c("SYMBOL"))
resdf <- resdf %>%
  rownames_to_column() %>%
  mutate(GENEID = gsub(rowname, pattern = "\\..+", replacement = "")) %>%
  dplyr::select(-rowname) %>%
  inner_join(y = ens2sym, by = "GENEID")
View(resdf)

resdf %>%
  dplyr::filter(padj < .05 & log2FoldChange > 2) %>%
  write_csv(file = "T4_vs_T1.overexp.csv")
View(resdf)

@james-w-macdonald-5106
Last seen 7 hours ago
United States

You shouldn't use AnnotationDbi::select with an EnsDb object. The ensembldb package defines a select method specifically for EnsDb objects, and you are bypassing it.

> showMethods(select)
Function: select (package AnnotationDbi)
x="ChipDb"
x="EnsDb"
x="GODb"
x="Mart"
x="OrgDb"
x="OrthologyDb"
x="ReactomeDb"
x="TxDb"

> selectMethod(select, c(x = "EnsDb"))
Method Definition:

function (x, keys, columns, keytype, ...) 
{
    if (missing(keys)) 
        keys <- NULL
    if (missing(columns)) 
        columns <- NULL
    if (missing(keytype)) 
        keytype <- "GENEID"
    return(.select(x = x, keys = keys, columns = columns, keytype = keytype, 
        ...))
}
<bytecode: 0x000000004a875eb8>
<environment: namespace:ensembldb>

Signatures:
        x      
target  "EnsDb"
defined "EnsDb"

I don't grok tidyverse, so I am not sure what all your code is meant to do, but it's simple enough to do using regular R code. But do note that Ensembl v86 is really old and might not be what you want (see the AnnotationHub package and its vignette). In addition, the EnsDb packages are meant in general to provide genomic locations rather than mapping ID to symbol, so biomaRt is arguably a better resource.

> gns <- genes(EnsDb.Hsapiens.v86)
> gns2 <- gns[gns$gene_biotype %in% "protein_coding"]
> gns2
GRanges object with 22285 ranges and 6 metadata columns:
                  seqnames            ranges strand |         gene_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSG00000186092        1       69091-70008      + | ENSG00000186092
  ENSG00000279928        1     182393-184158      + | ENSG00000279928
  ENSG00000279457        1     184923-200322      - | ENSG00000279457
  ENSG00000278566        1     450740-451678      - | ENSG00000278566
  ENSG00000273547        1     685716-686654      - | ENSG00000273547
              ...      ...               ...    ... .             ...
  ENSG00000205916        Y 24833843-24907040      + | ENSG00000205916
  ENSG00000185894        Y 25030901-25062548      - | ENSG00000185894
  ENSG00000279115        Y 25307702-25308107      + | ENSG00000279115
  ENSG00000280301        Y 25463994-25473714      + | ENSG00000280301
  ENSG00000172288        Y 25622162-25624902      + | ENSG00000172288
                    gene_name   gene_biotype seq_coord_system      symbol
                  <character>    <character>      <character> <character>
  ENSG00000186092       OR4F5 protein_coding       chromosome       OR4F5
  ENSG00000279928  FO538757.2 protein_coding       chromosome  FO538757.2
  ENSG00000279457  FO538757.1 protein_coding       chromosome  FO538757.1
  ENSG00000278566      OR4F29 protein_coding       chromosome      OR4F29
  ENSG00000273547      OR4F16 protein_coding       chromosome      OR4F16
              ...         ...            ...              ...         ...
  ENSG00000205916        DAZ4 protein_coding       chromosome        DAZ4
  ENSG00000185894       BPY2C protein_coding       chromosome       BPY2C
  ENSG00000279115  AC006386.1 protein_coding       chromosome  AC006386.1
  ENSG00000280301  AC006328.1 protein_coding       chromosome  AC006328.1
  ENSG00000172288        CDY1 protein_coding       chromosome        CDY1
                             entrezid
                               <list>
  ENSG00000186092               79501
  ENSG00000279928 107984078,102725121
  ENSG00000279457                <NA>
  ENSG00000278566  729759,81399,26683
  ENSG00000273547         81399,26683
              ...                 ...
  ENSG00000205916          57135,1617
  ENSG00000185894              442868
  ENSG00000279115                <NA>
  ENSG00000280301                <NA>
  ENSG00000172288                9085
  -------
  seqinfo: 357 sequences from GRCh38 genome
> head(mcols(gns2)[,1:2])
DataFrame with 6 rows and 2 columns
                        gene_id   gene_name
                    <character> <character>
ENSG00000186092 ENSG00000186092       OR4F5
ENSG00000279928 ENSG00000279928  FO538757.2
ENSG00000279457 ENSG00000279457  FO538757.1
ENSG00000278566 ENSG00000278566      OR4F29
ENSG00000273547 ENSG00000273547      OR4F16
ENSG00000187634 ENSG00000187634      SAMD11
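
To apply this to a results table, one option is to strip the version suffix from the Ensembl IDs and keep the rows whose ID is in the protein-coding set. The following is only a sketch in base R: `pc_ids` is a small stand-in for `gns2$gene_id` (the three IDs are taken from the output above), and the `res` table, its values, and the version suffixes are made up for illustration.

```r
# Stand-in for gns2$gene_id; in real use this would be the full vector
# of protein-coding gene IDs from the EnsDb object.
pc_ids <- c("ENSG00000186092", "ENSG00000279928", "ENSG00000187634")

# Hypothetical results table with versioned Ensembl IDs as row names.
# The statistics and the version suffixes are invented for this example.
res <- data.frame(
  log2FoldChange = c(2.5, 3.1, 4.0),
  padj = c(0.001, 0.020, 0.004),
  row.names = c("ENSG00000186092.4", "ENSG00000210049.1", "ENSG00000187634.11")
)

# Drop the ".<version>" suffix so the IDs match the annotation.
res$GENEID <- sub("\\..*$", "", rownames(res))

# Keep only the rows whose gene ID is in the protein-coding set.
res_pc <- res[res$GENEID %in% pc_ids, ]
print(res_pc)
```

Matching on the Ensembl ID rather than the symbol avoids problems with symbols that are missing or not unique.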

But probably better to use biomaRt, because it's intended for this sort of annotation lookup.

> library(biomaRt)
> mart <- useMart("ensembl","hsapiens_gene_ensembl")
> bmgns <- getBM(c("ensembl_gene_id","hgnc_symbol", "gene_biotype"), mart = mart)
> head(bmgns)
  ensembl_gene_id hgnc_symbol   gene_biotype
1 ENSG00000210049       MT-TF        Mt_tRNA
2 ENSG00000211459     MT-RNR1        Mt_rRNA
3 ENSG00000210077       MT-TV        Mt_tRNA
4 ENSG00000210082     MT-RNR2        Mt_rRNA
5 ENSG00000209082      MT-TL1        Mt_tRNA
6 ENSG00000198888      MT-ND1 protein_coding
> bmgns2 <- subset(bmgns, gene_biotype %in% "protein_coding")
> head(bmgns2)
   ensembl_gene_id hgnc_symbol   gene_biotype
6  ENSG00000198888      MT-ND1 protein_coding
10 ENSG00000198763      MT-ND2 protein_coding
16 ENSG00000198804      MT-CO1 protein_coding
19 ENSG00000198712      MT-CO2 protein_coding
21 ENSG00000228253     MT-ATP8 protein_coding
22 ENSG00000198899     MT-ATP6 protein_coding
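
Once the biotype table is available, it can be joined to the results and subset in one pass. This is a rough sketch in base R, not the only way to do it: the three `bmgns` rows are copied from the `head(bmgns)` output above so that the example runs without a network connection, while the `resdf` table, its values, and the ID `ENSG00000000000` are placeholders invented for illustration.

```r
# A few annotation rows copied from the head(bmgns) output; in real use
# this would be the full table returned by getBM().
bmgns <- data.frame(
  ensembl_gene_id = c("ENSG00000210049", "ENSG00000211459", "ENSG00000198888"),
  hgnc_symbol = c("MT-TF", "MT-RNR1", "MT-ND1"),
  gene_biotype = c("Mt_tRNA", "Mt_rRNA", "protein_coding"),
  stringsAsFactors = FALSE
)

# Hypothetical results table; the statistics are invented and
# ENSG00000000000 is a made-up ID with no annotation.
resdf <- data.frame(
  GENEID = c("ENSG00000198888", "ENSG00000210049", "ENSG00000000000"),
  log2FoldChange = c(2.4, 3.0, 5.1),
  padj = c(0.001, 0.010, 0.002),
  stringsAsFactors = FALSE
)

# Left join: keep every result row and attach symbol and biotype where known.
annotated <- merge(resdf, bmgns, by.x = "GENEID", by.y = "ensembl_gene_id",
                   all.x = TRUE)

# %in% treats an NA biotype as FALSE, so unannotated genes are dropped too.
pc <- subset(annotated, gene_biotype %in% "protein_coding")
print(pc)
```

The filtered table can then be written out with `write.csv(pc, "some_file.csv", row.names = FALSE)`, where the file name is whatever suits the analysis.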

I only converted the ENSG IDs to gene symbols with the EnsDb.Hsapiens.v86 package. Now, after saving the CSV file, I want to use the official gene symbol column to keep only the protein-coding genes. The data frame below is the CSV file, and it is the gene list I want to filter. I would appreciate any help.

view(df)
head(df)
     baseMean log2FoldChange      lfcSE      stat        pvalue          padj
1 3684.303688       2.048604 0.08119188 25.231642 1.801430e-140 4.744763e-138
2  156.318095       3.781286 0.25573414 14.786002  1.803524e-49  4.228798e-48
3    4.297691       2.417489 0.30488523  7.929176  2.206049e-15  1.113738e-14
4  195.821695       5.662296 0.30571405 18.521544  1.384092e-76  7.896623e-75
5    1.170333       2.998214 0.38135253  7.862055  3.778827e-15  1.879027e-14
6 2418.549784       4.347788 0.21348879 20.365418  3.389484e-92  3.022825e-90
           GENEID SYMBOL
1 ENSG00000001617 SEMA3F
2 ENSG00000002726   AOC1
3 ENSG00000004846  ABCB5
4 ENSG00000004848    ARX
5 ENSG00000005073 HOXA11
6 ENSG00000006016  CRLF1
