Question

Pick coding vs non-coding from Ensembl data set

jdmentze • 0

@jdmentze-18572

I have the ensembl and EnsDb.Hsapiens.v86 packages, as well as a data set with 60,000+ rows of genes. However, I want to focus on either the coding or the non-coding genes at a time. I am unsure which function to use and how to call it properly. How would I do this (in R) so that I can create a new data set for retrieval and analysis? For example:

R> data_coding <- "function"

R> data_noncoding <- "function"

R ensembl ensembldb noncoding rna protein coding • 986 views

Updated 5.4 years ago by Johannes Rainer ★ 2.0k • written 5.4 years ago by jdmentze • 0

score 0 · Answer 1 · 2018-11-29

You can simply filter the complete EnsDb database so that it contains either only protein coding genes or only all other genes (note: the latter set includes miRNA genes, lincRNAs, pseudogenes, snoRNA, scaRNA, sRNA, scRNA, rRNA ...).

> library(EnsDb.Hsapiens.v86)
> ## Filter the EnsDb database for protein coding genes
> edb_coding <- filter(EnsDb.Hsapiens.v86, filter = ~ gene_biotype == "protein_coding")
> ## Any query will now extract only information for protein coding genes
> genes(edb_coding)
GRanges object with 22285 ranges and 6 metadata columns:
                  seqnames            ranges strand |         gene_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSG00000186092        1       69091-70008      + | ENSG00000186092
  ENSG00000279928        1     182393-184158      + | ENSG00000279928
              ...      ...               ...    ... .             ...
  ENSG00000280301        Y 25463994-25473714      + | ENSG00000280301
  ENSG00000172288        Y 25622162-25624902      + | ENSG00000172288
                    gene_name   gene_biotype seq_coord_system      symbol
                  <character>    <character>      <character> <character>
  ENSG00000186092       OR4F5 protein_coding       chromosome       OR4F5
  ENSG00000279928  FO538757.2 protein_coding       chromosome  FO538757.2
              ...         ...            ...              ...         ...
  ENSG00000280301  AC006328.1 protein_coding       chromosome  AC006328.1
  ENSG00000172288        CDY1 protein_coding       chromosome        CDY1
                                 entrezid
                                   <list>
  ENSG00000186092                   79501
  ENSG00000279928 c(107984078, 102725121)
              ...                     ...
  ENSG00000280301                      NA
  ENSG00000172288                    9085
  -------
  seqinfo: 287 sequences from GRCh38 genome
> ## Or getting all transcripts
> transcripts(edb_coding)
GRanges object with 158444 ranges and 6 metadata columns:
                  seqnames            ranges strand |           tx_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENST00000335137        1       69091-70008      + | ENST00000335137
  ENST00000624431        1     182393-184158      + | ENST00000624431
              ...      ...               ...    ... .             ...
  ENST00000361963        Y 25622162-25624338      + | ENST00000361963
  ENST00000306609        Y 25622162-25624902      + | ENST00000306609
                      tx_biotype tx_cds_seq_start tx_cds_seq_end
                     <character>        <integer>      <integer>
  ENST00000335137 protein_coding            69091          70008
  ENST00000624431 protein_coding           182709         184158
              ...            ...              ...            ...
  ENST00000361963 protein_coding         25622443       25624065
  ENST00000306609 protein_coding         25622443       25624527
                          gene_id         tx_name
                      <character>     <character>
  ENST00000335137 ENSG00000186092 ENST00000335137
  ENST00000624431 ENSG00000279928 ENST00000624431
              ...             ...             ...
  ENST00000361963 ENSG00000172288 ENST00000361963
  ENST00000306609 ENSG00000172288 ENST00000306609
  -------
  seqinfo: 287 sequences from GRCh38 genome

For the non-coding genes you can do it analogously:

> edb_noncoding <- filter(EnsDb.Hsapiens.v86, filter = ~ gene_biotype != "protein_coding")
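Since "not protein_coding" lumps many biotypes together, it may help to look at what is actually in the database before filtering. The following is a minimal sketch, not part of the original answer: it lists the gene biotypes with listGenebiotypes and then passes a GeneBiotypeFilter directly to genes. The choice of "lincRNA" and "miRNA" is only an illustration and assumes those biotype names exist in this release; check the printed list first.

```r
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

# List every gene biotype present in this Ensembl release
biotypes <- listGenebiotypes(edb)
print(biotypes)

# Everything except "protein_coding" falls into the non-coding set
noncoding_biotypes <- setdiff(biotypes, "protein_coding")

# Narrower alternative: keep only selected biotypes.
# "lincRNA" and "miRNA" are assumed names; intersect() drops any
# that are not present in `biotypes`.
wanted <- intersect(c("lincRNA", "miRNA"), biotypes)
genes_selected <- genes(edb, filter = GeneBiotypeFilter(wanted))

# Number of genes per selected biotype
table(genes_selected$gene_biotype)
```

Passing the filter to genes leaves the database object itself unfiltered, which can be convenient if you want several different subsets in one session.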

Also note that you can use return.type = "DataFrame" in each function (e.g. genes, transcripts, ...) to extract the information as a DataFrame instead of the default GRanges.
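To get from the annotation to the data_coding / data_noncoding objects asked for in the question, you can match gene IDs against your own table. The sketch below uses base R only and makes several assumptions: your data is a data.frame called my_data with Ensembl gene IDs in a column named gene_id (both names are hypothetical), and the small annot table stands in for what genes(EnsDb.Hsapiens.v86, return.type = "data.frame") would return. The two ENSG IDs are taken from the output above; the TOY_ IDs are made up for illustration.

```r
# Toy annotation table standing in for the result of
#   genes(EnsDb.Hsapiens.v86, return.type = "data.frame")
# Only the two columns needed here are shown.
annot <- data.frame(
  gene_id = c("ENSG00000186092", "ENSG00000172288", "TOY_NONCODING_1"),
  gene_biotype = c("protein_coding", "protein_coding", "lincRNA"),
  stringsAsFactors = FALSE
)

# Hypothetical user data: one row per gene, Ensembl gene IDs in `gene_id`
my_data <- data.frame(
  gene_id = c("ENSG00000172288", "TOY_NONCODING_1",
              "ENSG00000186092", "TOY_UNKNOWN_1"),
  value = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Gene IDs per class, taken from the annotation
coding_ids <- annot$gene_id[annot$gene_biotype == "protein_coding"]
noncoding_ids <- annot$gene_id[annot$gene_biotype != "protein_coding"]

# Subset the user data by matching IDs
data_coding <- my_data[my_data$gene_id %in% coding_ids, ]
data_noncoding <- my_data[my_data$gene_id %in% noncoding_ids, ]

print(data_coding)
print(data_noncoding)
```

Rows whose ID is not found in the annotation (TOY_UNKNOWN_1 here) end up in neither subset. The gene IDs in the EnsDb output above carry no version suffix, so IDs such as "ENSG00000186092.4" in your own data would need the suffix stripped before matching.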

cheers, jo