Question: Pick coding vs non-coding from Ensembl data set
jdmentze wrote (10 months ago):

I have the ensembldb and EnsDb.Hsapiens.v86 packages, as well as a data set with 60,000+ rows of genes. However, I want to work with only the coding genes or only the non-coding genes at any one time. I am unsure which function to use and how to call it. How would I do this (in R) so that I can create a new data set for each group for retrieval and analysis? For example:

R> data_coding <- "function"

R> data_noncoding <- "function"

Answer: Pick coding vs non-coding from Ensembl data set
Johannes Rainer (Italy) wrote (10 months ago):

You can simply filter the complete EnsDb database so that it contains only protein-coding genes, or only all other genes (note: the latter includes miRNA genes, lincRNAs, pseudogenes, snoRNA, scaRNA, sRNA, scRNA, rRNA ...).

> library(EnsDb.Hsapiens.v86)
> ## Filter the EnsDb database for protein coding genes
> edb_coding <- filter(EnsDb.Hsapiens.v86, filter = ~ gene_biotype == "protein_coding")
> ## Any query will now extract only information for protein coding genes
> genes(edb_coding)
GRanges object with 22285 ranges and 6 metadata columns:
                  seqnames            ranges strand |         gene_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSG00000186092        1       69091-70008      + | ENSG00000186092
  ENSG00000279928        1     182393-184158      + | ENSG00000279928
              ...      ...               ...    ... .             ...
  ENSG00000280301        Y 25463994-25473714      + | ENSG00000280301
  ENSG00000172288        Y 25622162-25624902      + | ENSG00000172288
                    gene_name   gene_biotype seq_coord_system      symbol
                  <character>    <character>      <character> <character>
  ENSG00000186092       OR4F5 protein_coding       chromosome       OR4F5
  ENSG00000279928  FO538757.2 protein_coding       chromosome  FO538757.2
              ...         ...            ...              ...         ...
  ENSG00000280301  AC006328.1 protein_coding       chromosome  AC006328.1
  ENSG00000172288        CDY1 protein_coding       chromosome        CDY1
                                 entrezid
                                   <list>
  ENSG00000186092                   79501
  ENSG00000279928 c(107984078, 102725121)
              ...                     ...
  ENSG00000280301                      NA
  ENSG00000172288                    9085
  -------
  seqinfo: 287 sequences from GRCh38 genome
> ## Or getting all transcripts
> transcripts(edb_coding)
GRanges object with 158444 ranges and 6 metadata columns:
                  seqnames            ranges strand |           tx_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENST00000335137        1       69091-70008      + | ENST00000335137
  ENST00000624431        1     182393-184158      + | ENST00000624431
              ...      ...               ...    ... .             ...
  ENST00000361963        Y 25622162-25624338      + | ENST00000361963
  ENST00000306609        Y 25622162-25624902      + | ENST00000306609
                      tx_biotype tx_cds_seq_start tx_cds_seq_end
                     <character>        <integer>      <integer>
  ENST00000335137 protein_coding            69091          70008
  ENST00000624431 protein_coding           182709         184158
              ...            ...              ...            ...
  ENST00000361963 protein_coding         25622443       25624065
  ENST00000306609 protein_coding         25622443       25624527
                          gene_id         tx_name
                      <character>     <character>
  ENST00000335137 ENSG00000186092 ENST00000335137
  ENST00000624431 ENSG00000279928 ENST00000624431
              ...             ...             ...
  ENST00000361963 ENSG00000172288 ENST00000361963
  ENST00000306609 ENSG00000172288 ENST00000306609
  -------
  seqinfo: 287 sequences from GRCh38 genome

 

For the non-coding genes you can do the same with the condition negated:

> edb_noncoding <- filter(EnsDb.Hsapiens.v86, filter = ~ gene_biotype != "protein_coding")
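
Because "not protein_coding" lumps many different biotypes together, it can help to look at what the non-coding set actually contains before analysing it. The following is a minimal sketch, assuming the ensembldb and EnsDb.Hsapiens.v86 packages are installed; the biotype names passed to GeneBiotypeFilter are only examples and should be checked against the listGenebiotypes() output for your release.

```r
library(EnsDb.Hsapiens.v86)

## All gene biotypes present in this Ensembl release
listGenebiotypes(EnsDb.Hsapiens.v86)

## Number of genes per biotype among everything that is not protein coding
noncoding_genes <- genes(EnsDb.Hsapiens.v86,
                         filter = ~ gene_biotype != "protein_coding",
                         return.type = "data.frame",
                         columns = c("gene_id", "gene_biotype"))
biotype_counts <- sort(table(noncoding_genes$gene_biotype), decreasing = TRUE)
biotype_counts

## Restrict the database to selected non-coding classes only.
## "lincRNA" and "miRNA" are example values (assumption: present in this release).
edb_selected <- filter(EnsDb.Hsapiens.v86,
                       filter = GeneBiotypeFilter(c("lincRNA", "miRNA")))
```

If another attached package (for example dplyr) also provides a filter function, calling ensembldb::filter explicitly avoids picking up the wrong one.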

 

Also note that you can pass return.type = "DataFrame" to each of these functions (e.g. genes, transcripts, ...) to get the result as a DataFrame instead of the default GRanges object.
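
To split your own data set into the data_coding and data_noncoding objects from the question, one option is to pull the Ensembl gene IDs for each group and match them against your rows. This is only a sketch: my_data is a hypothetical stand-in for the 60,000+ row data set, and the column name gene_id (holding unversioned Ensembl gene IDs) is an assumption about how that data set is organised.

```r
library(EnsDb.Hsapiens.v86)

## Hypothetical stand-in for the real data set; "gene_id" is an assumed column name
my_data <- data.frame(
    gene_id = c("ENSG00000186092", "ENSG00000172288", "ENSG00000223972"),
    value = c(1.2, 3.4, 5.6),
    stringsAsFactors = FALSE
)

## Ensembl gene IDs of all protein coding genes
coding_ids <- genes(EnsDb.Hsapiens.v86,
                    filter = ~ gene_biotype == "protein_coding",
                    return.type = "data.frame",
                    columns = "gene_id")$gene_id

## Ensembl gene IDs of all other genes
noncoding_ids <- genes(EnsDb.Hsapiens.v86,
                       filter = ~ gene_biotype != "protein_coding",
                       return.type = "data.frame",
                       columns = "gene_id")$gene_id

## Subset the rows by group; IDs missing from this Ensembl release end up in neither
data_coding <- my_data[my_data$gene_id %in% coding_ids, , drop = FALSE]
data_noncoding <- my_data[my_data$gene_id %in% noncoding_ids, , drop = FALSE]
```

If your IDs carry a version suffix (e.g. ENSG00000186092.4), strip it first with something like sub("\\..*$", "", my_data$gene_id), since the database stores the IDs without it.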

cheers, jo
