lncRNA Genes in a Dataset
sushimoto • 0
@2fc02fda
Last seen 2.0 years ago
Turkey

I am trying to extract the genes that encode lncRNAs from a large dataset. It contains about 40,000 genes identified by Ensembl ID, so searching for each one on the website is not practical. Is there a way to extract them in R? Thank you in advance.

ncRNAtools RNASeqData • 1.1k views
@james-w-macdonald-5106
Last seen 10 hours ago
United States

You don't say the species, so I will presume human. The easiest way to get these data is from an EnsDb package, most of which are available through AnnotationHub. Here's how you would get one and filter it down to just the lncRNAs.

> library(AnnotationHub)
> hub <- AnnotationHub()
> query(hub, c("homo sapiens","ensdb"))
AnnotationHub with 20 records
# snapshotDate(): 2021-10-20
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

            title                             
  AH53211 | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715 | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681 | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757 | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773 | Ensembl 91 EnsDb for Homo Sapiens 
  ...       ...                               
  AH83216 | Ensembl 101 EnsDb for Homo sapiens
  AH89180 | Ensembl 102 EnsDb for Homo sapiens
  AH89426 | Ensembl 103 EnsDb for Homo sapiens
  AH95744 | Ensembl 104 EnsDb for Homo sapiens
  AH98047 | Ensembl 105 EnsDb for Homo sapiens

## we'll use the latest version
> ensdb <- hub[["AH98047"]]
loading from cache
require("ensembldb")
> gns <- genes(ensdb)
> gns
GRanges object with 69329 ranges and 9 metadata columns:
                  seqnames            ranges strand |         gene_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSG00000223972        1       11869-14409      + | ENSG00000223972
  ENSG00000227232        1       14404-29570      - | ENSG00000227232
  ENSG00000278267        1       17369-17436      - | ENSG00000278267
  ENSG00000243485        1       29554-31109      + | ENSG00000243485
  ENSG00000284332        1       30366-30503      + | ENSG00000284332
              ...      ...               ...    ... .             ...
  ENSG00000224240        Y 26549425-26549743      + | ENSG00000224240
  ENSG00000227629        Y 26586642-26591601      - | ENSG00000227629
  ENSG00000237917        Y 26594851-26634652      - | ENSG00000237917
  ENSG00000231514        Y 26626520-26627159      - | ENSG00000231514
  ENSG00000235857        Y 56855244-56855488      + | ENSG00000235857
                    gene_name           gene_biotype seq_coord_system
                  <character>            <character>      <character>
  ENSG00000223972     DDX11L1 transcribed_unproces..       chromosome
  ENSG00000227232      WASH7P unprocessed_pseudogene       chromosome
  ENSG00000278267   MIR6859-1                  miRNA       chromosome
  ENSG00000243485 MIR1302-2HG                 lncRNA       chromosome
  ENSG00000284332   MIR1302-2                  miRNA       chromosome
              ...         ...                    ...              ...
  ENSG00000224240     CYCSP49   processed_pseudogene       chromosome
  ENSG00000227629  SLC25A15P1 unprocessed_pseudogene       chromosome
  ENSG00000237917     PARP4P1 unprocessed_pseudogene       chromosome
  ENSG00000231514      CCNQP2   processed_pseudogene       chromosome
  ENSG00000235857     CTBP2P1   processed_pseudogene       chromosome
                             description   gene_id_version canonical_transcript
                             <character>       <character>          <character>
  ENSG00000223972 DEAD/H-box helicase .. ENSG00000223972.5      ENST00000450305
  ENSG00000227232 WASP family homolog .. ENSG00000227232.5      ENST00000488147
  ENSG00000278267 microRNA 6859-1 [Sou.. ENSG00000278267.1      ENST00000619216
  ENSG00000243485 MIR1302-2 host gene .. ENSG00000243485.5      ENST00000473358
  ENSG00000284332 microRNA 1302-2 [Sou.. ENSG00000284332.1      ENST00000607096
              ...                    ...               ...                  ...
  ENSG00000224240 CYCS pseudogene 49 [.. ENSG00000224240.1      ENST00000420810
  ENSG00000227629 solute carrier famil.. ENSG00000227629.1      ENST00000456738
  ENSG00000237917 poly(ADP-ribose) pol.. ENSG00000237917.1      ENST00000435945
  ENSG00000231514 CCNQ pseudogene 2 [S.. ENSG00000231514.1      ENST00000435741
  ENSG00000235857 CTBP2 pseudogene 1 [.. ENSG00000235857.1      ENST00000431853
                       symbol                          entrezid
                  <character>                            <list>
  ENSG00000223972     DDX11L1 102725121,100287596,100287102,...
  ENSG00000227232      WASH7P                              <NA>
  ENSG00000278267   MIR6859-1                         102466751
  ENSG00000243485 MIR1302-2HG                              <NA>
  ENSG00000284332   MIR1302-2                         100302278
              ...         ...                               ...
  ENSG00000224240     CYCSP49                              <NA>
  ENSG00000227629  SLC25A15P1                              <NA>
  ENSG00000237917     PARP4P1                              <NA>
  ENSG00000231514      CCNQP2                              <NA>
  ENSG00000235857     CTBP2P1                              <NA>
  -------
  seqinfo: 456 sequences from GRCh38 genome

## note the gene_biotype column above

> lncs <- gns[gns$gene_biotype %in% "lncRNA"]
> lncs
GRanges object with 18812 ranges and 9 metadata columns:
                  seqnames            ranges strand |         gene_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSG00000243485        1       29554-31109      + | ENSG00000243485
  ENSG00000237613        1       34554-36081      - | ENSG00000237613
  ENSG00000238009        1      89295-133723      - | ENSG00000238009
  ENSG00000239945        1       89551-91105      - | ENSG00000239945
  ENSG00000239906        1     139790-140339      - | ENSG00000239906
              ...      ...               ...    ... .             ...
  ENSG00000228296        Y 25063083-25099892      - | ENSG00000228296
  ENSG00000223641        Y 25182277-25213389      - | ENSG00000223641
  ENSG00000228786        Y 25378300-25394719      - | ENSG00000228786
  ENSG00000240450        Y 25482908-25486705      + | ENSG00000240450
  ENSG00000231141        Y 25728490-25733388      + | ENSG00000231141
                     gene_name gene_biotype seq_coord_system
                   <character>  <character>      <character>
  ENSG00000243485  MIR1302-2HG       lncRNA       chromosome
  ENSG00000237613      FAM138A       lncRNA       chromosome
  ENSG00000238009                    lncRNA       chromosome
  ENSG00000239945                    lncRNA       chromosome
  ENSG00000239906                    lncRNA       chromosome
              ...          ...          ...              ...
  ENSG00000228296       TTTY4C       lncRNA       chromosome
  ENSG00000223641      TTTY17C       lncRNA       chromosome
  ENSG00000228786 LINC00266-4P       lncRNA       chromosome
  ENSG00000240450     CSPG4P1Y       lncRNA       chromosome
  ENSG00000231141        TTTY3       lncRNA       chromosome
                             description   gene_id_version canonical_transcript
                             <character>       <character>          <character>
  ENSG00000243485 MIR1302-2 host gene .. ENSG00000243485.5      ENST00000473358
  ENSG00000237613 family with sequence.. ENSG00000237613.2      ENST00000417324
  ENSG00000238009       novel transcript ENSG00000238009.6      ENST00000477740
  ENSG00000239945       novel transcript ENSG00000239945.1      ENST00000495576
  ENSG00000239906       novel transcript ENSG00000239906.1      ENST00000493797
              ...                    ...               ...                  ...
  ENSG00000228296 testis-specific tran.. ENSG00000228296.1      ENST00000456123
  ENSG00000223641 testis-specific tran.. ENSG00000223641.2      ENST00000421387
  ENSG00000228786 long intergenic non-.. ENSG00000228786.5      ENST00000427373
  ENSG00000240450 CSPG4 pseudogene 1 Y.. ENSG00000240450.1      ENST00000306641
  ENSG00000231141 testis-specific tran.. ENSG00000231141.1      ENST00000417334
                        symbol entrezid
                   <character>   <list>
  ENSG00000243485  MIR1302-2HG     <NA>
  ENSG00000237613      FAM138A   645520
  ENSG00000238009                  <NA>
  ENSG00000239945                  <NA>
  ENSG00000239906                  <NA>
              ...          ...      ...
  ENSG00000228296       TTTY4C   474150
  ENSG00000223641      TTTY17C     <NA>
  ENSG00000228786 LINC00266-4P     <NA>
  ENSG00000240450     CSPG4P1Y   114758
  ENSG00000231141        TTTY3   114760
  -------
  seqinfo: 456 sequences from GRCh38 genome
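
To tie this back to your own list of roughly 40,000 IDs, you can match them against the gene_id column. Here is a minimal sketch in base R, under a few assumptions: your IDs sit in a character vector (the hypothetical `my_ids` below), some of them may carry a version suffix such as `.5`, and the small `annot` data frame is only a stand-in for the real annotation so the example runs without downloading anything. The name `strip_version` is a helper made up for illustration.

```r
# Stand-in for the annotation, using genes and biotypes from the output above.
# With the real objects you would build it from the GRanges, e.g.
#   annot <- data.frame(gene_id = gns$gene_id, gene_biotype = gns$gene_biotype)
annot <- data.frame(
  gene_id = c("ENSG00000227232", "ENSG00000278267",
              "ENSG00000243485", "ENSG00000237613"),
  gene_biotype = c("unprocessed_pseudogene", "miRNA", "lncRNA", "lncRNA"),
  stringsAsFactors = FALSE
)

# Hypothetical IDs from your dataset; the last one is a made-up ID
# that is not in the annotation at all.
my_ids <- c("ENSG00000243485.5", "ENSG00000227232",
            "ENSG00000237613.2", "ENSG00000000000")

# Drop a trailing ".<digits>" version suffix, if present
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# IDs annotated as lncRNA
lnc_ids <- annot$gene_id[annot$gene_biotype %in% "lncRNA"]

# Keep only the IDs from your dataset that are lncRNAs
is_lnc <- strip_version(my_ids) %in% lnc_ids
my_lncs <- my_ids[is_lnc]
my_lncs
```

With the real objects, `lnc_ids` would simply be `lncs$gene_id`, and `is_lnc` can be used directly to subset the rows of a count matrix or results table that is ordered like `my_ids`.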

Thanks a lot!
