Question

how to get exons of multiple genes

0

Entering edit mode

pegah.taklifi ▴ 30

@pegahtaklifi-24293

Last seen 3.2 years ago

Iran

when I try to get exons of a certain gene I run a code like this

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

exon<-transcriptsBy(txdb,by="gene") [[geneID]]

where geneID is the ENTREZ ID of gene. this works properly, but what can I do if I want to get exons of multiple genes ? I know I can get exons of each gene separately and then merge them but I'm looking for a more efficient way. I tried this code

exons<-transcriptsBy(txdb,by="gene") [geneVec]

where geneVec is a vector of gene IDs and got this error

Error: subscript contains invalid names

software error • 3.2k views

ADD COMMENT • link 4.8 years ago • updated 4.7 years ago pegah.taklifi ▴ 30

0

Entering edit mode

pegah.taklifi ▴ 30

@pegahtaklifi-24293

Last seen 3.2 years ago

Iran

I understand now that this code was actually correct

 exons<-exonsBy(txdb , by="gene")[geneVec]

but there was a problem with my gene IDs vector. I used this code to convert gene symbols list to gene ENTREZ IDs which gave NA for some genes or incorrect IDs

geneVec<-as.vector(mapIds(org.Hs.eg.db, s, 'ENTREZID', 'SYMBOL') )

ADD COMMENT • link 4.7 years ago pegah.taklifi ▴ 30

score 3 · Accepted Answer · 2020-10-06

What you are doing will give you the transcript boundaries, not the exon boundaries. So you won't be getting anything I would normally consider exons. If you want the exons, you need to use exonsBy. And how you use that depends on what you want. You can get all the exons by transcript, or all the exons by gene. In your case you should probably do it by gene.


> ex.gn <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, "gene")
> ex.gn
GRangesList object of length 27363:
$`1`
GRanges object with 29 ranges and 2 metadata columns:
       seqnames            ranges strand |   exon_id   exon_name
          <Rle>         <IRanges>  <Rle> | <integer> <character>
   [1]    chr19 58345178-58347029      - |    573605        <NA>
   [2]    chr19 58345183-58347029      - |    573606        <NA>
   [3]    chr19 58346854-58347029      - |    573607        <NA>
   [4]    chr19 58346858-58347029      - |    573608        <NA>
   [5]    chr19 58346860-58347029      - |    573609        <NA>
   ...      ...               ...    ... .       ...         ...
  [25]    chr19 58357585-58357649      - |    573632        <NA>
  [26]    chr19 58357952-58358286      - |    573633        <NA>
  [27]    chr19 58358489-58358585      - |    573634        <NA>
  [28]    chr19 58359197-58359323      - |    573635        <NA>
  [29]    chr19 58362677-58362751      - |    573636        <NA>
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

...
<27362 more elements>

And then if you subset you have to make sure that your Entrez Gene IDs are character, not numeric, as the names of the GRangesList are character. You could subset using numbers, but that won't result in what you think.

## subset using numbers

> names( ex.gn[1:40])
 [1] "1"         "10"        "100"       "1000"      "10000"     "100009613"
 [7] "100009667" "100009676" "10001"     "10002"     "10003"     "100033411"
[13] "100033413" "100033414" "100033415" "100033416" "100033417" "100033420"
[19] "100033422" "100033423" "100033424" "100033425" "100033426" "100033427"
[25] "100033428" "100033430" "100033431" "100033432" "100033433" "100033434"
[31] "100033435" "100033436" "100033437" "100033438" "100033439" "100033440"
[37] "100033441" "100033442" "100033443" "100033444"

## those aren't Entrez Gene IDs 1-40!
> names( ex.gn[as.character( 1:40)])
Error: subscript contains invalid names
## this is because there aren't any Entrez Gene IDs like 4, 5, 6,7 ,8, at least not any more
> head( sort( as.numeric( names( ex.gn))),40)
 [1]  1  2  3  9 10 12 13 14 15 16 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[26] 33 34 35 36 37 38 39 40 41 43 47 48 49 50 51

If you have character Entrez Gene IDs, you might need to subset out those that aren't in the names of your GRangesList and follow up at NCBI to see if they have been deprecated.