how to get exons of multiple genes
2
0
Entering edit mode
@pegahtaklifi-24293
Last seen 9 weeks ago
Iran

when I try to get exons of a certain gene I run a code like this

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

exon<-transcriptsBy(txdb,by="gene") [[geneID]]

where geneID is the ENTREZ ID of gene. this works properly, but what can I do if I want to get exons of multiple genes ? I know I can get exons of each gene separately and then merge them but I'm looking for a more efficient way. I tried this code

exons<-transcriptsBy(txdb,by="gene") [geneVec]

where geneVec is a vector of gene IDs and got this error

Error: subscript contains invalid names

software error • 325 views
3
Entering edit mode
@james-w-macdonald-5106
Last seen 1 hour ago
United States

What you are doing will give you the transcript boundaries, not the exon boundaries. So you won't be getting anything I would normally consider exons. If you want the exons, you need to use exonsBy. And how you use that depends on what you want. You can get all the exons by transcript, or all the exons by gene. In your case you should probably do it by gene.


> ex.gn <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, "gene")
> ex.gn
GRangesList object of length 27363:
\$1
GRanges object with 29 ranges and 2 metadata columns:
seqnames            ranges strand |   exon_id   exon_name
<Rle>         <IRanges>  <Rle> | <integer> <character>
[1]    chr19 58345178-58347029      - |    573605        <NA>
[2]    chr19 58345183-58347029      - |    573606        <NA>
[3]    chr19 58346854-58347029      - |    573607        <NA>
[4]    chr19 58346858-58347029      - |    573608        <NA>
[5]    chr19 58346860-58347029      - |    573609        <NA>
...      ...               ...    ... .       ...         ...
[25]    chr19 58357585-58357649      - |    573632        <NA>
[26]    chr19 58357952-58358286      - |    573633        <NA>
[27]    chr19 58358489-58358585      - |    573634        <NA>
[28]    chr19 58359197-58359323      - |    573635        <NA>
[29]    chr19 58362677-58362751      - |    573636        <NA>
-------
seqinfo: 595 sequences (1 circular) from hg38 genome

...
<27362 more elements>



And then if you subset you have to make sure that your Entrez Gene IDs are character, not numeric, as the names of the GRangesList are character. You could subset using numbers, but that won't result in what you think.

## subset using numbers

> names( ex.gn[1:40])
[1] "1"         "10"        "100"       "1000"      "10000"     "100009613"
[7] "100009667" "100009676" "10001"     "10002"     "10003"     "100033411"
[13] "100033413" "100033414" "100033415" "100033416" "100033417" "100033420"
[19] "100033422" "100033423" "100033424" "100033425" "100033426" "100033427"
[25] "100033428" "100033430" "100033431" "100033432" "100033433" "100033434"
[31] "100033435" "100033436" "100033437" "100033438" "100033439" "100033440"
[37] "100033441" "100033442" "100033443" "100033444"

## those aren't Entrez Gene IDs 1-40!
> names( ex.gn[as.character( 1:40)])
Error: subscript contains invalid names
## this is because there aren't any Entrez Gene IDs like 4, 5, 6,7 ,8, at least not any more
> head( sort( as.numeric( names( ex.gn))),40)
[1]  1  2  3  9 10 12 13 14 15 16 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[26] 33 34 35 36 37 38 39 40 41 43 47 48 49 50 51



If you have character Entrez Gene IDs, you might need to subset out those that aren't in the names of your GRangesList and follow up at NCBI to see if they have been deprecated.

0
Entering edit mode

.......................................................

0
Entering edit mode

thank you for your answer. I see I should have used exonsBy instead of transcriptsBy. however my question still stands. I use a code like this

 exons<-list()
for(i in 1: length(geneVec))
{
exons[i]<-exonsBy(txdb , by="gene")[[geneVec[i]]]
}

BRCAexons_GRanges<- do.call("c" ,exons )


which takes a long time to run. I'm looking for a more efficient code for this task

1
Entering edit mode

You only run exonsBy one time, and then subset using the entire vector of IDs, like I showed in my original example, rather than sub setting out one at a time like that.

0
Entering edit mode
@pegahtaklifi-24293
Last seen 9 weeks ago
Iran

I understand now that this code was actually correct

 exons<-exonsBy(txdb , by="gene")[geneVec]


but there was a problem with my gene IDs vector. I used this code to convert gene symbols list to gene ENTREZ IDs which gave NA for some genes or incorrect IDs

geneVec<-as.vector(mapIds(org.Hs.eg.db, s, 'ENTREZID', 'SYMBOL') )