What you are doing will give you the transcript boundaries, not the exon boundaries. So you won't be getting anything I would normally consider exons. If you want the exons, you need to use exonsBy
. And how you use that depends on what you want. You can get all the exons by transcript, or all the exons by gene. In your case you should probably do it by gene.
> ex.gn <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, "gene")
> ex.gn
GRangesList object of length 27363:
$`1`
GRanges object with 29 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr19 58345178-58347029 - | 573605 <NA>
[2] chr19 58345183-58347029 - | 573606 <NA>
[3] chr19 58346854-58347029 - | 573607 <NA>
[4] chr19 58346858-58347029 - | 573608 <NA>
[5] chr19 58346860-58347029 - | 573609 <NA>
... ... ... ... . ... ...
[25] chr19 58357585-58357649 - | 573632 <NA>
[26] chr19 58357952-58358286 - | 573633 <NA>
[27] chr19 58358489-58358585 - | 573634 <NA>
[28] chr19 58359197-58359323 - | 573635 <NA>
[29] chr19 58362677-58362751 - | 573636 <NA>
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
...
<27362 more elements>
And then if you subset you have to make sure that your Entrez Gene IDs are character, not numeric, as the names of the GRangesList
are character. You could subset using numbers, but that won't result in what you think.
## subset using numbers
> names( ex.gn[1:40])
[1] "1" "10" "100" "1000" "10000" "100009613"
[7] "100009667" "100009676" "10001" "10002" "10003" "100033411"
[13] "100033413" "100033414" "100033415" "100033416" "100033417" "100033420"
[19] "100033422" "100033423" "100033424" "100033425" "100033426" "100033427"
[25] "100033428" "100033430" "100033431" "100033432" "100033433" "100033434"
[31] "100033435" "100033436" "100033437" "100033438" "100033439" "100033440"
[37] "100033441" "100033442" "100033443" "100033444"
## those aren't Entrez Gene IDs 1-40!
> names( ex.gn[as.character( 1:40)])
Error: subscript contains invalid names
## this is because there aren't any Entrez Gene IDs like 4, 5, 6,7 ,8, at least not any more
> head( sort( as.numeric( names( ex.gn))),40)
[1] 1 2 3 9 10 12 13 14 15 16 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[26] 33 34 35 36 37 38 39 40 41 43 47 48 49 50 51
If you have character Entrez Gene IDs, you might need to subset out those that aren't in the names of your GRangesList
and follow up at NCBI to see if they have been deprecated.
.......................................................
thank you for your answer. I see I should have used
exonsBy
instead oftranscriptsBy
. however my question still stands. I use a code like thiswhich takes a long time to run. I'm looking for a more efficient code for this task
You only run exonsBy one time, and then subset using the entire vector of IDs, like I showed in my original example, rather than sub setting out one at a time like that.