Question

TxDb.Hsapiens.UCSC.hg19.knownGene - Error: subscript contains out-of-bounds indices for some Entrez codes


Korn • 0


Hi all,

I'm having a problem with TxDb.Hsapiens.UCSC.hg19.knownGene. Most of the Entrez identifiers I use work fine with the call transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene").

For example, this one works:

------------------------------------------------------------------------------------------------------------------

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

# Entrez Gene ID for MYOT, stored as a character string
MYOT <- "9499"

# Group transcripts by gene, then select the MYOT entry by name
transcriptCoordsByGene.GRangesList.MYOT <-
  transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene")[MYOT]
transcriptCoordsByGene.GRangesList.MYOT
# GRangesList object of length 1:
# $9499
# GRanges object with 4 ranges and 2 metadata columns:
#       seqnames                 ranges strand |     tx_id     tx_name
#          <Rle>              <IRanges>  <Rle> | <integer> <character>
#   [1]     chr5 [137022410, 137223540]      + |     21288  uc011cye.2
#   [2]     chr5 [137203545, 137223540]      + |     21290  uc003lbv.3
#   [3]     chr5 [137203545, 137223540]      + |     21291  uc011cyg.2
#   [4]     chr5 [137203545, 137223540]      + |     21292  uc011cyh.2

---------------------------------------------------------------------------------------------------

However, for some other genes, such as 201625, the Entrez ID for the human DNAH12 gene (I looked it up with library(org.Hs.eg.db) and checked it against NCBI), I get:

Error: subscript contains out-of-bounds indices

Could you please tell me how I can solve this problem?

Or are there any other packages I can use to extract these genes' data?

I need the data so that I can analyse motifs with the rGADEM package.

I am a medical student and extremely new to R and Bioconductor.

The full set of genes (Entrez identifiers) I need to analyse is:

[1] 1002 10233 114798 122481 126792 126820 128344 130827 1428 146845 1493 150483 150572
[14] 159989 183 1852 201625 2167 22824 22885 23676 254956 255101 257177 25992 26576
[27] 266629 283152 283726 285141 29895 3067 340286 340706 3860 387712 389125 389177 4617
[40] 4621 4625 51364 5144 51778 5212 54585 55815 56203 56849 56901 57494 6345
[53] 64102 64446 644890 6588 7042 7060 7138 7273 7322 7337 796 79933 8048
[66] 8091 8125 83450 83657 83894 88 89765 9172 9499

Like I said, some of these work perfectly, others don't.

Any help would be appreciated.

Thank you.

bioconductor genomicranges homo sapiens • 1.5k views

ADD COMMENT • link updated 7.2 years ago by James W. MacDonald 65k • written 7.2 years ago by Korn • 0

score 0 · Answer 1 · 2017-02-24


James W. MacDonald 65k


Entrez Gene IDs, like all other identifiers, are not static. Sometimes people realize that two IDs are really pointing to the same thing, and one gets discontinued. This happens on a regular basis (monthly, I believe). The org.Hs.eg.db and TxDb.Hsapiens.UCSC.hg19.knownGene packages were built last year at about the same time, and it's possible that some of the changes in the Gene table at NCBI hadn't yet percolated through whatever table UCSC uses to map Gene IDs to their knownGene IDs.

Anyway, you would do well to first filter your gene IDs to the subset of those that are in the TxDb object, and then you can try to figure out what's up. One possibility is that the IDs are out of sync, in which case you could get the gene_history file from NCBI and then query for the IDs that aren't in the TxDb object to see if there are retired IDs that you could look for in the TxDb object.
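The filtering step can be sketched roughly as follows. To keep the sketch runnable without Bioconductor installed, a plain named list stands in for the result of transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene"); the list contents, the variable names, and the ID "999999999" are made up for illustration.

```r
# Stand-in for transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene").
# A plain named list is an assumption so the sketch runs without Bioconductor;
# the names play the role of Entrez Gene IDs.
txlst <- list("9499" = "tx_a", "1002" = "tx_b", "88" = "tx_c")

# Hypothetical query vector; "999999999" is a made-up ID that is not in txlst.
my_ids <- c("9499", "88", "999999999")

# Which of the query IDs actually occur as names of the list?
in_txdb <- my_ids %in% names(txlst)

present_ids <- my_ids[in_txdb]   # safe to subset with
missing_ids <- my_ids[!in_txdb]  # candidates to look up in NCBI's gene_history file

# Subsetting by name with only the IDs that are present cannot go out of bounds.
tx_subset <- txlst[present_ids]
```

With the real object, the same `%in% names(...)` test should apply unchanged, since the names of the list returned by transcriptsBy(..., by = "gene") are the gene IDs.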

ADD COMMENT • link 7.2 years ago James W. MacDonald 65k


Ah that makes sense... thank you very much

I'll try that now.

ADD REPLY • link 7.2 years ago Korn • 0


I checked all my gene codes. There was only one mismatch, which I have already removed, and it still didn't work.

All the other codes do exist in that human genome file.

However, it won't work if I use any code with a value higher than 23459.

That is, any code up to 23459 works (e.g. 23459, 23458, 1223),

but 23460 and above does not (e.g. 23460, 23461, 124234).

Could you please tell me if there is any way I could correct this problem?

Thank you

ADD REPLY • link 7.2 years ago Korn • 0


That's most likely because you are subsetting your list using integer Entrez Gene IDs, but the names of your list aren't integers; they are character strings. In other words, if you do

txlst[124234]

you are saying 'give me the 124234th list item'. But there aren't 124,234 genes! There are only 22,000 or so. What you want to do is

txlst["124234"]

which will give you the list item that has that Entrez Gene ID as its name.
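As a small illustration of the difference, here is a sketch that uses a plain named list as a stand-in for txlst (an assumption, so that it runs without Bioconductor; the list contents are made up). Converting the IDs with as.character() before subsetting selects by name rather than by position.

```r
# Stand-in for the list returned by transcriptsBy(); the names are Entrez Gene IDs.
txlst <- list("88" = "tx_a", "9499" = "tx_b", "124234" = "tx_c")

# IDs stored as numbers, as they would be when read in without quotes.
# Integer literals (L suffix) are used because as.character() on a round
# double such as 1e5 gives "1e+05", which would not match any name.
ids_num <- c(88L, 9499L, 124234L)

# Convert to character so that `[` matches on names instead of positions.
ids_chr <- as.character(ids_num)
by_name <- txlst[ids_chr]

# A numeric subscript is a position: this returns the 2nd element of the list,
# which happens to be gene "9499", not a gene whose ID is 2.
by_pos <- txlst[2]
```

On a plain list, an out-of-range position such as txlst[124234] quietly returns an empty entry, whereas the GRangesList in the question stops with "subscript contains out-of-bounds indices".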

ADD REPLY • link 7.2 years ago James W. MacDonald 65k


oh wow!

I'm just stupid then I guess....

OK, so I removed the single code that wasn't valid, then used as.character to convert everything in my vector to character, and it's now up and running!!

Thank you so much!!

ADD REPLY • link 7.2 years ago Korn • 0