TxDb.Hsapiens.UCSC.hg19.knownGene - Error: subscript contains out-of-bounds indices for some Entrez codes
Korn • 0
@korn-12440

Hi all,

I'm having a bit of a problem with TxDb.Hsapiens.UCSC.hg19.knownGene. Most of the Entrez identifiers I use work fine with transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene").

For example, this works:

------------------------------------------------------------------------------------------------------------------

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

MYOT <- "9499"  # Entrez Gene ID, stored as a character string

transcriptCoordsByGene.GRangesList.MYOT <-
  transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene")[MYOT]
transcriptCoordsByGene.GRangesList.MYOT
#GRangesList object of length 1:
#$9499 
#GRanges object with 4 ranges and 2 metadata columns:
#seqnames                 ranges strand |     tx_id     tx_name
#<Rle>              <IRanges>  <Rle> | <integer> <character>
#[1]     chr5 [137022410, 137223540]      + |     21288  uc011cye.2
#[2]     chr5 [137203545, 137223540]      + |     21290  uc003lbv.3
#[3]     chr5 [137203545, 137223540]      + |     21291  uc011cyg.2
#[4]     chr5 [137203545, 137223540]      + |     21292  uc011cyh.2

---------------------------------------------------------------------------------------------------

However, for some other genes, such as 201625, the Entrez ID for the human DNAH12 gene (I looked it up with library(org.Hs.eg.db) and checked it against NCBI), I get:

Error: subscript contains out-of-bounds indices

Could you please tell me how I can solve this problem? Or are there any other packages I can use to extract the data for these genes?

I need the data so that I can run a motif analysis with the rGADEM package.

I am a medical student and extremely new to R and Bioconductor.

The full set of genes (Entrez identifiers) I need to analyse is:

 [1]   1002  10233 114798 122481 126792 126820 128344 130827   1428 146845   1493 150483 150572
[14] 159989    183   1852 201625   2167  22824  22885  23676 254956 255101 257177  25992  26576
[27] 266629 283152 283726 285141  29895   3067 340286 340706   3860 387712 389125 389177   4617
[40]   4621   4625  51364   5144  51778   5212  54585  55815  56203  56849  56901  57494   6345
[53]  64102  64446 644890   6588   7042   7060   7138   7273   7322   7337    796  79933   8048
[66]   8091   8125  83450  83657  83894     88  89765   9172   9499

Like I said, some of these work perfectly, others don't.

Any help would be appreciated.

Thank you. 

 

bioconductor genomicranges homo sapiens • 1.5k views
@james-w-macdonald-5106

Entrez Gene IDs, like all other identifiers, are not static. Sometimes people realize that two IDs really point to the same thing, and one of them gets discontinued. This happens on a regular basis (monthly, I believe). The org.Hs.eg.db and TxDb.Hsapiens.UCSC.hg19.knownGene packages were built last year at about the same time, and it's possible that some of the changes in the Gene table at NCBI hadn't yet percolated through to whatever table UCSC uses to map Gene IDs to their knownGene IDs.

Anyway, you would do well to first filter your gene IDs to the subset of those that are in the TxDb object, and then you can try to figure out what's up. One possibility is that the IDs are out of sync, in which case you could get the gene_history file from NCBI and then query for the IDs that aren't in the TxDb object to see if there are retired IDs that you could look for in the TxDb object.
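A minimal sketch of that filtering step, using base R only. Here `known_ids` is a hypothetical stand-in for `names(transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene"))`, and the values in both vectors are illustrative rather than taken from the real package:

```r
# Stand-in for the names of the GRangesList returned by transcriptsBy():
# the Entrez Gene IDs the TxDb object knows about, as character strings.
known_ids <- c("88", "183", "9499", "201625")  # illustrative subset

# IDs to look up; the last one is made up so that it fails the check.
my_ids <- c("9499", "201625", "999999999")

# Keep only the IDs that exist in the TxDb object...
present <- my_ids[my_ids %in% known_ids]

# ...and set aside the rest, e.g. to check against NCBI's gene_history file.
missing <- setdiff(my_ids, known_ids)

print(present)
print(missing)
```

With the real object, subsetting by `present` should then avoid the out-of-bounds error, and `missing` is the short list of IDs worth investigating.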

Ah that makes sense... thank you very much

I'll try that now. 


I checked all my gene codes. There was only one mismatch, which I have already removed, and it still didn't work.

All the other codes do exist in the TxDb object.

However, it fails for any code with a value higher than 23459.

That is, any code up to and including that value works (e.g. 23459, 23458, 1223),

but 23460 and above do not (e.g. 23460, 23461, 124234).

Could you please tell me if there is any way to correct this problem?

Thank you


That's most likely because you are subsetting your list using integer Entrez Gene IDs. But the names of your list aren't integers, they are character strings. In other words, if you do

txlst[124234]

you are saying 'give me the 124234th list item'. But there aren't 124,234 genes in the list! There are only about 23,000, which would explain why 23459 is the last value that works for you. What you want to do is

txlst["124234"]

which will give you the list item that has that Entrez Gene ID as its name.
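The difference can be sketched in base R. The list below is a hypothetical stand-in for the GRangesList, with placeholder strings instead of transcript ranges. One caveat: a plain list pads out-of-range positions with NULL, whereas a GRangesList raises the "subscript contains out-of-bounds indices" error.

```r
# Stand-in for the GRangesList returned by transcriptsBy(): a plain list
# whose names are Entrez Gene IDs stored as character strings.
txlst <- list("9499" = "MYOT transcripts", "201625" = "DNAH12 transcripts")

ids <- c(9499L, 201625L)  # integer IDs, e.g. as read from a spreadsheet

# Integer subscripts mean positions. A base list pads with NULL entries;
# a GRangesList raises the out-of-bounds error instead.
by_position <- txlst[ids]

# Character subscripts are matched against names(txlst).
by_name <- txlst[as.character(ids)]

print(names(by_name))
```

If the IDs are stored as doubles rather than integers, round values such as 100000 can be converted to "1e+05" by as.character(); converting with as.integer() first avoids that.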


Oh wow!

I'm just stupid then, I guess...

OK, so I removed the single code that wasn't valid and then used as.character() to convert everything in my vector to character, and it's now up and running!

Thank you so much!
