Question

Annotating codelink data file

0

Entering edit mode

Agaz Hussain Wani ▴ 260

@agaz-hussain-wani-7620

Last seen 6.0 years ago

India

I tried to annotate a codelink file using h20kcod.db with the following script.

library(codelink)
library(h20kcod.db)
library(GEOquery)

f = list.files(pattern = "TXT")
codset = readCodelinkSet(filename = f[[1]])
probes.sel <- as.character(fData(codset)$probeName[1:100])

select(h20kcod.db, keys = probes.sel, columns = c("SYMBOL", "GENENAME"), )

The codelink file looks as shown below.

     probeName          probeType logicalRow  logicalCol  meanSNR
1  NM_012429.1_PROBE1 DISCOVERY          1          9  0.7156364
2  NM_003980.2_PROBE1 DISCOVERY          1         10  1.7474178
3     AY044449_PROBE1 DISCOVERY          1         11  1.4991542
4  NM_005015.1_PROBE1 DISCOVERY          1         12  7.0496070
5     AB037823_PROBE1 DISCOVERY          1         13 10.1904870
6  NM_032986.1_PROBE1 DISCOVERY          1         14  1.0015290
7     AB032981_PROBE1 DISCOVERY          1         16  1.4521567
8  NM_001907.1_PROBE1 DISCOVERY          1         17  1.7437525
9     AB037745_PROBE1 DISCOVERY          1         18  0.9601950
10 NM_007039.1_PROBE1 DISCOVERY          1         19  0.9245058

I get a message :

Error in .testForValidKeys(x, keys, keytype) :
  None of the keys entered are valid keys for 'PROBEID'. Please use the keys method to see a listing of valid arguments.

May i know what is going wrong here. Thanks

r codelink annotation annotationdbi • 2.8k views

ADD COMMENT • link updated 9.0 years ago by Diego Diez ▴ 760 • written 9.0 years ago by Agaz Hussain Wani ▴ 260

1

Entering edit mode

Perhaps you can help us. You have an error message that is intended to explain the problem and point to a solution. Can you explain what you don't understand about the error message so we can improve it?

ADD REPLY • link 9.0 years ago James W. MacDonald 65k

0

Entering edit mode

I understood , none of the keys (for example i use probNames as keys to search in h20kcod.db to get annotation ) are valid . So i think i need valid keys to be matched in annotation file to get each probe name annotated. probe-ID's are valid which unfortunately I do not have in above data frame. So don't know what should i use as valid key.

ADD REPLY • link 9.0 years ago Agaz Hussain Wani ▴ 260

0

Entering edit mode

OK, so what are those probeNames in that table? Do you perhaps think you could use those somehow?

ADD REPLY • link 9.0 years ago James W. MacDonald 65k

0

Entering edit mode

I think if i am able to get the probe ID's of those probeNames, then it should work. I am not sure about this but i think it should be done. Am i right??

ADD REPLY • link 9.0 years ago Agaz Hussain Wani ▴ 260

1

Entering edit mode

Well hypothetically, yes. I don't know anything about the data you have in hand, so I can't say if you can get the probe IDs or not. But you are missing my point, which is to say that you might already have everything you need, so long as you are willing to think about what you are doing. So here is the first probeName from your file:

NM_012429.1_PROBE1

That is a concatenation of two things: NM_012429.1 and _PROBE1. Do you know what that first thing is? If not, why don't you try googling it?

ADD REPLY • link 9.0 years ago James W. MacDonald 65k

0

Entering edit mode

If i am not wrong, then its REFSEQ key in h20kcod.db. But i have not seen .1 or something in that, may be i have missed something.

ADD REPLY • link 9.0 years ago Agaz Hussain Wani ▴ 260

0

Entering edit mode

Exactly! Now we are getting somewhere. So if you have the RefSeq IDs for each row, then how would you get the HUGO symbol and gene name?

As a hint, the RefSeq ID is something specific to the organism (H. sapiens, in this case), not the array used. So it is not likely to be a useful key for the h20kcod.db package.

ADD REPLY • link 9.0 years ago James W. MacDonald 65k

0

Entering edit mode

On the basis of RefSeqIDs i can get the ProbeID's, Gene symbol, Gene Name, indeed anything else from h20kcod.db

But as you mentioned ,its not a reliable key to trust on, what about

Annotation_NCBI_Acc

or

Annotation_UniGene

both of which are present in my data file.

ADD REPLY • link 9.0 years ago Agaz Hussain Wani ▴ 260

1

Entering edit mode

I don't recall saying RefSeq IDs were not reliable, but whatever you like. It's up to you.

ADD REPLY • link 9.0 years ago James W. MacDonald 65k

0

Entering edit mode

Nice way of solving my problem :) . Thanks a lot.

ADD REPLY • link 9.0 years ago Agaz Hussain Wani ▴ 260

0

Entering edit mode

Thanks James for your contribution and suggestion. Indeed, this could be a possible approach. I suggested another one in my response, plus I stole and added your suggestion.

ADD REPLY • link 9.0 years ago Diego Diez ▴ 760

score 0 · Answer 1 · 2015-04-30

0

Entering edit mode

Diego Diez ▴ 760

@diego-diez-4520

Last seen 3.5 years ago

Japan

What is going on wrong here is basically what I explained to you in this comment: C: How to prepare targets.txt file in codelink.

The dataset you are using contains legacy probe names (i.e. NM_012429.1_PROBE1) instead of the currently used Codelink ids (e.g. GE53815). Therefore, none of the keys obtained by probeNames() works with the h20kcod.db annotation package. To solve this problem in the future, I may have to consider remapping the names when reading the files. But this will not be immediately implemented. For now you have two options:

1. Do as James suggests and extract the accession number from the legacy probe name. For example:

> library(org.Hs.eg.db)
> select(org.Hs.eg.db, key = c("NM_005015", "AY044449"), keytype = "ACCNUM", columns = c("ENTREZID", "SYMBOL"))
     ACCNUM ENTREZID SYMBOL
1 NM_005015     5018  OXA1L
2  AY044449    57538  ALPK3

2. The other option is do as C: How to prepare targets.txt file in codelink. and use the GEOquery package instead:

> library(GEOquery)
> library(h20kcod.db)
> eset <- getGEO("GSE4797")[[1]]
> probes.sel <- as.character(fData(eset)$PROBE_NAME[1:10])
> select(h20kcod.db, keys = probes.sel, columns = c("ENTREZID", "SYMBOL"))

ADD COMMENT • link 9.0 years ago Diego Diez ▴ 760

0

Entering edit mode

Thanks Diego for your valuable suggestion. I already did some exercise with your suggested GEOqyery() and it works fine.

Then i gave a try for the data which is already downloaded and present in my working directory. I did something like eset <- getGEO(filename = "GSM108294.TXT") to access my file. In reply i got the error

Error in read.table(con, sep = "\t", header = FALSE, nrows = nseries) :
  invalid 'nlines' argument

If there is any provision to use the downloaded data with getGEO()?? then the above should work.

ADD REPLY • link 9.0 years ago Agaz Hussain Wani ▴ 260

0

Entering edit mode

I do not know. Check the GEOquery help/vignette.

ADD REPLY • link 9.0 years ago Diego Diez ▴ 760

0

Entering edit mode

Note that you can usually just use ACCNUM for all GeneBank/RefSeq IDs:

> ids
 [1] "NM_012429" "NM_003980" "AY044449"  "NM_005015" "AB037823"  "NM_032986"
 [7] "NM_032986" "NM_001907" "AB037745"  "NM_007039"
> select(org.Hs.eg.db, ids, c("SYMBOL","GENENAME"), "ACCNUM")
      ACCNUM   SYMBOL                                           GENENAME
1  NM_012429  SEC14L2                       SEC14-like 2 (S. cerevisiae)
2  NM_003980     MAP7                   microtubule-associated protein 7
3   AY044449    ALPK3                                     alpha-kinase 3
4  NM_005015    OXA1L             oxidase (cytochrome c) assembly 1-like
5   AB037823    CHPF2                  chondroitin polymerizing factor 2
6  NM_032986   SEC23B                    Sec23 homolog B (S. cerevisiae)
7  NM_032986   SEC23B                    Sec23 homolog B (S. cerevisiae)
8  NM_001907     CTRL                                  chymotrypsin-like
9   AB037745 KIAA1324                                           KIAA1324
10 NM_007039   PTPN21 protein tyrosine phosphatase, non-receptor type 21

ADD REPLY • link 9.0 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks James for your help. Are there cases of no gene name for some ACCNUM, as i got NA for PCTRL1, PCTRL2 and for some more.

ADD REPLY • link 9.0 years ago Agaz Hussain Wani ▴ 260

0

Entering edit mode

That is very cool indeed, and a feature I did not know existed. There are some other identifiers of unclear origin that do not seem to match, like 1137103.2_PROBE1 below. I will update my answer to reflect this finding. At any rate, it is up to the OP to choose the option that best suits him.

# using the OP's dataset and reading it with readCodelinkSet() function in codelink package.
> tail(probeNames(ecod))
tail(probeNames(ecod),20)
 [1] "125883.1_PROBE1"    "NM_006756.1_PROBE1" "NM_002794.1_PROBE1"
 [4] "NM_004749.1_PROBE1" "NM_000985.2_PROBE1" "NM_001923.2_PROBE1"
 [7] "1024918.1_PROBE1"   "NM_003406.1_PROBE1" "NM_013417.1_PROBE1"
[10] "NM_002650.1_PROBE1" "NM_003564.1_PROBE1" "NM_004517.1_PROBE1"
[13] "114225.1_PROBE1"    "NM_004037.2_PROBE1" "NM_001605.1_PROBE1"
[16] "AF134825_PROBE1"    "NM_002802.1_PROBE1" "NM_006400.2_PROBE1"
[19] "1137103.2_PROBE1"   "NM_001087.1_PROBE1"

ADD REPLY • link 9.0 years ago Diego Diez ▴ 760

0

Entering edit mode

Thanks Diego for explaining. Is this the case with probeNames() only or what ever the key is taken into account.

For example, if i use ProbeID or any other key( for which probeName is not of known origin), are there chances of a different result.??

Can such cases be avoided??

ADD REPLY • link 9.0 years ago Agaz Hussain Wani ▴ 260