Question: Annotating codelink data file
0
gravatar for Agaz Hussain Wani
4.6 years ago by
India
Agaz Hussain Wani260 wrote:

I tried to annotate a codelink file using h20kcod.db  with the following script.

library(codelink)
library(h20kcod.db)
library(GEOquery)

f = list.files(pattern = "TXT")
codset = readCodelinkSet(filename = f[[1]])
probes.sel <- as.character(fData(codset)$probeName[1:100])

select(h20kcod.db, keys = probes.sel, columns = c("SYMBOL", "GENENAME"), )

The codelink file looks as shown below.

     probeName          probeType logicalRow  logicalCol  meanSNR
1  NM_012429.1_PROBE1 DISCOVERY          1          9  0.7156364
2  NM_003980.2_PROBE1 DISCOVERY          1         10  1.7474178
3     AY044449_PROBE1 DISCOVERY          1         11  1.4991542
4  NM_005015.1_PROBE1 DISCOVERY          1         12  7.0496070
5     AB037823_PROBE1 DISCOVERY          1         13 10.1904870
6  NM_032986.1_PROBE1 DISCOVERY          1         14  1.0015290
7     AB032981_PROBE1 DISCOVERY          1         16  1.4521567
8  NM_001907.1_PROBE1 DISCOVERY          1         17  1.7437525
9     AB037745_PROBE1 DISCOVERY          1         18  0.9601950
10 NM_007039.1_PROBE1 DISCOVERY          1         19  0.9245058

I get a message :

Error in .testForValidKeys(x, keys, keytype) :
  None of the keys entered are valid keys for 'PROBEID'. Please use the keys method to see a listing of valid arguments.

May i know what is going wrong here. Thanks

ADD COMMENTlink modified 4.6 years ago by Diego Diez750 • written 4.6 years ago by Agaz Hussain Wani260
1

Perhaps you can help us. You have an error message that is intended to explain the problem and point to a solution. Can you explain what you don't understand about the error message so we can improve it?

ADD REPLYlink written 4.6 years ago by James W. MacDonald52k

I understood , none of the keys (for example i use probNames as keys to search in h20kcod.db to get annotation ) are valid . So i think i need valid keys to be matched in annotation file to get each probe name annotated. probe-ID's are valid which unfortunately  I do not have in above data frame. So don't know what should i use as valid key.

ADD REPLYlink written 4.6 years ago by Agaz Hussain Wani260

OK, so what are those probeNames in that table? Do you perhaps think you could use those somehow?

ADD REPLYlink written 4.6 years ago by James W. MacDonald52k

I think if i am able to get the probe ID's of those probeNames, then it should work. I am not sure about this but i think it should be done. Am i right??

ADD REPLYlink written 4.6 years ago by Agaz Hussain Wani260
1

Well hypothetically, yes. I don't know anything about the data you have in hand, so I can't say if you can get the probe IDs or not. But you are missing my point, which is to say that you might already have everything you need, so long as you are willing to think about what you are doing. So here is the first probeName from your file:

NM_012429.1_PROBE1

That is a concatenation of two things: NM_012429.1 and _PROBE1. Do you know what that first thing is? If not, why don't you try googling it?

ADD REPLYlink written 4.6 years ago by James W. MacDonald52k

If i am not wrong, then  its REFSEQ key in h20kcod.db. But i have not seen .1 or something in that, may be i have missed something.
 

ADD REPLYlink written 4.6 years ago by Agaz Hussain Wani260

Exactly! Now we are getting somewhere. So if you have the RefSeq IDs for each row, then how would you get the HUGO symbol and gene name?

As a hint, the RefSeq ID is something specific to the organism (H. sapiens, in this case), not the array used. So it is not likely to be a useful key for the h20kcod.db package.

ADD REPLYlink written 4.6 years ago by James W. MacDonald52k

On the basis of RefSeqIDs i can get the ProbeID's, Gene symbol, Gene Name, indeed anything else from h20kcod.db

But as you mentioned ,its not a reliable key to trust on, what about

Annotation_NCBI_Acc

 or 

Annotation_UniGene

 both of which are present in my data file.

ADD REPLYlink written 4.6 years ago by Agaz Hussain Wani260
1

I don't recall saying RefSeq IDs were not reliable, but whatever you like. It's up to you.

ADD REPLYlink written 4.6 years ago by James W. MacDonald52k

Nice way of solving my problem :) . Thanks a lot.
 

ADD REPLYlink written 4.6 years ago by Agaz Hussain Wani260

Thanks James for your contribution and suggestion. Indeed, this could be a possible approach. I suggested another one in my response, plus I stole and added your suggestion.

ADD REPLYlink written 4.6 years ago by Diego Diez750
Answer: Annotating codelink data file
0
gravatar for Diego Diez
4.6 years ago by
Diego Diez750
Japan
Diego Diez750 wrote:

What is going on wrong here is basically what I explained to you in this comment: C: How to prepare targets.txt file in codelink.

The dataset you are using contains legacy probe names (i.e. NM_012429.1_PROBE1) instead of the currently used Codelink ids (e.g. GE53815). Therefore, none of the keys obtained by probeNames() works with the h20kcod.db annotation package. To solve this problem in the future, I may have to consider remapping the names when reading the files. But this will not be immediately implemented. For now you have two options:

1. Do as James suggests and extract the accession number from the legacy probe name. For example:

> library(org.Hs.eg.db)
> select(org.Hs.eg.db, key = c("NM_005015", "AY044449"), keytype = "ACCNUM", columns = c("ENTREZID", "SYMBOL"))
     ACCNUM ENTREZID SYMBOL
1 NM_005015     5018  OXA1L
2  AY044449    57538  ALPK3

2. The other option is do as C: How to prepare targets.txt file in codelink. and use the GEOquery package instead:

> library(GEOquery)
> library(h20kcod.db)
> eset <- getGEO("GSE4797")[[1]]
> probes.sel <- as.character(fData(eset)$PROBE_NAME[1:10])
> select(h20kcod.db, keys = probes.sel, columns = c("ENTREZID", "SYMBOL"))
ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Diego Diez750

Thanks Diego for your valuable suggestion. I already did some exercise with your suggested GEOqyery() and it works fine.

Then i gave a try for the data which is already downloaded and present in my working directory. I did something like eset <- getGEO(filename = "GSM108294.TXT") to access my file. In reply i got the  error

Error in read.table(con, sep = "\t", header = FALSE, nrows = nseries) :
  invalid 'nlines' argument

If there is any provision to use the downloaded data with getGEO()?? then the above should work.

ADD REPLYlink written 4.6 years ago by Agaz Hussain Wani260

I do not know. Check the GEOquery help/vignette.

ADD REPLYlink written 4.6 years ago by Diego Diez750

Note that you can usually just use ACCNUM for all GeneBank/RefSeq IDs:

> ids
 [1] "NM_012429" "NM_003980" "AY044449"  "NM_005015" "AB037823"  "NM_032986"
 [7] "NM_032986" "NM_001907" "AB037745"  "NM_007039"
> select(org.Hs.eg.db, ids, c("SYMBOL","GENENAME"), "ACCNUM")
      ACCNUM   SYMBOL                                           GENENAME
1  NM_012429  SEC14L2                       SEC14-like 2 (S. cerevisiae)
2  NM_003980     MAP7                   microtubule-associated protein 7
3   AY044449    ALPK3                                     alpha-kinase 3
4  NM_005015    OXA1L             oxidase (cytochrome c) assembly 1-like
5   AB037823    CHPF2                  chondroitin polymerizing factor 2
6  NM_032986   SEC23B                    Sec23 homolog B (S. cerevisiae)
7  NM_032986   SEC23B                    Sec23 homolog B (S. cerevisiae)
8  NM_001907     CTRL                                  chymotrypsin-like
9   AB037745 KIAA1324                                           KIAA1324
10 NM_007039   PTPN21 protein tyrosine phosphatase, non-receptor type 21

 

ADD REPLYlink written 4.6 years ago by James W. MacDonald52k

Thanks James for your help. Are there  cases of no gene name for some ACCNUM, as i got NA for PCTRL1, PCTRL2 and for some more.

ADD REPLYlink written 4.6 years ago by Agaz Hussain Wani260

That is very cool indeed, and a feature I did not know existed. There are some other identifiers of unclear origin that do not seem to match, like 1137103.2_PROBE1 below. I will update my answer to reflect this finding. At any rate, it is up to the OP to choose the option that best suits him.

# using the OP's dataset and reading it with readCodelinkSet() function in codelink package.
> tail(probeNames(ecod))
tail(probeNames(ecod),20)
 [1] "125883.1_PROBE1"    "NM_006756.1_PROBE1" "NM_002794.1_PROBE1"
 [4] "NM_004749.1_PROBE1" "NM_000985.2_PROBE1" "NM_001923.2_PROBE1"
 [7] "1024918.1_PROBE1"   "NM_003406.1_PROBE1" "NM_013417.1_PROBE1"
[10] "NM_002650.1_PROBE1" "NM_003564.1_PROBE1" "NM_004517.1_PROBE1"
[13] "114225.1_PROBE1"    "NM_004037.2_PROBE1" "NM_001605.1_PROBE1"
[16] "AF134825_PROBE1"    "NM_002802.1_PROBE1" "NM_006400.2_PROBE1"
[19] "1137103.2_PROBE1"   "NM_001087.1_PROBE1"
ADD REPLYlink written 4.6 years ago by Diego Diez750

Thanks Diego for explaining. Is this the case with probeNames() only or what ever the key is taken into account.

For example, if  i use ProbeID or any other key( for which probeName is not of known origin), are there chances of a different result.??

Can such cases be avoided??

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Agaz Hussain Wani260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 154 users visited in the last hour