Question: How to get feature ID's from the raw codelink data.
0
4.6 years ago by
India
Agaz Hussain Wani260 wrote:

My codelink data looks like below

 CodeLink Expression Analysis 4.1.0.29054 UKE Report for Slide (T00285152) LAYOUT EXP10336X2-770.22.ID PROJECT EXPERIMENT Report( 1 ): Testis -------------------------------------------------------------------------------- Idx Array Sample_name Probe_name Annotation_PIN Annotation_NCBI_Acc Annotation_NCBI_NID Annotation_LocusLink Annotation_OGS Annotation_UniGene Annotation_ENSEMBL Probe_type Feature_id Raw_intensity Normalized_intensity Quality_flag Signal_strength Logical_row Logical_col 1 1 Sample001 NM_012429.1_PROBE1 NM_012429 23541 Hs.277728 DISCOVERY 109 149.3979 0.3639 G 1.051 1 9 2 1 Sample001 NM_003980.2_PROBE1 NM_003980 9053 Hs.146388 DISCOVERY 110 253.3235 0.617 G 1.5776 1 10 3 1 Sample001 AY044449_PROBE1 AY044449 NULL Hs.301242 DISCOVERY 111 376.9439 0.9181 G 1.8513 1 11 4 1 Sample001 NM_005015.1_PROBE1 NM_005015 5018 Hs.151134 DISCOVERY 112 2112.2793 5.1446 G 7.8443 1 12 5 1 Sample001 AB037823_PROBE1 AB037823 54480 Hs.86392 DISCOVERY 113 2860.5891 6.9672 G 11.6142 1 13 6 1 Sample001 NM_032986.1_PROBE1 NM_032986 10483 Hs.173497 DISCOVERY 114 272.215 0.663 G 1.5529 1 14 7 1 Sample001 AB032981_PROBE1 AB032981 57218 Hs.102657 DISCOVERY 116 225.6981 0.5497 G 1.4131 1 16 8 1 Sample001 NM_001907.1_PROBE1 NM_001907 1506 Hs.150601 DISCOVERY 117 315.7597 0.7691 G 1.6992 1 17 9 1 Sample001 AB037745_PROBE1 AB037745 57535 Hs.104696 DISCOVERY 118 56.3103 0.1371 L 0.7039 1 18 10 1 Sample001 NM_007039.1_PROBE1 NM_007039 11099 Hs.155693 DISCOVERY 119 164.9445 0.4017 G 1.0947 1 19

This  part of codelink raw data  I want to process say by codelink package.

To get the expression value i use exprs() and to get intensity value getSNR(). If i am interested to get the feature ids , which can be seen in the column with header Feature_id (109,110,111 and so on)in the above shown sample data. How to get Feature_id column with the help of codelink package. Thanks in advance.

modified 4.6 years ago • written 4.6 years ago by Agaz Hussain Wani260
0
4.6 years ago by
Diego Diez750
Japan
Diego Diez750 wrote:

If your raw data contains the Feature_id column, readCodelinkSet() should be able to detect it and the information should be available using featureNames() method. One limitation is that all the files need to contain the column. If this is not the case, then codelink will treat the dataset as if Feature_id is missing.

If i am not wrong , featureNames() give the first column with header Idx , not the column with header Feature_id

Well, I think you are wrong. Have you tried or are you just guessing? Once you try my advice and if you get anything different then you post the results. Otherwise I think I have already given you enough information and I am still unable to really understand what are your motivations.

0
4.6 years ago by
India
Agaz Hussain Wani260 wrote:

May be i am wrong . Let me display my results also.

f = list.files(pattern = "TXT")

f

[1] "GSM108293.TXT"

** product: Unknown
** chip size: 20469
** reading file 1 of 1 : GSM108293.TXT
+ detected . as decimal symbol.
+ sample Name: GSM108293.TXT
** applying flags to MSR spots ...OK
** computing weights ...OK
** computing SNR ...OK

1
1  237.4510
2  566.4462
3  473.6080
4 1976.1409
5 3079.4490
6  404.0833

1
1  0.7156364
2  1.7474178
3  1.4991542
4  7.0496070
5 10.1904870
6  1.0015290

[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"

I think nothing else is left to show.

I see. Everything I said still applies. In that GEO dataset, not all the files contain the Feature_id column. Check for example: GSM108109.TXT.gz. This is basically what I meant with the inconsistency. Now, if one of the files fails to contain the Feature_id column, then codelink will not use it, because we cannot assume the ordering to be the same. Indeed, I found some datasets in the past with different ordering. There is nothing you can really do. You cannot have Feature_id unless you make strong assumptions and certainly codelink package does not allow it. I will update my answer to reflect this.

You are absolutely right regarding the column Feature_id missing in some files. I am providing only one data file to process by your codelink package. How it affects a single file having Feature_id present.

If the user want to give many files in which Feature_id column is missing in some files, then codelink should apply the rule of not using the Feature_id column, otherwise the case should be different.

OK, now I get what you mean. After checking the code I remembered the main reason to use Feature_id internally is to ensure proper ordering of the rows. But then the information is not used for the row names in the expression matrix. This is intentional since otherwise different datasets from the same chip would have different row names, which would not be good. However, the Feature_id information was meant to be stored somewhere. In the old Codelink-class it is present in the $id slot. It seems in the CodelinkSet-class I did not include it. So in that regard you were right, thank you. I will send a bug fix for this when I find the time and update the answer then. Out of curiosity, I still do not understand why do you need that info anyway- it is never used in the analysis of data. ADD REPLYlink written 4.6 years ago by Diego Diez750 I think that column can be achieved by using CodelinkSet class. i used fea <- CodelinkSet(f) ## f is a data file ids <- fea$id  ## this will give the feature ids if present otherwise NA.

As i told you before, it is used for annotation. Off-course we can annotate the date by GEOquery() in an efficient way  as explained by you earlier. But if the user have a data taken from GEO , one way of annotating the data is using these feature ids to match in the GPl file to get the annotation(Gene name and Gene symbol and so on) .  I am not sure if there are  other ways also to do annotation of data present on GEO and if the GEO data is processed by your codelink package.

if the feature ids are not use full at all, then why every data on GEO has feature id.?? they could have easily neglected it and used probe name or probe id instead.

Anyway, If i process the raw data, my interest is to get a data frame containing gene name or symbol, signal intensity, SNR and call for each individual GSM file like below

Gene Symbol     Intensity      SNR         Call

KIAA0484             1.20000      0.0231      A

KIAA0495           21.0032        0.435       A
KIAA0506             2.3433        1.232       P
KIAA0515           11.023843    2.12         P

KIAA0556             10.21          3.21         P
KIAA0586               9.21           2.00        P

After the preparation of data like this for all the GSM files ,i go for meta-analysis .

This is your code as you posted for my earlier question

eset <- getGEO("GSE4797")
eset <- eset[[1]]
# do preprocessing, e.g. with limma matrix methods. For example:
y <- backgroundCorrect.matrix(eset, offset = 50, method = "normexp")
y <- normalizeBetweenArrays(y, method = "quantile")
exprs(eset) <- y

I can get the expression from the above code .Then what about SNR here?? call, i can decide on the basis of SNR value.   Annotation can be done  by using GEO query as already well explained. Thanks

Hussain,

• You cannot do CodelinkSet(f), there is no function in the codelink package called CodelinkSet().
• If fea is a CodelinkSet object, you cannot do fea\$id, that only works in a Codelink object (as I've explained to you already).
• Maybe you can get annotations from GEO using the Feature_id info. I do not know. Are these annotations up-to-date? Bioconductor annotations are updated every six months and you can get them with the Probe_name information, even if you obtain the data using GEOquery (as I have explained to you already).
• I will update the answer to this post once I send a fix.

Good luck.