Question: How to get feature ID's from the raw codelink data.
0
gravatar for Agaz Hussain Wani
4.6 years ago by
India
Agaz Hussain Wani260 wrote:

My codelink data looks like below

CodeLink Expression Analysis 4.1.0.29054                                    
UKE Report for Slide (T00285152)                                    
LAYOUT EXP10336X2-770.22.ID                                  
PROJECT                                    
EXPERIMENT                                    
Report( 1 ): Testis                                    
--------------------------------------------------------------------------------                                    
Idx Array Sample_name Probe_name Annotation_PIN Annotation_NCBI_Acc Annotation_NCBI_NID Annotation_LocusLink Annotation_OGS Annotation_UniGene Annotation_ENSEMBL Probe_type  Feature_id  Raw_intensity Normalized_intensity Quality_flag Signal_strength Logical_row Logical_col
1 1 Sample001 NM_012429.1_PROBE1   NM_012429   23541   Hs.277728   DISCOVERY  109 149.3979 0.3639 G 1.051 1 9
2 1 Sample001 NM_003980.2_PROBE1   NM_003980   9053   Hs.146388   DISCOVERY  110 253.3235 0.617 G 1.5776 1 10
3 1 Sample001 AY044449_PROBE1   AY044449   NULL   Hs.301242   DISCOVERY  111 376.9439 0.9181 G 1.8513 1 11
4 1 Sample001 NM_005015.1_PROBE1   NM_005015   5018   Hs.151134   DISCOVERY  112 2112.2793 5.1446 G 7.8443 1 12
5 1 Sample001 AB037823_PROBE1   AB037823   54480   Hs.86392   DISCOVERY  113 2860.5891 6.9672 G 11.6142 1 13
6 1 Sample001 NM_032986.1_PROBE1   NM_032986   10483   Hs.173497   DISCOVERY  114 272.215 0.663 G 1.5529 1 14
7 1 Sample001 AB032981_PROBE1   AB032981   57218   Hs.102657   DISCOVERY  116 225.6981 0.5497 G 1.4131 1 16
8 1 Sample001 NM_001907.1_PROBE1   NM_001907   1506   Hs.150601   DISCOVERY  117 315.7597 0.7691 G 1.6992 1 17
9 1 Sample001 AB037745_PROBE1   AB037745   57535   Hs.104696   DISCOVERY  118 56.3103 0.1371 L 0.7039 1 18
10 1 Sample001 NM_007039.1_PROBE1   NM_007039   11099   Hs.155693   DISCOVERY  119 164.9445 0.4017 G 1.0947 1 19

 This  part of codelink raw data  I want to process say by codelink package.

To get the expression value i use exprs() and to get intensity value getSNR(). If i am interested to get the feature ids , which can be seen in the column with header Feature_id (109,110,111 and so on)in the above shown sample data. How to get Feature_id column with the help of codelink package. Thanks in advance.

 

codelink R • 727 views
ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Agaz Hussain Wani260
Answer: How to get feature ID's from the raw codelink data.
0
gravatar for Diego Diez
4.6 years ago by
Diego Diez750
Japan
Diego Diez750 wrote:

If your raw data contains the Feature_id column, readCodelinkSet() should be able to detect it and the information should be available using featureNames() method. One limitation is that all the files need to contain the column. If this is not the case, then codelink will treat the dataset as if Feature_id is missing.

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Diego Diez750

If i am not wrong , featureNames() give the first column with header Idx , not the column with header Feature_id

ADD REPLYlink written 4.6 years ago by Agaz Hussain Wani260

Well, I think you are wrong. Have you tried or are you just guessing? Once you try my advice and if you get anything different then you post the results. Otherwise I think I have already given you enough information and I am still unable to really understand what are your motivations.

ADD REPLYlink written 4.6 years ago by Diego Diez750
Answer: How to get feature ID's from the raw codelink data.
0
gravatar for Agaz Hussain Wani
4.6 years ago by
India
Agaz Hussain Wani260 wrote:

May be i am wrong . Let me display my results also.

f = list.files(pattern = "TXT")

f

[1] "GSM108293.TXT"

codset = readCodelinkSet(filename = f[[i]])

** product: Unknown
** chip size: 20469
** reading file 1 of 1 : GSM108293.TXT
  + detected . as decimal symbol.
  + sample Name: GSM108293.TXT
** applying flags to MSR spots ...OK
** computing weights ...OK
** computing SNR ...OK

> head(exprs(codset))
          1
1  237.4510
2  566.4462
3  473.6080
4 1976.1409
5 3079.4490
6  404.0833

 

> head(getSNR(codset))
           1
1  0.7156364
2  1.7474178
3  1.4991542
4  7.0496070
5 10.1904870
6  1.0015290

> head(featureNames(codset),20)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"

I think nothing else is left to show.

 

ADD COMMENTlink written 4.6 years ago by Agaz Hussain Wani260

I see. Everything I said still applies. In that GEO dataset, not all the files contain the Feature_id column. Check for example: GSM108109.TXT.gz. This is basically what I meant with the inconsistency. Now, if one of the files fails to contain the Feature_id column, then codelink will not use it, because we cannot assume the ordering to be the same. Indeed, I found some datasets in the past with different ordering. There is nothing you can really do. You cannot have Feature_id unless you make strong assumptions and certainly codelink package does not allow it. I will update my answer to reflect this.

ADD REPLYlink written 4.6 years ago by Diego Diez750

You are absolutely right regarding the column Feature_id missing in some files. I am providing only one data file to process by your codelink package. How it affects a single file having Feature_id present.

If the user want to give many files in which Feature_id column is missing in some files, then codelink should apply the rule of not using the Feature_id column, otherwise the case should be different.

ADD REPLYlink written 4.6 years ago by Agaz Hussain Wani260

OK, now I get what you mean. After checking the code I remembered the main reason to use Feature_id internally is to ensure proper ordering of the rows. But then the information is not used for the row names in the expression matrix. This is intentional since otherwise different datasets from the same chip would have different row names, which would not be good. However, the Feature_id information was meant to be stored somewhere. In the old Codelink-class it is present in the $id slot. It seems in the CodelinkSet-class I did not include it. So in that regard you were right, thank you. I will send a bug fix for this when I find the time and update the answer then. Out of curiosity, I still do not understand why do you need that info anyway- it is never used in the analysis of data.

ADD REPLYlink written 4.6 years ago by Diego Diez750

I think that column can be achieved by using CodelinkSet class.  

i used

fea <- CodelinkSet(f) ## f is a data file

ids <- fea$id  ## this will give the feature ids if present otherwise NA.

As i told you before, it is used for annotation. Off-course we can annotate the date by GEOquery() in an efficient way  as explained by you earlier. But if the user have a data taken from GEO , one way of annotating the data is using these feature ids to match in the GPl file to get the annotation(Gene name and Gene symbol and so on) .  I am not sure if there are  other ways also to do annotation of data present on GEO and if the GEO data is processed by your codelink package.

if the feature ids are not use full at all, then why every data on GEO has feature id.?? they could have easily neglected it and used probe name or probe id instead.

Anyway, If i process the raw data, my interest is to get a data frame containing gene name or symbol, signal intensity, SNR and call for each individual GSM file like below

 

Gene Symbol     Intensity      SNR         Call

KIAA0484             1.20000      0.0231      A

KIAA0495           21.0032        0.435       A
KIAA0506             2.3433        1.232       P
KIAA0515           11.023843    2.12         P

KIAA0556             10.21          3.21         P
KIAA0586               9.21           2.00        P

After the preparation of data like this for all the GSM files ,i go for meta-analysis .

This is your code as you posted for my earlier question

eset <- getGEO("GSE4797")
eset <- eset[[1]]
head(fData(eset))
# do preprocessing, e.g. with limma matrix methods. For example:
y <- backgroundCorrect.matrix(eset, offset = 50, method = "normexp")
y <- normalizeBetweenArrays(y, method = "quantile")
exprs(eset) <- y

I can get the expression from the above code .Then what about SNR here?? call, i can decide on the basis of SNR value.   Annotation can be done  by using GEO query as already well explained. Thanks

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Agaz Hussain Wani260

Hussain,

  • You cannot do CodelinkSet(f), there is no function in the codelink package called CodelinkSet().
  • If fea is a CodelinkSet object, you cannot do fea$id, that only works in a Codelink object (as I've explained to you already).
  • Maybe you can get annotations from GEO using the Feature_id info. I do not know. Are these annotations up-to-date? Bioconductor annotations are updated every six months and you can get them with the Probe_name information, even if you obtain the data using GEOquery (as I have explained to you already).
  • To know more about how to use the package please read the vignette.
  • I will update the answer to this post once I send a fix.

Good luck. 

ADD REPLYlink written 4.6 years ago by Diego Diez750

Sorry for typo mistake, its actually  readCodelink(). It serves the purpose of getting the Feature_id if present in the file unlike readCodelinkSet(), which was supposed to do this job. Off-course annotations are updated, but may be not after regular intervals. 

ADD REPLYlink written 4.6 years ago by Agaz Hussain Wani260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 322 users visited in the last hour