Question

How to get feature ID's from the raw codelink data.

0

Entering edit mode

Agaz Hussain Wani ▴ 260

@agaz-hussain-wani-7620

Last seen 6.1 years ago

India

My codelink data looks like below

CodeLink Expression Analysis 4.1.0.29054
UKE Report for Slide (T00285152)
LAYOUT	EXP10336X2-770.22.ID
PROJECT
EXPERIMENT
Report( 1 ): Testis
--------------------------------------------------------------------------------
Idx	Array	Sample_name	Probe_name	Annotation_PIN	Annotation_NCBI_Acc	Annotation_NCBI_NID	Annotation_LocusLink	Annotation_OGS	Annotation_UniGene	Annotation_ENSEMBL	Probe_type	Feature_id	Raw_intensity	Normalized_intensity	Quality_flag	Signal_strength	Logical_row	Logical_col
1	1	Sample001	NM_012429.1_PROBE1		NM_012429		23541		Hs.277728		DISCOVERY	109	149.3979	0.3639	G	1.051	1	9
2	1	Sample001	NM_003980.2_PROBE1		NM_003980		9053		Hs.146388		DISCOVERY	110	253.3235	0.617	G	1.5776	1	10
3	1	Sample001	AY044449_PROBE1		AY044449		NULL		Hs.301242		DISCOVERY	111	376.9439	0.9181	G	1.8513	1	11
4	1	Sample001	NM_005015.1_PROBE1		NM_005015		5018		Hs.151134		DISCOVERY	112	2112.2793	5.1446	G	7.8443	1	12
5	1	Sample001	AB037823_PROBE1		AB037823		54480		Hs.86392		DISCOVERY	113	2860.5891	6.9672	G	11.6142	1	13
6	1	Sample001	NM_032986.1_PROBE1		NM_032986		10483		Hs.173497		DISCOVERY	114	272.215	0.663	G	1.5529	1	14
7	1	Sample001	AB032981_PROBE1		AB032981		57218		Hs.102657		DISCOVERY	116	225.6981	0.5497	G	1.4131	1	16
8	1	Sample001	NM_001907.1_PROBE1		NM_001907		1506		Hs.150601		DISCOVERY	117	315.7597	0.7691	G	1.6992	1	17
9	1	Sample001	AB037745_PROBE1		AB037745		57535		Hs.104696		DISCOVERY	118	56.3103	0.1371	L	0.7039	1	18
10	1	Sample001	NM_007039.1_PROBE1		NM_007039		11099		Hs.155693		DISCOVERY	119	164.9445	0.4017	G	1.0947	1	19

This part of codelink raw data I want to process say by codelink package.

To get the expression value i use exprs() and to get intensity value getSNR(). If i am interested to get the feature ids , which can be seen in the column with header Feature_id (109,110,111 and so on)in the above shown sample data. How to get Feature_id column with the help of codelink package. Thanks in advance.

r codelink • 1.7k views

ADD COMMENT • link 9.0 years ago Agaz Hussain Wani ▴ 260

score 0 · Answer 1 · 2015-04-28

0

Entering edit mode

Diego Diez ▴ 760

@diego-diez-4520

Last seen 3.5 years ago

Japan

If your raw data contains the Feature_id column, readCodelinkSet() should be able to detect it and the information should be available using featureNames() method. One limitation is that all the files need to contain the column. If this is not the case, then codelink will treat the dataset as if Feature_id is missing.

ADD COMMENT • link 9.0 years ago Diego Diez ▴ 760

0

Entering edit mode

If i am not wrong , featureNames() give the first column with header Idx , not the column with header Feature_id

ADD REPLY • link 9.0 years ago Agaz Hussain Wani ▴ 260

0

Entering edit mode

Well, I think you are wrong. Have you tried or are you just guessing? Once you try my advice and if you get anything different then you post the results. Otherwise I think I have already given you enough information and I am still unable to really understand what are your motivations.

ADD REPLY • link 9.0 years ago Diego Diez ▴ 760

score 0 · Answer 2 · 2015-04-28

0

Entering edit mode

Agaz Hussain Wani ▴ 260

@agaz-hussain-wani-7620

Last seen 6.1 years ago

India

May be i am wrong . Let me display my results also.

f = list.files(pattern = "TXT")

f

[1] "GSM108293.TXT"

codset = readCodelinkSet(filename = f[[i]])

** product: Unknown
** chip size: 20469
** reading file 1 of 1 : GSM108293.TXT
+ detected . as decimal symbol.
+ sample Name: GSM108293.TXT
** applying flags to MSR spots ...OK
** computing weights ...OK
** computing SNR ...OK

> head(exprs(codset))
1
1 237.4510
2 566.4462
3 473.6080
4 1976.1409
5 3079.4490
6 404.0833

> head(getSNR(codset))
1
1 0.7156364
2 1.7474178
3 1.4991542
4 7.0496070
5 10.1904870
6 1.0015290

> head(featureNames(codset),20)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"

I think nothing else is left to show.

ADD COMMENT • link 9.0 years ago Agaz Hussain Wani ▴ 260

0

Entering edit mode

I see. Everything I said still applies. In that GEO dataset, not all the files contain the Feature_id column. Check for example: GSM108109.TXT.gz. This is basically what I meant with the inconsistency. Now, if one of the files fails to contain the Feature_id column, then codelink will not use it, because we cannot assume the ordering to be the same. Indeed, I found some datasets in the past with different ordering. There is nothing you can really do. You cannot have Feature_id unless you make strong assumptions and certainly codelink package does not allow it. I will update my answer to reflect this.

ADD REPLY • link 9.0 years ago Diego Diez ▴ 760

0

Entering edit mode

You are absolutely right regarding the column Feature_id missing in some files. I am providing only one data file to process by your codelink package. How it affects a single file having Feature_id present.

If the user want to give many files in which Feature_id column is missing in some files, then codelink should apply the rule of not using the Feature_id column, otherwise the case should be different.

ADD REPLY • link 9.0 years ago Agaz Hussain Wani ▴ 260

0

Entering edit mode

OK, now I get what you mean. After checking the code I remembered the main reason to use Feature_id internally is to ensure proper ordering of the rows. But then the information is not used for the row names in the expression matrix. This is intentional since otherwise different datasets from the same chip would have different row names, which would not be good. However, the Feature_id information was meant to be stored somewhere. In the old Codelink-class it is present in the $id slot. It seems in the CodelinkSet-class I did not include it. So in that regard you were right, thank you. I will send a bug fix for this when I find the time and update the answer then. Out of curiosity, I still do not understand why do you need that info anyway- it is never used in the analysis of data.

ADD REPLY • link 9.0 years ago Diego Diez ▴ 760

0

Entering edit mode

I think that column can be achieved by using CodelinkSet class.

i used

fea <- CodelinkSet(f) ## f is a data file

ids <- fea$id ## this will give the feature ids if present otherwise NA.

As i told you before, it is used for annotation. Off-course we can annotate the date by GEOquery() in an efficient way as explained by you earlier. But if the user have a data taken from GEO , one way of annotating the data is using these feature ids to match in the GPl file to get the annotation(Gene name and Gene symbol and so on) . I am not sure if there are other ways also to do annotation of data present on GEO and if the GEO data is processed by your codelink package.

if the feature ids are not use full at all, then why every data on GEO has feature id.?? they could have easily neglected it and used probe name or probe id instead.

Anyway, If i process the raw data, my interest is to get a data frame containing gene name or symbol, signal intensity, SNR and call for each individual GSM file like below

Gene Symbol Intensity SNR Call

KIAA0484 1.20000 0.0231 A

KIAA0495         21.0032    0.435       A
KIAA0506             2.3433    1.232       P
KIAA0515           11.023843    2.12         P

KIAA0556 10.21 3.21 P
KIAA0586 9.21 2.00 P

After the preparation of data like this for all the GSM files ,i go for meta-analysis .

This is your code as you posted for my earlier question

eset <- getGEO("GSE4797")
eset <- eset[[1]]
head(fData(eset))
# do preprocessing, e.g. with limma matrix methods. For example:
y <- backgroundCorrect.matrix(eset, offset = 50, method = "normexp")
y <- normalizeBetweenArrays(y, method = "quantile")
exprs(eset) <- y

I can get the expression from the above code .Then what about SNR here?? call, i can decide on the basis of SNR value. Annotation can be done by using GEO query as already well explained. Thanks

ADD REPLY • link 9.0 years ago Agaz Hussain Wani ▴ 260

0

Entering edit mode

Hussain,

You cannot do CodelinkSet(f), there is no function in the codelink package called CodelinkSet().
If fea is a CodelinkSet object, you cannot do fea$id, that only works in a Codelink object (as I've explained to you already).
Maybe you can get annotations from GEO using the Feature_id info. I do not know. Are these annotations up-to-date? Bioconductor annotations are updated every six months and you can get them with the Probe_name information, even if you obtain the data using GEOquery (as I have explained to you already).
To know more about how to use the package please read the vignette.
I will update the answer to this post once I send a fix.

Good luck.

ADD REPLY • link 9.0 years ago Diego Diez ▴ 760

0

Entering edit mode

Sorry for typo mistake, its actually readCodelink(). It serves the purpose of getting the Feature_id if present in the file unlike readCodelinkSet(), which was supposed to do this job. Off-course annotations are updated, but may be not after regular intervals.

ADD REPLY • link 9.0 years ago Agaz Hussain Wani ▴ 260