I just tried to use codelink package to process raw data. I managed to get the expression and SNR values, however i didn't get the Probe ids. I used the below code to load two sample files.
f = list.files(pattern = "TXT") codset = readCodelinkSet(filename = f)
I get the expression values by exprs(codset) and SNR value by getSNR(codset). I tried probNames(), which seem to be like
[1] "NM_012429.1_PROBE1" "NM_003980.2_PROBE1" "AY044449_PROBE1" [4] "NM_005015.1_PROBE1" "AB037823_PROBE1" but not like IDs. 1001,1002 and so on
I think i am missing some thing here
pdata = read.AnnotatedDataFrame("targets.txt") codset = readCodelinkSet(filename = pdata$FileName, phenoData = pdata)
What is targets.txt and how to prepare it. How to get the ids
I think you are mixing two questions here. The targets.txt file contains information for each sample (such as treatments, clinical information, etc) not probes. It is not necessary to have it but it somewhat simplifies things later. You can take a look at the information in the limma user's guide, section 4.3 (The Targets Frame) to know more about it. I should probably add a bit of information regarding that in the codelink vignette. Regarding the probe names, the ids use for the probes may depend on how the data was exported. Are you using your own data or downloaded from GEO? Can you show me a few lines of one of your data files? Are there any warnings or other information shown when reading the data?
Mab be you are talking right about the mixture of two questions. But the documentation of code-link (reference) drive me a bit confused, as i didn't find much information about targets.txt. So it would be great to have some addition.
Regarding the codelink raw data, which is downloaded from GEO, i get some warnings but the data is loaded and expressions and SNR values are drawn smoothly.
Here is some data from one file.
CodeLink Expression Analysis 4.0.0.23520
IHF Report for Slide (T00245776)
LAYOUT EXP10336X2-770.22.ID
PROJECT
EXPERIMENT
Report( 1 ): Human 20K
--------------------------------------------------------------------------------
Idx Sample_name Probe_name Annotation_NCBI_Acc Annotation_LocusLink Annotation_OGS Annotation_UniGene Annotation_ENSEMBL Probe_type Raw_intensity Quality_flag Signal_strength Logical_row Logical_col Center_X Center_Y Bkgd_mean Array Annotation_PIN Normalized_intensity Spot_mean Spot_median Spot_stdev Spot_area Spot_diameter Spot_noise_level Bkgd_median Bkgd_stdev Bkgd_area Annotation_Molecular_Function Annotation_Biological_Process Annotation_Cellular_Component Annotation_Cytoband Annotation_HS_Homology Annotation_MM_Homology Annotation_RN_Homology Annotation_Analogous_CodeLink Annotation_Legacy_Probe_Name Description
1 Sample001 NM_012429.1_PROBE1 NM_012429 23541 Hs.277728 DISCOVERY 146,7679 CL 0,7381 1 9 391 211 261,3496 1 NM_012429 0,1362 360,7679 314,0000 202,0388 114 12,0477924346924 488,758895571019 214,0000 183,1726296
I load it by
f = list.files(pattern = "TXT")
and then apply
codset = readCodelinkSet(filename = f[[i]])
codset = codCorrect(codset, method = "half", offset = 0)
codset = codNormalize(codset, method = "quantile")
exprs <- exprs(codset)
snr <- getSNR(codset)
dat <- cbind(exprs,snr) ## to combine exprs and snr
Output
** product: Unknown
** chip size: 20469
** reading file 1 of 1 : GSM108109.TXT
+ detected , as decimal symbol.
+ sample Name: GSM108109.TXT
** applying flags to MSR spots ...OK
** computing weights ...OK
** computing SNR ...OK
1 1
1 7.197392657 0.7381305998
2 12.403618142 13.7535937689
3 8.749735373 1.7720060283
4 11.373505481 6.7607981511
5 11.348655741 6.5203397924
6 9.734132935 2.2048101666
About column 1 i am not sure whether ID or Serial Number. Column 2 is intensity followed by SNR value.
So in codelink data i have see the ID looks like 1001,102,109 or something like this. So what is ID in the above result. ?? Is it missing?? Is the first column ID??
Here are some warnings i get
Warning messages:
1: In .readCodelinkRaw(files = filename, ...) :
'readCodelink' and 'readCodelinkSet' do not convert intensities to NA based on flags anymore, except for spots flagged as 'M' (MSR spot). Instead, createWeights() is used to assign weights. These weights can be used during normalization and linear modeling. To obtain the old behavior use parameter 'old=TRUE' (weights will be created anyway).
2: In .readCodelinkRaw(files = filename, ...) :
Product type for GSM108109.TXT is unknown (missing PRODUCT field in header).
Here is some more information from my results: At last Annotation is blank
CodelinkSet (storageMode: lockedEnvironment)
assayData: 20469 features, 1 samples
element names: background, exprs, flag, snr, weight
protocolData: none
phenoData
rowNames: 1
varLabels: sample
varMetadata: labelDescription
featureData
featureNames: 1 2 ... 20469 (20469 total)
fvarLabels: probeName probeType ... meanSNR (5 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
There is a distinction between feature and probe. Features are unique locations in the array containing a particular probe type (the "spots" to say it plainly). Probe is just an oligonucleotide sequence. Different features can contain the same probe. So when you do probeNames() you are obtaining the probes, not the features, which you can obtain with featureNames(). Anyway, the reason you do not see the features codes in the your data as in the vignette example is because that information is not present in your files. Unfortunately the codelink format is not strictly enforced and different fields can be found depending on how it was exported. If the Feature_ID field is not found then that information is not included. In principle this does not cause any problem (so far, no one has complained about problems with this, and myself did not find any).
Thanks for your comment. But i think the feature ID is important to annotate the data set. If i need to annotate each probe of the processed data by the annotation file(GPL), usually the ID's are matched, that is what i do. May be the approach is different there.
Like in affymetrix data, the probes are annotated from the GPl file by ID only. I think same should be the case here also. Anyway thank U so much for providing information.
That is actually not correct. The feature ID is only relevant to the position in the array. The probe ID, that is the oligonucleotide sequence, is what is linked to any annotation. The annotation packages for codelink, which I also update every release, are linked to the probe ids. You can use those for the annotations. It has been almost 10 years since the original release of the codelink and annotation packages and no one has found any problem in that regard. So please, give it a try first!
Sory I missed it here. Actually i am talking about probe ID's. So U also agree that probe ID's are must. If possible please give some more explanation of your package, that would really be nice. Still i am confused about the probe IDs which i didn't get from the data. As per you, it totally depends on the data given, if i am not wrong. So if the probe ID's are not present, i think we can not annotate the data at-least according to the way how i do.
Please read again my comments. As I said there are two concepts, feature and probe. You are not getting the original feature IDs, but thats OK. You are getting the probe ids (you showed us that with probeNames()) and so there is not problem at all with your code or with the package. Maybe the confusion is that the CodelinkSet object shows featureNames as 1,2,3.... Well, those are the feature IDs. The reason they are called featureNames is because CodelinkSet inherits from ExpressionSet (from Biobase) where they are defined that way. Either way, as I mentioned before, you get the feature IDs using featureNames() and you get the probe ids with probeNames() (and you have used successfully this in your original post). So, if you still have some confusion could you please explain in detail what exactly are you expecting to see? Thank you.
Thanks for explaining. I will looking again into the package to clear my confusion. Thanks a lot.
Hello Dear Diego Diez, with continuation to the above discussion, i got some processed data from GEO, which seems to have four columns. A part of that is shown below. The data has four columns but in the above code i got three columns. Can u please tell me what i missed above to get the second column. Thanks
I am sorry but I am not sure I follow your question. Where in the above code do you have 4 columns? Where did you get the data pasted? I have never seen anything like that data. Remember that the processed data in GEO does not necessarily need to have been obtained using the codelink package. The codelink package does not work with processed GEO data (in terms of reading, and preprocessing of course). Please, either: 1) follow the instructions in the vignette and let me know if there is anything specific to those that you do not understand, 2) use instead GEOquery package to obtain the GEO expression matrix for the GEO entry, 3) or tell us exactly what you are attempting to do, what you expect to find and what you are finding instead. From the above code and data I cannot make a link between what I see and what your problem seems to be.
I think annotation package need to be used to get the second column , which seems to give probe ID (second column). I looked on this link
https://stat.ethz.ch/pipermail/bioconductor/2006-July/013515.html which is yours indeed.
I am trying to annotate the probe IDs with the gene name from the GPL file. For that matter i need to have the second column. I use affymetrix package also which directly give me probe IDs which further i can annotate with the gene name. But here i am bit confused how to annotate probe ids(which are missing) with the gene name.
A part of GPL file is below. I annotate each probe ID of GSM file with the probe ID of GPL file to get the gene name or gene symbol, which is shown below.
Again, I am getting very confused about what exactly is your problem since your comment now seems unrelated to the discussion above. Please, could you give the the GEO id for the data you want to use?
Ok. I want to process the raw data of GEO ID GSM108263 by your codelink package and want to get the first column as shown on GEO, named as ID_REF.
But, I already told you that you do not need that column. You need the probe name or probe id (synonyms), no the feature id. Anyway, it seems there is an additional problem with that dataset. You can of course load it and preprocess it in R as mentioned in the vignette:
The problem is that those probe names are legacy probe names, meaning, they do not match the probe names used in annotation packages. Probably this chips were processed with an old version of the Codelink software. I have the mapping information in the original annotation files but these are not included in the annotations (and not sure how that could be done). An easier, faster way is to use GEOquery package as it seems GEO internally has mapped the probe names from the old to new style:
eset <- getGEO("GSE4797") eset <- eset[[1]] head(fData(eset)) # do preprocessing, e.g. with limma matrix methods. For example: y <- backgroundCorrect.matrix(eset, offset = 50, method = "normexp") y <- normalizeBetweenArrays(y, method = "quantile") exprs(eset) <- y # here we choose 10 probes at random- but this will come from your statistical analysis. probes.sel <- as.character(fData(eset)$PROBE_NAME[1:10]) > select(h20kcod.db, keys = probes.sel, columns = c("SYMBOL", "ENTREZID", "GENENAME"), ) PROBEID SYMBOL ENTREZID GENENAME 1 GE53815 SEC14L2 23541 SEC14-like 2 (S. cerevisiae) 2 GE59986 MAP7 9053 microtubule-associated protein 7 3 GE53908 ALPK3 57538 alpha-kinase 3 4 GE60064 OXA1L 5018 oxidase (cytochrome c) assembly 1-like 5 GE53952 CHPF2 54480 chondroitin polymerizing factor 2 6 GE60188 SEC23B 10483 Sec23 homolog B (S. cerevisiae) 7 GE53801 PAIP2B 400961 poly(A) binding protein interacting protein 2B 8 GE59978 CTRL 1506 chymotrypsin-like 9 GE53905 KIAA1324 57535 KIAA1324 10 GE60052 PTPN21 11099 protein tyrosine phosphatase, non-receptor type 21 > <font face="sans-serif, Arial, Verdana, Trebuchet MS"> </font>
Thanks a lot for your nice explanation. The way of your doing is pretty good to annotate the probe ID from Geo-query to get gene names and other information. And its clear for me at the end that there is no need to have that column, that is really nice.
But the GEO data drive me very confusing (especially the ID-REF column) how to get that. All the data in GEO has same formate as i mentioned above. May be this is some reference they are using for annotation. But its totally confusing me. The ID-REF column is present in both the GSM(samples) and GPL(annotation file).
So its clear now that the column(ID-REF) shown above is not achievable by your codelink package???.
Thanks for your comments.
It is possible the ID_REF refers to the Feature_ID mentioned above. If it is indeed Feature_ID you could get it with
featureNames(ecod)
, as I said before. But, as I said before, it is not always exported- and it is not in the raw codelink files found in your dataset. Anyway, in terms of the R workflow it is (almost) useless.It is possible the ID_REF refers to the Feature_ID mentioned above. If it is indeed Feature_ID you could get it with
featureNames(ecod)
, as I said before. But, as I said before, it is not always exported- and it is not in the raw codelink files found in your dataset. Anyway, in terms of the R workflow it is (almost) useless.Thanks a lot.
I feel very embarrassed and awkward to ask you the same question again and again and i am very sorry for that. Really a hard time for me , as i am interested to get the featureNames() in the GEO data format, but not getting any way.
I checked featureNames(ecod) as per your recommendation, but the featureNames() start with 1,2,3 not from 109 or so as given in the above data frame with four columns. May be i am foolish to ask this!!.
Again i went through your codelink package. There are two example of data
1) data(codset). In data(codset) the featureNames() start from 1001, 1002,1003 and so on. May i know why?? why not from 1,2 and so on
2) data(codelink.example). You change it to newer version by tmp = Codelink2CodelinkSet(codelink.example) and then the featureNames() start from 1,2 and so on. Is it the formate in new version of codelink.
will you please let me know the difference between the featureNames() starting with 1,2,3 and with 109,110 and so on. Does these vary for new to older???
Then again, why are you interested in getting the featureNames() in the GEO data format if we already agreed they are not useful?
The reason you do not get them, as I explained to you before, is because that information IS NOT found in the raw files. Because it is not found,
codelink
uses the ordering in the files to set the row names in the expression matrix, which become what you get with featureNames(), i.e. 1, 2, 3, etc. How then in GEO they know the feature IDs? Well, my guess is that they may assume the ordering in the Codelink file preserves some original ordering and assign them automagically. Because the Codelink format is not strictly enforced and may be inconsistent I prefer not to make that assumption, hence I do not assign information that does not exists in the original file. Maybe there is a better way to handle it but so far this is how it is. I think this is well enough explained.A separate issue is that you are throwing a lot of different questions unrelated to the original posted question. This thwarts the purpose of this support site, where ideally you would write a separate post for each question, so that specific answers can be searched for and found efficiently. Please, keep that in mind.
At any rate, I have added a response to the original question in your post.
Thanks again for explaining in a better way.