How to prepare targets.txt file in codelink.
2
2
Entering edit mode
@agaz-hussain-wani-7620
Last seen 3.0 years ago
India

I just tried to use codelink package to process raw data. I managed to get the expression and SNR values, however i didn't get the Probe ids. I used the below code to load two sample files.

f = list.files(pattern = "TXT")

I get the expression values by exprs(codset) and SNR value by getSNR(codset). I tried probNames(), which seem to be like
[1] "NM_012429.1_PROBE1" "NM_003980.2_PROBE1" "AY044449_PROBE1"

[4] "NM_005015.1_PROBE1" "AB037823_PROBE1" but not like IDs. 1001,1002 and so on

I think i am missing some thing here

codset = readCodelinkSet(filename = pdata$FileName, phenoData = pdata) What is targets.txt and how to prepare it. How to get the ids r codelink • 2.4k views ADD COMMENT 0 Entering edit mode I think you are mixing two questions here. The targets.txt file contains information for each sample (such as treatments, clinical information, etc) not probes. It is not necessary to have it but it somewhat simplifies things later. You can take a look at the information in the limma user's guide, section 4.3 (The Targets Frame) to know more about it. I should probably add a bit of information regarding that in the codelink vignette. Regarding the probe names, the ids use for the probes may depend on how the data was exported. Are you using your own data or downloaded from GEO? Can you show me a few lines of one of your data files? Are there any warnings or other information shown when reading the data? ADD REPLY 0 Entering edit mode Mab be you are talking right about the mixture of two questions. But the documentation of code-link (reference) drive me a bit confused, as i didn't find much information about targets.txt. So it would be great to have some addition. Regarding the codelink raw data, which is downloaded from GEO, i get some warnings but the data is loaded and expressions and SNR values are drawn smoothly. Here is some data from one file. CodeLink Expression Analysis 4.0.0.23520 IHF Report for Slide (T00245776) LAYOUT EXP10336X2-770.22.ID PROJECT EXPERIMENT Report( 1 ): Human 20K -------------------------------------------------------------------------------- Idx Sample_name Probe_name Annotation_NCBI_Acc Annotation_LocusLink Annotation_OGS Annotation_UniGene Annotation_ENSEMBL Probe_type Raw_intensity Quality_flag Signal_strength Logical_row Logical_col Center_X Center_Y Bkgd_mean Array Annotation_PIN Normalized_intensity Spot_mean Spot_median Spot_stdev Spot_area Spot_diameter Spot_noise_level Bkgd_median Bkgd_stdev Bkgd_area Annotation_Molecular_Function Annotation_Biological_Process Annotation_Cellular_Component Annotation_Cytoband Annotation_HS_Homology Annotation_MM_Homology Annotation_RN_Homology Annotation_Analogous_CodeLink Annotation_Legacy_Probe_Name Description 1 Sample001 NM_012429.1_PROBE1 NM_012429 23541 Hs.277728 DISCOVERY 146,7679 CL 0,7381 1 9 391 211 261,3496 1 NM_012429 0,1362 360,7679 314,0000 202,0388 114 12,0477924346924 488,758895571019 214,0000 183,1726296 I load it by f = list.files(pattern = "TXT") and then apply codset = readCodelinkSet(filename = f[[i]]) codset = codCorrect(codset, method = "half", offset = 0) codset = codNormalize(codset, method = "quantile") exprs <- exprs(codset) snr <- getSNR(codset) dat <- cbind(exprs,snr) ## to combine exprs and snr Output ** product: Unknown ** chip size: 20469 ** reading file 1 of 1 : GSM108109.TXT + detected , as decimal symbol. + sample Name: GSM108109.TXT ** applying flags to MSR spots ...OK ** computing weights ...OK ** computing SNR ...OK 1 1 1 7.197392657 0.7381305998 2 12.403618142 13.7535937689 3 8.749735373 1.7720060283 4 11.373505481 6.7607981511 5 11.348655741 6.5203397924 6 9.734132935 2.2048101666 About column 1 i am not sure whether ID or Serial Number. Column 2 is intensity followed by SNR value. So in codelink data i have see the ID looks like 1001,102,109 or something like this. So what is ID in the above result. ?? Is it missing?? Is the first column ID?? Here are some warnings i get Warning messages: 1: In .readCodelinkRaw(files = filename, ...) : 'readCodelink' and 'readCodelinkSet' do not convert intensities to NA based on flags anymore, except for spots flagged as 'M' (MSR spot). Instead, createWeights() is used to assign weights. These weights can be used during normalization and linear modeling. To obtain the old behavior use parameter 'old=TRUE' (weights will be created anyway). 2: In .readCodelinkRaw(files = filename, ...) : Product type for GSM108109.TXT is unknown (missing PRODUCT field in header). Here is some more information from my results: At last Annotation is blank CodelinkSet (storageMode: lockedEnvironment) assayData: 20469 features, 1 samples element names: background, exprs, flag, snr, weight protocolData: none phenoData rowNames: 1 varLabels: sample varMetadata: labelDescription featureData featureNames: 1 2 ... 20469 (20469 total) fvarLabels: probeName probeType ... meanSNR (5 total) fvarMetadata: labelDescription experimentData: use 'experimentData(object)' Annotation: ADD REPLY 0 Entering edit mode There is a distinction between feature and probe. Features are unique locations in the array containing a particular probe type (the "spots" to say it plainly). Probe is just an oligonucleotide sequence. Different features can contain the same probe. So when you do probeNames() you are obtaining the probes, not the features, which you can obtain with featureNames(). Anyway, the reason you do not see the features codes in the your data as in the vignette example is because that information is not present in your files. Unfortunately the codelink format is not strictly enforced and different fields can be found depending on how it was exported. If the Feature_ID field is not found then that information is not included. In principle this does not cause any problem (so far, no one has complained about problems with this, and myself did not find any). ADD REPLY 0 Entering edit mode Thanks for your comment. But i think the feature ID is important to annotate the data set. If i need to annotate each probe of the processed data by the annotation file(GPL), usually the ID's are matched, that is what i do. May be the approach is different there. Like in affymetrix data, the probes are annotated from the GPl file by ID only. I think same should be the case here also. Anyway thank U so much for providing information. ADD REPLY 0 Entering edit mode That is actually not correct. The feature ID is only relevant to the position in the array. The probe ID, that is the oligonucleotide sequence, is what is linked to any annotation. The annotation packages for codelink, which I also update every release, are linked to the probe ids. You can use those for the annotations. It has been almost 10 years since the original release of the codelink and annotation packages and no one has found any problem in that regard. So please, give it a try first! ADD REPLY 0 Entering edit mode Sory I missed it here. Actually i am talking about probe ID's. So U also agree that probe ID's are must. If possible please give some more explanation of your package, that would really be nice. Still i am confused about the probe IDs which i didn't get from the data. As per you, it totally depends on the data given, if i am not wrong. So if the probe ID's are not present, i think we can not annotate the data at-least according to the way how i do. ADD REPLY 0 Entering edit mode Please read again my comments. As I said there are two concepts, feature and probe. You are not getting the original feature IDs, but thats OK. You are getting the probe ids (you showed us that with probeNames()) and so there is not problem at all with your code or with the package. Maybe the confusion is that the CodelinkSet object shows featureNames as 1,2,3.... Well, those are the feature IDs. The reason they are called featureNames is because CodelinkSet inherits from ExpressionSet (from Biobase) where they are defined that way. Either way, as I mentioned before, you get the feature IDs using featureNames() and you get the probe ids with probeNames() (and you have used successfully this in your original post). So, if you still have some confusion could you please explain in detail what exactly are you expecting to see? Thank you. ADD REPLY 0 Entering edit mode Thanks for explaining. I will looking again into the package to clear my confusion. Thanks a lot. ADD REPLY 0 Entering edit mode Hello Dear Diego Diez, with continuation to the above discussion, i got some processed data from GEO, which seems to have four columns. A part of that is shown below. The data has four columns but in the above code i got three columns. Can u please tell me what i missed above to get the second column. Thanks  ID SignalIntensity SNR 1 109 7.5804349594 1.3421827372 2 110 11.2698560261 9.7509784076 3 111 8.6006358676 1.804853485 4 112 9.5435575539 3.2013768505 5 113 10.181235138 4.6459936834 6 114 8.1168949395 1.4226360984 7 116 8.4670495571 1.8682653573 8 117 8.2733657566 1.6663830675 9 118 5.4203308081 0.6769947295 10 119 8.5379673202 1.8326649796 11 120 9.7949416719 4.0503974531 12 121 5.9770829707 0.7442132887 13 123 5.7623248946 0.7208574513 14 124 5.3253576296 0.6793816015 15 125 8.3676877563 1.6437990197 ADD REPLY 0 Entering edit mode I am sorry but I am not sure I follow your question. Where in the above code do you have 4 columns? Where did you get the data pasted? I have never seen anything like that data. Remember that the processed data in GEO does not necessarily need to have been obtained using the codelink package. The codelink package does not work with processed GEO data (in terms of reading, and preprocessing of course). Please, either: 1) follow the instructions in the vignette and let me know if there is anything specific to those that you do not understand, 2) use instead GEOquery package to obtain the GEO expression matrix for the GEO entry, 3) or tell us exactly what you are attempting to do, what you expect to find and what you are finding instead. From the above code and data I cannot make a link between what I see and what your problem seems to be. ADD REPLY 0 Entering edit mode I think annotation package need to be used to get the second column , which seems to give probe ID (second column). I looked on this link https://stat.ethz.ch/pipermail/bioconductor/2006-July/013515.html which is yours indeed. I am trying to annotate the probe IDs with the gene name from the GPL file. For that matter i need to have the second column. I use affymetrix package also which directly give me probe IDs which further i can annotate with the gene name. But here i am bit confused how to annotate probe ids(which are missing) with the gene name. A part of GPL file is below. I annotate each probe ID of GSM file with the probe ID of GPL file to get the gene name or gene symbol, which is shown below.  109 SEC14-like 2 (S. cerevisiae) 110 microtubule-associated protein 7 111 alpha-kinase 3 112 oxidase (cytochrome c) assembly 1-like 113 chondroitin sulfate glucuronyltransferase 114 Sec23 homolog B (S. cerevisiae) 115 116 poly(A) binding protein interacting protein 2B 117 chymotrypsin-like 118 KIAA1324 119 protein tyrosine phosphatase, non-receptor type 21 120 membrane-associated ring finger (C3HC4) 4 ADD REPLY 0 Entering edit mode Again, I am getting very confused about what exactly is your problem since your comment now seems unrelated to the discussion above. Please, could you give the the GEO id for the data you want to use? ADD REPLY 0 Entering edit mode Ok. I want to process the raw data of GEO ID GSM108263 by your codelink package and want to get the first column as shown on GEO, named as ID_REF. ADD REPLY 0 Entering edit mode But, I already told you that you do not need that column. You need the probe name or probe id (synonyms), no the feature id. Anyway, it seems there is an additional problem with that dataset. You can of course load it and preprocess it in R as mentioned in the vignette: > library(codelink) # get files in the current directory. > f <- list.files(pattern = "gz") > f > head(f) [1] "GSM108109.TXT.gz" "GSM108221.TXT.gz" "GSM108225.TXT.gz" "GSM108228.TXT.gz" [5] "GSM108231.TXT.gz" "GSM108235.TXT.gz" # read data. > eraw <- readCodelinkSet(f) > eraw # this is required since codelink cannot guess the chip type, see the warnings upon loading data regarding missing Product information. > annotation(eraw) <- "h20kcod.db" # preprocessing data. > eraw > codPlot(eraw,what = "density") > ecod <- codPreprocess(eraw, bg.method = "normexp") > codPlot(ecod,what = "density") > codPlot(ecod) # check feature information. > head(fData(ecod)) probeName probeType logicalRow logicalCol meanSNR 1 NM_012429.1_PROBE1 DISCOVERY 1 9 0.9594485 2 NM_003980.2_PROBE1 DISCOVERY 1 10 6.5453198 3 AY044449_PROBE1 DISCOVERY 1 11 1.8873607 4 NM_005015.1_PROBE1 DISCOVERY 1 12 5.0572757 5 AB037823_PROBE1 DISCOVERY 1 13 6.6339932 6 NM_032986.1_PROBE1 DISCOVERY 1 14 1.4742470 # get probe names. > probeNames(ecod)[1:10] [1] "NM_012429.1_PROBE1" "NM_003980.2_PROBE1" "AY044449_PROBE1" [4] "NM_005015.1_PROBE1" "AB037823_PROBE1" "NM_032986.1_PROBE1" [7] "AB032981_PROBE1" "NM_001907.1_PROBE1" "AB037745_PROBE1" [10] "NM_007039.1_PROBE1" The problem is that those probe names are legacy probe names, meaning, they do not match the probe names used in annotation packages. Probably this chips were processed with an old version of the Codelink software. I have the mapping information in the original annotation files but these are not included in the annotations (and not sure how that could be done). An easier, faster way is to use GEOquery package as it seems GEO internally has mapped the probe names from the old to new style: eset <- getGEO("GSE4797") eset <- eset[[1]] head(fData(eset)) # do preprocessing, e.g. with limma matrix methods. For example: y <- backgroundCorrect.matrix(eset, offset = 50, method = "normexp") y <- normalizeBetweenArrays(y, method = "quantile") exprs(eset) <- y # here we choose 10 probes at random- but this will come from your statistical analysis. probes.sel <- as.character(fData(eset)$PROBE_NAME[1:10])
> select(h20kcod.db, keys = probes.sel, columns = c("SYMBOL", "ENTREZID", "GENENAME"), )
PROBEID   SYMBOL ENTREZID                                           GENENAME
1  GE53815  SEC14L2    23541                       SEC14-like 2 (S. cerevisiae)
2  GE59986     MAP7     9053                   microtubule-associated protein 7
3  GE53908    ALPK3    57538                                     alpha-kinase 3
4  GE60064    OXA1L     5018             oxidase (cytochrome c) assembly 1-like
5  GE53952    CHPF2    54480                  chondroitin polymerizing factor 2
6  GE60188   SEC23B    10483                    Sec23 homolog B (S. cerevisiae)
7  GE53801   PAIP2B   400961     poly(A) binding protein interacting protein 2B
8  GE59978     CTRL     1506                                  chymotrypsin-like
9  GE53905 KIAA1324    57535                                           KIAA1324
10 GE60052   PTPN21    11099 protein tyrosine phosphatase, non-receptor type 21
> <font face="sans-serif, Arial, Verdana, Trebuchet MS">
</font>
0
Entering edit mode

Thanks a lot for your nice explanation. The way of your doing is pretty good to annotate the probe ID from Geo-query to get gene names and other information. And its clear for me at the end that there is no need to have that column, that is really nice.

But the GEO data drive me very confusing (especially the ID-REF column) how to get that. All the data in GEO has same formate as i mentioned above. May be this is some reference they are using for annotation. But its totally confusing me. The ID-REF column is present in both the GSM(samples) and GPL(annotation file).

So its clear now that the column(ID-REF) shown above is not achievable by your codelink package???.

0
Entering edit mode

It is possible the ID_REF refers to the Feature_ID mentioned above. If it is indeed Feature_ID you could get it with featureNames(ecod), as I said before. But, as I said before, it is not always exported- and it is not in the raw codelink files found in your dataset. Anyway, in terms of the R workflow it is (almost) useless.

0
Entering edit mode

It is possible the ID_REF refers to the Feature_ID mentioned above. If it is indeed Feature_ID you could get it with featureNames(ecod), as I said before. But, as I said before, it is not always exported- and it is not in the raw codelink files found in your dataset. Anyway, in terms of the R workflow it is (almost) useless.

0
Entering edit mode

Thanks a lot.

0
Entering edit mode

I feel very embarrassed and awkward to ask you the same question again and again and i am very sorry for that.  Really a hard time for me , as i am interested to get the featureNames() in the GEO data format, but not getting any way.

I checked featureNames(ecod) as per your recommendation, but the featureNames() start with 1,2,3 not from 109 or so as given in the above data frame with four columns. May be i am foolish to ask this!!.

Again i went through your codelink package. There are two example of data

1) data(codset). In data(codset) the featureNames() start from 1001, 1002,1003 and so on. May i know why?? why not from 1,2 and so on

will you please let me know the difference between the featureNames() starting with 1,2,3 and with 109,110 and so on.  Does these vary for new to older???

0
Entering edit mode

Then again, why are you interested in getting the featureNames() in the GEO data format if we already agreed they are not useful?

The reason you do not get them, as I explained to you before, is because that information IS NOT found in the raw files. Because it is not found, codelink uses the ordering in the files to set the row names in the expression matrix, which become what you get with featureNames(), i.e. 1, 2, 3, etc. How then in GEO they know the feature IDs? Well, my guess is that they may assume the ordering in the Codelink file preserves some original ordering and assign them automagically. Because the Codelink format is not strictly enforced and may be inconsistent I prefer not to make that assumption, hence I do not assign information that does not exists in the original file. Maybe there is a better way to handle it but so far this is how it is. I think this is well enough explained.

A separate issue is that you are throwing a lot of different questions unrelated to the original posted question. This thwarts the purpose of this support site, where ideally you would write a separate post for each question, so that specific answers can be searched for and found efficiently. Please, keep that in mind.

0
Entering edit mode

At any rate, I have added a response to the original question in your post.

0
Entering edit mode

Thanks again for explaining in a better way.

1
Entering edit mode
Diego Diez ▴ 750
@diego-diez-4520
Last seen 6 months ago
Japan

A targets.txt file is just a text file with each row representing a microarray file. I believe the concept was first introduced in the limma package, and it is explained in its user guide (where it is called targets frame). The columns represent information related to each microarray file, and may include the file name itself, the sample name, and other technical (e.g. date of hybridization), experimental (e.g. control, treatment) or phenotypical (e.g. wild type, KO) information. Any information related to the sample hybridized to the microarray can go into the target file. An example targets.txt file may look like this (were FileName, SampleName and Treatment variables are included):

FileName SampleName Treatment
File1.txt S1 Control
File2.txt S2 Control
File3.txt S3 Treatment1
File4.txt S4 Treatment1

This information can be assigned to a phenoData-class object in objects of the class ExpressionSet. Since the CodelinkSet class inherits ExpressionSet, the same logic applies for objects of class CodelinkSet. For example, as explained in the codelink vignette, the array information can be loaded first from a targets frame and assigned to the CodelinkSet object created:

0
Entering edit mode

Again thanks for answering to my original post.

0
Entering edit mode
@agaz-hussain-wani-7620
Last seen 3.0 years ago
India

0
Entering edit mode