How to prepare targets.txt file in codelink.
2
2
Entering edit mode
@agaz-hussain-wani-7620
Last seen 6.6 years ago
India

I just tried to use codelink package to process raw data. I managed to get the expression and SNR values, however i didn't get the Probe ids. I used the below code to load two sample files.

f = list.files(pattern = "TXT")

codset = readCodelinkSet(filename = f)

 

I get the expression values by exprs(codset) and SNR value by getSNR(codset). I tried probNames(), which seem to be like 
[1] "NM_012429.1_PROBE1" "NM_003980.2_PROBE1" "AY044449_PROBE1"

[4] "NM_005015.1_PROBE1" "AB037823_PROBE1" but not like IDs. 1001,1002 and so on

I think i am missing some thing here 

 

pdata = read.AnnotatedDataFrame("targets.txt")

codset = readCodelinkSet(filename = pdata$FileName, phenoData = pdata)

What is targets.txt and how to prepare it. How to get the ids

r codelink • 4.8k views
ADD COMMENT
0
Entering edit mode

I think you are mixing two questions here. The targets.txt file contains information for each sample (such as treatments, clinical information, etc) not probes. It is not necessary to have it but it somewhat simplifies things later. You can take a look at the information in the limma user's guide, section 4.3 (The Targets Frame) to know more about it. I should probably add a bit of information regarding that in the codelink vignette. Regarding the probe names, the ids use for the probes may depend on how the data was exported. Are you using your own data or downloaded from GEO? Can you show me a few lines of one of your data files? Are there any warnings or other information shown when reading the data?

ADD REPLY
0
Entering edit mode

Mab be you are talking right about the mixture of two questions. But the documentation of code-link (reference) drive me a bit confused, as i didn't find much information about targets.txt. So it would be great to have some addition.

Regarding the codelink raw data, which is downloaded from GEO, i get  some warnings but the data is loaded and expressions and SNR values are drawn smoothly.

Here is some data from one file.

CodeLink Expression Analysis 4.0.0.23520
IHF Report for Slide (T00245776)
LAYOUT    EXP10336X2-770.22.ID
PROJECT   
EXPERIMENT   
Report( 1 ): Human 20K
--------------------------------------------------------------------------------
Idx    Sample_name    Probe_name    Annotation_NCBI_Acc    Annotation_LocusLink    Annotation_OGS    Annotation_UniGene    Annotation_ENSEMBL    Probe_type    Raw_intensity    Quality_flag    Signal_strength    Logical_row    Logical_col    Center_X    Center_Y    Bkgd_mean    Array    Annotation_PIN    Normalized_intensity    Spot_mean    Spot_median    Spot_stdev    Spot_area    Spot_diameter    Spot_noise_level    Bkgd_median    Bkgd_stdev    Bkgd_area    Annotation_Molecular_Function    Annotation_Biological_Process    Annotation_Cellular_Component    Annotation_Cytoband    Annotation_HS_Homology    Annotation_MM_Homology    Annotation_RN_Homology    Annotation_Analogous_CodeLink    Annotation_Legacy_Probe_Name    Description
1    Sample001    NM_012429.1_PROBE1    NM_012429    23541        Hs.277728        DISCOVERY    146,7679    CL    0,7381    1    9    391    211    261,3496    1    NM_012429    0,1362    360,7679    314,0000    202,0388    114    12,0477924346924    488,758895571019    214,0000    183,1726296                                  

I load it by

f = list.files(pattern = "TXT")

and then apply

codset = readCodelinkSet(filename = f[[i]])
  codset = codCorrect(codset, method = "half", offset = 0)
  codset = codNormalize(codset, method = "quantile")
exprs <- exprs(codset)
snr  <- getSNR(codset)
dat <- cbind(exprs,snr) ## to combine exprs and snr

Output

** product: Unknown
** chip size: 20469
** reading file 1 of 1 : GSM108109.TXT
  + detected , as decimal symbol.
  + sample Name: GSM108109.TXT
** applying flags to MSR spots ...OK
** computing weights ...OK
** computing SNR ...OK
             1             1
1  7.197392657  0.7381305998
2 12.403618142 13.7535937689
3  8.749735373  1.7720060283
4 11.373505481  6.7607981511
5 11.348655741  6.5203397924
6  9.734132935  2.2048101666

About column 1 i am not sure whether ID or Serial Number. Column 2 is intensity followed by SNR value.

So in codelink data i have see the ID looks like 1001,102,109 or something like this. So what is ID in the above result. ?? Is it missing?? Is the first column ID??

Here are some warnings i get

Warning messages:
1: In .readCodelinkRaw(files = filename, ...) :
  'readCodelink' and 'readCodelinkSet' do not convert intensities to NA based on flags anymore, except for spots flagged as 'M' (MSR spot). Instead, createWeights() is used to assign weights. These weights can be used during normalization and linear modeling. To obtain the old behavior use parameter 'old=TRUE' (weights will be created anyway).
2: In .readCodelinkRaw(files = filename, ...) :
  Product type for GSM108109.TXT is unknown (missing PRODUCT field in header).

 

Here is some more information from my results: At last Annotation is blank

CodelinkSet (storageMode: lockedEnvironment)
assayData: 20469 features, 1 samples
  element names: background, exprs, flag, snr, weight
protocolData: none
phenoData
  rowNames: 1
  varLabels: sample
  varMetadata: labelDescription
featureData
  featureNames: 1 2 ... 20469 (20469 total)
  fvarLabels: probeName probeType ... meanSNR (5 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: 

ADD REPLY
0
Entering edit mode

There is a distinction between feature and probe. Features are unique locations in the array containing a particular probe type (the "spots" to say it plainly). Probe is just an oligonucleotide sequence. Different features can contain the same probe. So when you do probeNames() you are obtaining the probes, not the features, which you can obtain with featureNames(). Anyway, the reason you do not see the features codes in the your data as in the vignette example is because that information is not present in your files. Unfortunately the codelink format is not strictly enforced and different fields can be found depending on how it was exported. If the Feature_ID field is not found then that information is not included. In principle this does not cause any problem (so far, no one has complained about problems with this, and myself did not find any).

ADD REPLY
0
Entering edit mode

Thanks for your comment. But i think the feature ID is important to annotate the data set. If i need to annotate each probe of the processed data by the annotation file(GPL), usually the ID's are matched, that is what i do. May be the approach is different there.

Like in affymetrix data, the probes are annotated from the GPl file by ID only. I think same should be the case here also. Anyway thank U so much for providing information.

ADD REPLY
0
Entering edit mode

That is actually not correct. The feature ID is only relevant to the position in the array. The probe ID, that is the oligonucleotide sequence, is what is linked to any annotation. The annotation packages for codelink, which I also update every release, are linked to the probe ids. You can use those for the annotations. It has been almost 10 years since the original release of the codelink and annotation packages and no one has found any problem in that regard. So please, give it a try first!

ADD REPLY
0
Entering edit mode

Sory I missed it here. Actually i am talking about probe ID's. So U also agree that probe ID's are must. If possible please give some more explanation of your package, that would really be nice. Still i am confused about the probe IDs which i didn't get from the data. As per you, it totally depends on the data given, if i am not wrong. So if the probe ID's are not present, i think we can not annotate the data at-least according to the way how i do.

ADD REPLY
0
Entering edit mode

Please read again my comments. As I said there are two concepts, feature and probe. You are not getting the original feature IDs, but thats OK. You are getting the probe ids (you showed us that with probeNames()) and so there is not problem at all with your code or with the package. Maybe the confusion is that the CodelinkSet object shows featureNames as 1,2,3.... Well, those are the feature IDs. The reason they are called featureNames is because CodelinkSet inherits from ExpressionSet (from Biobase) where they are defined that way. Either way, as I mentioned before, you get the feature IDs using featureNames() and you get the probe ids with probeNames() (and you have used successfully this in your original post). So, if you still have some confusion could you please explain in detail what exactly are you expecting to see? Thank you.

ADD REPLY
0
Entering edit mode

Thanks for explaining. I will looking again into the package to clear my confusion. Thanks a lot.

ADD REPLY
0
Entering edit mode

Hello Dear Diego Diez, with continuation to the above discussion, i got some processed data from GEO, which seems to have four columns. A part of that is shown below. The data has four columns but in the above code i got three columns. Can u please tell me what i missed above to get the second column. Thanks

  ID   SignalIntensity     SNR
1 109   7.5804349594  1.3421827372
2 110    11.2698560261   9.7509784076
3 111    8.6006358676  1.804853485
4 112    9.5435575539  3.2013768505
5 113    10.181235138  4.6459936834
6 114    8.1168949395  1.4226360984
7 116    8.4670495571  1.8682653573
8 117    8.2733657566  1.6663830675
9 118    5.4203308081  0.6769947295
10 119    8.5379673202  1.8326649796
11 120    9.7949416719  4.0503974531
12 121    5.9770829707   0.7442132887
13 123   5.7623248946  0.7208574513
14 124     5.3253576296        0.6793816015
15 125  8.3676877563 1.6437990197  
ADD REPLY
0
Entering edit mode

I am sorry but I am not sure I follow your question. Where in the above code do you have 4 columns? Where did you get the data pasted? I have never seen anything like that data. Remember that the processed data in GEO does not necessarily need to have been obtained using the codelink package. The codelink package does not work with processed GEO data (in terms of reading, and preprocessing of course). Please, either: 1) follow the instructions in the vignette and let me know if there is anything specific to those that you do not understand, 2) use instead GEOquery package to obtain the GEO expression matrix for the GEO entry, 3) or tell us exactly what you are attempting to do, what you expect to find and what you are finding instead. From the above code and data I cannot make a link between what I see and what your problem seems to be.

ADD REPLY
0
Entering edit mode

I think annotation package need to be used to get the second column , which seems to give probe ID (second column). I looked on this link

https://stat.ethz.ch/pipermail/bioconductor/2006-July/013515.html  which is yours indeed.

I am trying to annotate the probe IDs with the gene name from the GPL file. For that matter i need to have the second column. I use affymetrix package also which directly give me probe IDs which further i can annotate with the gene name. But here i am bit confused how to annotate probe ids(which are missing) with the gene name.

 

A part of GPL file is below. I annotate each probe ID of GSM file with the probe ID of GPL file to get the gene name or gene symbol, which is shown below.

109
 SEC14-like 2 (S. cerevisiae)
110
 microtubule-associated protein 7
111       
alpha-kinase 3
112
oxidase (cytochrome c) assembly 1-like
113
chondroitin sulfate glucuronyltransferase
114
Sec23 homolog B (S. cerevisiae)
115
 
116
poly(A) binding protein interacting protein 2B
117
chymotrypsin-like
118
KIAA1324
119
protein tyrosine phosphatase, non-receptor type 21
120
membrane-associated ring finger (C3HC4) 4
ADD REPLY
0
Entering edit mode

Again, I am getting very confused about what exactly is your problem since your comment now seems unrelated to the discussion above. Please, could you give the the GEO id for the data you want to use?

ADD REPLY
0
Entering edit mode

Ok. I want to process the raw data of GEO ID GSM108263 by your codelink package and want to get the first column as shown on GEO, named as ID_REF.

ADD REPLY
0
Entering edit mode

But, I already told you that you do not need that column. You need the probe name or probe id (synonyms), no the feature id. Anyway, it seems there is an additional problem with that dataset. You can of course load it and preprocess it in R as mentioned in the vignette:

> library(codelink)
# get files in the current directory.
> f <- list.files(pattern = "gz")
> f
> head(f)
[1] "GSM108109.TXT.gz" "GSM108221.TXT.gz" "GSM108225.TXT.gz" "GSM108228.TXT.gz"
[5] "GSM108231.TXT.gz" "GSM108235.TXT.gz"
# read data.
> eraw <- readCodelinkSet(f)
> eraw
# this is required since codelink cannot guess the chip type, see the warnings upon loading data regarding missing Product information.
> annotation(eraw) <- "h20kcod.db"
# preprocessing data.
> eraw
> codPlot(eraw,what = "density")
> ecod <- codPreprocess(eraw, bg.method = "normexp")
> codPlot(ecod,what = "density")
> codPlot(ecod)
# check feature information.
> head(fData(ecod))
           probeName probeType logicalRow logicalCol   meanSNR
1 NM_012429.1_PROBE1 DISCOVERY          1          9 0.9594485
2 NM_003980.2_PROBE1 DISCOVERY          1         10 6.5453198
3    AY044449_PROBE1 DISCOVERY          1         11 1.8873607
4 NM_005015.1_PROBE1 DISCOVERY          1         12 5.0572757
5    AB037823_PROBE1 DISCOVERY          1         13 6.6339932
6 NM_032986.1_PROBE1 DISCOVERY          1         14 1.4742470
# get probe names.
> probeNames(ecod)[1:10]
 [1] "NM_012429.1_PROBE1" "NM_003980.2_PROBE1" "AY044449_PROBE1"   
 [4] "NM_005015.1_PROBE1" "AB037823_PROBE1"    "NM_032986.1_PROBE1"
 [7] "AB032981_PROBE1"    "NM_001907.1_PROBE1" "AB037745_PROBE1"   
[10] "NM_007039.1_PROBE1"

The problem is that those probe names are legacy probe names, meaning, they do not match the probe names used in annotation packages. Probably this chips were processed with an old version of the Codelink software. I have the mapping information in the original annotation files but these are not included in the annotations (and not sure how that could be done). An easier, faster way is to use GEOquery package as it seems GEO internally has mapped the probe names from the old to new style:

eset <- getGEO("GSE4797")
eset <- eset[[1]]
head(fData(eset))
# do preprocessing, e.g. with limma matrix methods. For example:
y <- backgroundCorrect.matrix(eset, offset = 50, method = "normexp")
y <- normalizeBetweenArrays(y, method = "quantile")
exprs(eset) <- y

# here we choose 10 probes at random- but this will come from your statistical analysis.
probes.sel <- as.character(fData(eset)$PROBE_NAME[1:10])
> select(h20kcod.db, keys = probes.sel, columns = c("SYMBOL", "ENTREZID", "GENENAME"), )
   PROBEID   SYMBOL ENTREZID                                           GENENAME
1  GE53815  SEC14L2    23541                       SEC14-like 2 (S. cerevisiae)
2  GE59986     MAP7     9053                   microtubule-associated protein 7
3  GE53908    ALPK3    57538                                     alpha-kinase 3
4  GE60064    OXA1L     5018             oxidase (cytochrome c) assembly 1-like
5  GE53952    CHPF2    54480                  chondroitin polymerizing factor 2
6  GE60188   SEC23B    10483                    Sec23 homolog B (S. cerevisiae)
7  GE53801   PAIP2B   400961     poly(A) binding protein interacting protein 2B
8  GE59978     CTRL     1506                                  chymotrypsin-like
9  GE53905 KIAA1324    57535                                           KIAA1324
10 GE60052   PTPN21    11099 protein tyrosine phosphatase, non-receptor type 21
> <font face="sans-serif, Arial, Verdana, Trebuchet MS">
</font>
ADD REPLY
0
Entering edit mode

Thanks a lot for your nice explanation. The way of your doing is pretty good to annotate the probe ID from Geo-query to get gene names and other information. And its clear for me at the end that there is no need to have that column, that is really nice.

But the GEO data drive me very confusing (especially the ID-REF column) how to get that. All the data in GEO has same formate as i mentioned above. May be this is some reference they are using for annotation. But its totally confusing me. The ID-REF column is present in both the GSM(samples) and GPL(annotation file).

So its clear now that the column(ID-REF) shown above is not achievable by your codelink package???. 

Thanks for your comments.

ADD REPLY
0
Entering edit mode

It is possible the ID_REF refers to the Feature_ID mentioned above. If it is indeed Feature_ID you could get it with featureNames(ecod), as I said before. But, as I said before, it is not always exported- and it is not in the raw codelink files found in your dataset. Anyway, in terms of the R workflow it is (almost) useless.

ADD REPLY
0
Entering edit mode

It is possible the ID_REF refers to the Feature_ID mentioned above. If it is indeed Feature_ID you could get it with featureNames(ecod), as I said before. But, as I said before, it is not always exported- and it is not in the raw codelink files found in your dataset. Anyway, in terms of the R workflow it is (almost) useless.

ADD REPLY
0
Entering edit mode

 Thanks a lot.

ADD REPLY
0
Entering edit mode

I feel very embarrassed and awkward to ask you the same question again and again and i am very sorry for that.  Really a hard time for me , as i am interested to get the featureNames() in the GEO data format, but not getting any way.

I checked featureNames(ecod) as per your recommendation, but the featureNames() start with 1,2,3 not from 109 or so as given in the above data frame with four columns. May be i am foolish to ask this!!.

Again i went through your codelink package. There are two example of data

1) data(codset). In data(codset) the featureNames() start from 1001, 1002,1003 and so on. May i know why?? why not from 1,2 and so on

2) data(codelink.example).  You change it to newer version by   tmp = Codelink2CodelinkSet(codelink.example) and then the featureNames() start from 1,2 and so on. Is it the formate in new version of codelink.

will you please let me know the difference between the featureNames() starting with 1,2,3 and with 109,110 and so on.  Does these vary for new to older???

 

 

ADD REPLY
0
Entering edit mode

Then again, why are you interested in getting the featureNames() in the GEO data format if we already agreed they are not useful?

The reason you do not get them, as I explained to you before, is because that information IS NOT found in the raw files. Because it is not found, codelink uses the ordering in the files to set the row names in the expression matrix, which become what you get with featureNames(), i.e. 1, 2, 3, etc. How then in GEO they know the feature IDs? Well, my guess is that they may assume the ordering in the Codelink file preserves some original ordering and assign them automagically. Because the Codelink format is not strictly enforced and may be inconsistent I prefer not to make that assumption, hence I do not assign information that does not exists in the original file. Maybe there is a better way to handle it but so far this is how it is. I think this is well enough explained.

A separate issue is that you are throwing a lot of different questions unrelated to the original posted question. This thwarts the purpose of this support site, where ideally you would write a separate post for each question, so that specific answers can be searched for and found efficiently. Please, keep that in mind.

ADD REPLY
0
Entering edit mode

At any rate, I have added a response to the original question in your post.

ADD REPLY
0
Entering edit mode

Thanks again for explaining in a better way.

ADD REPLY
1
Entering edit mode
Diego Diez ▴ 760
@diego-diez-4520
Last seen 4.1 years ago
Japan

A targets.txt file is just a text file with each row representing a microarray file. I believe the concept was first introduced in the limma package, and it is explained in its user guide (where it is called targets frame). The columns represent information related to each microarray file, and may include the file name itself, the sample name, and other technical (e.g. date of hybridization), experimental (e.g. control, treatment) or phenotypical (e.g. wild type, KO) information. Any information related to the sample hybridized to the microarray can go into the target file. An example targets.txt file may look like this (were FileName, SampleName and Treatment variables are included):

FileName SampleName Treatment
File1.txt S1 Control
File2.txt S2 Control
File3.txt S3 Treatment1
File4.txt S4 Treatment1

This information can be assigned to a phenoData-class object in objects of the class ExpressionSet. Since the CodelinkSet class inherits ExpressionSet, the same logic applies for objects of class CodelinkSet. For example, as explained in the codelink vignette, the array information can be loaded first from a targets frame and assigned to the CodelinkSet object created:

library(codelink)
pdata <- read.AnnotatedDataFrame("targets.txt")
codset <- readCodelinkSet(filename = pdata$FileName, phenoData = pdata)
ADD COMMENT
0
Entering edit mode

Again thanks for answering to my original post.

 

ADD REPLY
0
Entering edit mode
@agaz-hussain-wani-7620
Last seen 6.6 years ago
India

                                   

ADD COMMENT
0
Entering edit mode

                    

ADD REPLY

Login before adding your answer.

Traffic: 921 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6