Question: How to check if the microarray data is from codelink
Agaz Hussain Wani (India) wrote, 23 months ago:

I want to know whether the data is from Codelink before using library(codelink) to generate expression values and p-values. My first approach was to check for a .TXT file extension, but I found that it is not always .TXT; raw Codelink files can also end in .txt. For example, compare [GSE9490](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9490) with [GSE9334](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9334).

I also tried searching the files for the keyword "CodeLink Expression Analysis", but some files do not contain it (again, [GSE9490](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9490) vs. [GSE9334](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9334)). Is there something that is always present in a Codelink file and can be used to check reliably whether the data belongs to the Codelink platform?

modified 23 months ago by Sean Davis • written 23 months ago by Agaz Hussain Wani
Sean Davis (United States) wrote, 23 months ago:

The official (supplied by codelink) GEO Platform (GPL) is:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL2895

The GPL record contains information about associated samples and series. The following will provide the series IDs associated with the codelink platform:

library(GEOquery)
gpl = getGEO("GPL2895")
Meta(gpl)$series_id

[1] "GSE3578" "GSE4106" "GSE4609" "GSE4812" "GSE4846" "GSE5108" "GSE5216" "GSE5350" "GSE6213"
[10] "GSE6304" "GSE6585" "GSE6630" "GSE6692" "GSE7330" "GSE8353" "GSE8604" "GSE9332" "GSE9490"
[19] "GSE10064" "GSE10123" "GSE10145" "GSE12530" "GSE13857" "GSE14797" "GSE14808" "GSE15829" "GSE16523"
[28] "GSE16717" "GSE16944" "GSE17470" "GSE18124" "GSE18464" "GSE19834" "GSE20167" "GSE22812" "GSE24519"
[37] "GSE24591" "GSE24807" "GSE25431" "GSE26326" "GSE27448" "GSE29002" "GSE29136" "GSE29763" "GSE31075"
[46] "GSE32191" "GSE32403" "GSE32902" "GSE33133" "GSE33651" "GSE35499" "GSE36007" "GSE37186" "GSE37187"
[55] "GSE38542" "GSE40007" "GSE44172" "GSE44187" "GSE44736" "GSE55768" "GSE56739" "GSE60602" "GSE79189"
[64] "GSE80347" "GSE94318"

There are three additional GPLs (alternatives supplied by other submitters) noted on that webpage. GEO adds that information to the GPL as simple text annotations (not ideal, but the information is there).

Meta(gpl)$relation

[1] "Alternative to: GPL11010"
[2] "Alternative to: GPL8060"
[3] "Alternative to: GPL18134 ([DISCOVERY PROBE_TYPE])"


Each of these GPL records can be treated the same way to get a complete list of GSEs (or GSMs, if that is the goal).
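
To treat the alternative platforms the same way, the GPL accessions first have to be extracted from the free-text relation strings. A minimal sketch of that step follows; `alternative_gpls()` is a hypothetical helper name, not a GEOquery function, and the sketch assumes the relation strings keep the "Alternative to: GPL..." form shown above.

```r
# Pull GPL accessions out of GEO "relation" strings such as
# "Alternative to: GPL18134 ([DISCOVERY PROBE_TYPE])".
# alternative_gpls() is a hypothetical helper, not part of GEOquery.
alternative_gpls <- function(relation) {
  # Keep only the "Alternative to:" entries, then grab the first GPL ID in each.
  alt <- relation[grepl("^Alternative to:", relation)]
  regmatches(alt, regexpr("GPL[0-9]+", alt))
}

# The relation strings reported for GPL2895 above.
relation <- c(
  "Alternative to: GPL11010",
  "Alternative to: GPL8060",
  "Alternative to: GPL18134 ([DISCOVERY PROBE_TYPE])"
)

all_gpls <- c("GPL2895", alternative_gpls(relation))
print(all_gpls)

# With network access, the series lists could then be combined (not run here):
# library(GEOquery)
# all_gse <- unique(unlist(lapply(all_gpls, function(id) Meta(getGEO(id))$series_id)))
```

In a live session, `relation` would simply be `Meta(gpl)$relation`.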

Alternatively, each GSE record has an associated platform, stored in the annotation slot of an ExpressionSet. More concretely:

gse = getGEO('GSE3578')[[1]]
# gse is an ExpressionSet
gse


Note the Annotation below shows "GPL2895".

ExpressionSet (storageMode: lockedEnvironment)
assayData: 54359 features, 156 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: GSM82284 GSM82285 ... GSM128604 (156 total)
  varLabels: title geo_accession ... data_row_count (31 total)
featureData
  featureNames: 1001 1002 ... 504109 (54359 total)
  fvarLabels: ID LOGICAL_ROW ... GI_LIST (9 total)
experimentData: use 'experimentData(object)'
Annotation: GPL2895


Returning to the original question, checking to see if a GSE belongs to a specific platform is just this check:

annotation(gse) == 'GPL2895'

[1] TRUE
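
Because the same array is also registered under the alternative GPL records, the check can be widened with `%in%`. This is only a sketch: `codelink_gpls` and `is_codelink_platform()` are illustrative names, and the vector lists just the four platform records named in this answer, so it should not be read as a complete list of Codelink platforms in GEO.

```r
# Platform accessions named in this answer: the official record plus
# the three "Alternative to" records. Assumed, not exhaustive.
codelink_gpls <- c("GPL2895", "GPL11010", "GPL8060", "GPL18134")

# Hypothetical helper: TRUE if a platform accession is one of the above.
is_codelink_platform <- function(platform_id) {
  platform_id %in% codelink_gpls
}

print(is_codelink_platform("GPL2895"))
print(is_codelink_platform("GPL570"))

# With an ExpressionSet from getGEO() (not run here):
# is_codelink_platform(annotation(gse))
```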


EDIT: On rereading, this is perhaps not a complete answer to the original question, which seems to focus on parsing the text files themselves. Indeed, matching files to formats is a challenging problem.

Diego Diez (Japan) wrote, 23 months ago:

Unfortunately, it is not possible to identify Codelink files by their extension. The extension is usually either TXT or txt, but that is not very informative because it is also commonly used for regular text files. A Codelink file has to contain a header formatted in a particular way, and it begins with the text "CodeLink Expression Analysis 5.0.0.18008", although the software version may differ. For example, the GEO dataset GSE9490 is in Codelink format but GSE9334 is not.

You can use the codelink package to read and preprocess the first but not the second. Alternatively, you can use the GEOquery package to read them directly from GEO into an ExpressionSet object.
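
The header check described above can be sketched in a few lines of base R. `looks_like_codelink()` is a hypothetical helper, not a function from the codelink package; it assumes the "CodeLink Expression Analysis" text appears within the first few lines of the file, and it ignores the version number because that may differ between files.

```r
# Hypothetical helper: does the start of the file carry the Codelink header text?
# Not part of the codelink package; n (lines inspected) is an arbitrary choice.
looks_like_codelink <- function(path, n = 5) {
  head_lines <- readLines(path, n = n, warn = FALSE)
  # Literal, byte-wise match so odd encodings do not cause errors.
  any(grepl("CodeLink Expression Analysis", head_lines,
            fixed = TRUE, useBytes = TRUE))
}

# Self-contained demo with two temporary files (contents are made up).
codelink_file <- tempfile(fileext = ".TXT")
other_file <- tempfile(fileext = ".txt")
writeLines(c("CodeLink Expression Analysis 5.0.0.18008",
             "(rest of header omitted)"), codelink_file)
writeLines(c("ID_REF\tVALUE", "1001\t0.5"), other_file)

print(looks_like_codelink(codelink_file))
print(looks_like_codelink(other_file))
```

A file that passes this check may still fail to parse if the rest of the header is malformed, so it is best treated as a quick filter before handing the file to the codelink package.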

Note that the codelink package will help you to read and preprocess Codelink files, not to "generate a p-value". For that you need some other package for statistical analysis, like the limma package.

EDIT

To clarify my post:

To find out whether a particular dataset in GEO is from the Codelink platform, the approach described by Sean is a very good way to go. Once you have identified some Codelink datasets, you can either import them into R using getGEO() or download the RAW data (e.g. the GSE9490_RAW link at the bottom of the series page) and read it with the codelink package. The advantage of the codelink package is that it gives you more control over what you do with the data. The disadvantage is that this is not always possible, because the data uploaded to GEO sometimes does not conform to the Codelink format (even though the extension is TXT or txt). In that case, I feel that using getGEO() is the simplest option.

Regardless, of the two datasets mentioned in the OP, one is in Codelink format and the other is not (so using the codelink package with the latter is obviously not an option).