Question

Targets file: where is it?

Gero • 0

Hi, I am trying to learn how to process raw Agilent microarray data. I am following the limma User's Guide and looking for a file that matches the targets file structure it describes, but I cannot find one in this dataset (GEO: GSE52919). Am I missing something? Could it be that no targets file is available?

gset <- getGEO("GSE52919", GSEMatrix =TRUE, getGPL=TRUE, destdir = working_dir)

Limma ArrayExpress GEO • 1.3k views

Updated 3.0 years ago by James W. MacDonald • written 3.0 years ago by Gero

Hi, I think for this dataset you will have to create the targets file yourself. The information you need is in the series matrix file: the sample-specific information for this dataset, such as age, is stored in the rows starting with "!Sample_".

Comment by Basti • 3.0 years ago
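
To illustrate the comment above, here is a rough sketch of pulling the "!Sample_" rows out of a series matrix file with base R. The `lines` vector is a mock with the same tab-delimited, quoted layout, not the real file contents; with the downloaded file you would presumably replace it with something like `readLines(gzfile("GSE52919_series_matrix.txt.gz"))`. The object names are only illustrative.

```r
# Mock header lines shaped like a GEO series matrix file (tab-delimited, quoted values).
# This is an assumption for illustration, not the actual content of the GSE52919 file.
lines <- c(
  "!Series_title\t\"example\"",
  "!Sample_title\t\"drug resistant_Group 2 [A-06]\"\t\"sensitive to AraC_Group 1 [A-01]\"",
  "!Sample_geo_accession\t\"GSM1278195\"\t\"GSM1278200\"",
  "!Sample_characteristics_ch1\t\"age: 43y\"\t\"age: 43y\"",
  "!series_matrix_table_begin"
)

# Keep only the per-sample rows and split each one on tabs
sample_lines <- grep("^!Sample_", lines, value = TRUE)
fields <- strsplit(sample_lines, "\t", fixed = TRUE)

# The first field is the row label, the remaining fields are one value per sample
keys <- sub("^!Sample_", "", vapply(fields, `[`, character(1), 1))
values <- lapply(fields, function(f) gsub('"', "", f[-1], fixed = TRUE))

# One row per sample, one column per "!Sample_" row (repeated labels made unique)
sample_info <- as.data.frame(setNames(values, make.unique(keys)),
                             stringsAsFactors = FALSE)
print(sample_info)
```

In practice `getGEO` does this parsing for you, so this is mainly useful for seeing where the sample annotation comes from.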

score 3 · Accepted Answer · 2021-04-14

The 'targets file' is just a file that contains relevant phenotypic data about your subjects that you might use to fit a linear model. If you are getting the data from GEO, you have to rely on what the submitter(s) gave you, rather than having your own file.
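
To make that concrete, here is a small sketch of what such a file amounts to: a tab-delimited table with one row per sample, which is then turned into a design matrix. The sample IDs and group numbers are taken from this dataset's sample titles, but the column names, the `Targets.txt` file name, and the choice of a group-means design are assumptions for illustration. limma's `readTargets` reads this kind of file; base `read.delim` is used here so the sketch has no dependencies.

```r
# A hand-made targets table: one row per sample, one column per variable.
# Column names are illustrative, not required by any package.
targets <- data.frame(
  SampleName = c("GSM1278195", "GSM1278196", "GSM1278200",
                 "GSM1278201", "GSM1278205", "GSM1278206"),
  Group = c("Group2", "Group2", "Group1", "Group1", "Group3", "Group3"),
  stringsAsFactors = FALSE
)

# Write it out as a tab-delimited file and read it back in
targets_path <- file.path(tempdir(), "Targets.txt")
write.table(targets, targets_path, sep = "\t", quote = FALSE, row.names = FALSE)
targets_in <- read.delim(targets_path, stringsAsFactors = FALSE)

# Build a group-means design matrix from the Group column
group <- factor(targets_in$Group)
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
print(design)
```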

As a pedantic aside, you don't actually have to specify an argument to a function if you are planning to use its default value. For example:

getGEO("GSE52919", GSEMatrix =TRUE, getGPL=TRUE)

## is identical to

getGEO("GSE52919")

## because you are specifying existing default values

I mean there's nothing wrong with that, except it sort of implies that you are doing something different than the usual when in fact you aren't. Anyway...

> library(GEOquery)
Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
Warning message:
package 'GEOquery' was built under R version 4.0.3 
> z <- getGEO("GSE52919")[[1]]
Found 1 file(s)
GSE52919_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE52nnn/GSE52919/matrix/GSE52919_series_matrix.txt.gz'
Content type 'application/x-gzip' length 3758846 bytes (3.6 MB)
downloaded 3.6 MB


-- Column specification --------------------------------------------------------
cols(
  ID_REF = col_character(),
  GSM1278195 = col_double(),
  GSM1278196 = col_double(),
  GSM1278197 = col_double(),
  GSM1278198 = col_double(),
  GSM1278199 = col_double(),
  GSM1278200 = col_double(),
  GSM1278201 = col_double(),
  GSM1278202 = col_double(),
  GSM1278203 = col_double(),
  GSM1278204 = col_double(),
  GSM1278205 = col_double(),
  GSM1278206 = col_double(),
  GSM1278207 = col_double(),
  GSM1278208 = col_double(),
  GSM1278209 = col_double()
)

File stored at: 
C:\Users\Public\Documents\Wondershare\CreatorTemp\RtmpIr34Gz/GPL13252.soft

> z
ExpressionSet (storageMode: lockedEnvironment)
assayData: 50238 features, 15 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM1278195 GSM1278196 ... GSM1278209 (15 total)
  varLabels: title geo_accession ... gender:ch1 (35 total)
  varMetadata: labelDescription
featureData
  featureNames: GT_44k_23_P100001 GT_44k_23_P100011 ...
    GT_u92_snmRNA_Homo_00007431 (50238 total)
  fvarLabels: ID GeneName ... SPOT_ID (6 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 26083014 
Annotation: GPL13252 

## the important part here is to note that the phenoData slot contains the phenotypic data, and can
## be accessed using the pData function
## you don't just want to print all that out however, so let's proceed cautiously

> names(pData(z))
 [1] "title"                   "geo_accession"          
 [3] "status"                  "submission_date"        
 [5] "last_update_date"        "type"                   
 [7] "channel_count"           "source_name_ch1"        
 [9] "organism_ch1"            "characteristics_ch1"    
[11] "characteristics_ch1.1"   "molecule_ch1"           
[13] "extract_protocol_ch1"    "label_ch1"              
[15] "label_protocol_ch1"      "taxid_ch1"              
[17] "hyb_protocol"            "scan_protocol"          
[19] "description"             "data_processing"        
[21] "platform_id"             "contact_name"           
[23] "contact_email"           "contact_phone"          
[25] "contact_department"      "contact_institute"      
[27] "contact_address"         "contact_city"           
[29] "contact_state"           "contact_zip/postal_code"
[31] "contact_country"         "supplementary_file"     
[33] "data_row_count"          "age:ch1"                
[35] "gender:ch1"             

## most of that information is boring and unimportant for our uses, so let's look at a subset.

> pData(z)[,c(1,34,35)]
                                      title age:ch1 gender:ch1
GSM1278195    drug resistant_Group 2 [A-06]     43y       male
GSM1278196    drug resistant_Group 2 [A-07]     61y     female
GSM1278197    drug resistant_Group 2 [A-08]     32y     female
GSM1278198    drug resistant_Group 2 [A-09]     43y     female
GSM1278199    drug resistant_Group 2 [A-10]     20y       male
GSM1278200 sensitive to AraC_Group 1 [A-01]     43y     female
GSM1278201 sensitive to AraC_Group 1 [A-02]     50y       male
GSM1278202 sensitive to AraC_Group 1 [A-03]     52y       male
GSM1278203 sensitive to AraC_Group 1 [A-04]     33y     female
GSM1278204 sensitive to AraC_Group 1 [A-05]     44y     female
GSM1278205  sensitive to Dnr_Group 3 [A-11]     50y       male
GSM1278206  sensitive to Dnr_Group 3 [A-12]     18y     female
GSM1278207  sensitive to Dnr_Group 3 [A-13]     44y     female
GSM1278208  sensitive to Dnr_Group 3 [A-14]     44y       male
GSM1278209  sensitive to Dnr_Group 3 [A-15]     21y       male

That looks like the extent of the data supplied. Obviously, if you want to use that information you will have to clean up the first and second columns: the titles are unique for every sample when what you need is a label repeated within each group, and the ages should be numeric rather than character.
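
A minimal sketch of that clean-up, assuming the titles and ages keep the format shown above. The small `pd` data frame is a stand-in for `pData(z)[,c(1,34,35)]`, built from three of the rows printed above so the example runs without downloading anything. The `targets` name and its column names are my own choice for illustration.

```r
# Stand-in for pData(z)[, c(1, 34, 35)], using three rows from the output above
pd <- data.frame(
  title = c("drug resistant_Group 2 [A-06]",
            "sensitive to AraC_Group 1 [A-01]",
            "sensitive to Dnr_Group 3 [A-11]"),
  `age:ch1` = c("43y", "43y", "50y"),
  `gender:ch1` = c("male", "female", "male"),
  row.names = c("GSM1278195", "GSM1278200", "GSM1278205"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

targets <- data.frame(
  Sample = rownames(pd),
  # Drop everything from "_Group" onwards so samples in a group share one label
  Group = factor(sub("_Group.*$", "", pd$title)),
  # Strip the trailing "y" and convert the ages to numeric
  Age = as.numeric(sub("y$", "", pd[["age:ch1"]])),
  Gender = factor(pd[["gender:ch1"]]),
  stringsAsFactors = FALSE
)
print(targets)
```

With the real object you would replace `pd` by `pData(z)[,c(1,34,35)]`. The group labels still contain spaces, so you may want to pass them through `make.names` before using them as column names in a design matrix.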