Obtaining pheno data from CEL files
Kurtis • 0

Hi there,

I am working with the dataset GSE126595. The authors say that information about sample treatment is in the raw data files, but not in the GSE files. I have tried packages such as affy and oligo, but I cannot get hold of the pheno data that I need. I simply need the GSM accession numbers and how the samples were treated. Does anyone know how I can do this?

Thanks

Kurt :)

GEO affy • 952 views
Kurtis • 0

Thanks James. I have downloaded the CEL files. If I wanted to locate any associated pheno data, how might I do that with this package?

Thank you!


Hi, you pressed the "ADD ANSWER" button when you should have pressed the "ADD REPLY" button right below James's comment. Anyway, you already have the answer to your question at the top of the thread, written by Basti, e.g.:

pdat <- pData(phenoData(GSE126595[[1]]))
table(pdat$characteristics_ch1.5)

            clinical course: early progression 
                                           375 
clinical course: intermediate/late progression 
                                           351

GEO stores phenotype data in columns with generic names such as characteristics_ch1.5, so you need to explore the data yourself to find where the relevant phenotype information is stored. One way to do this is to look at the first few rows of the data.frame:

head(pdat)

or, if you prefer, write the data.frame to a CSV file with write.csv(pdat, "phenodata.csv") and open it in a spreadsheet program.
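
If scanning the columns by eye gets tedious, something along these lines may help. It is only a sketch: the pdat below is a small made-up stand-in that mimics the usual GEO layout (a geo_accession column plus "key: value" entries in the characteristics columns), and names such as char_cols, keys and samples are just illustrative. With the real data you would skip the toy data.frame and use the pdat obtained above.

```r
# Toy stand-in for the data.frame returned by pData(phenoData(gse[[1]])).
# The column layout mimics GEO's conventions; the values are invented.
pdat <- data.frame(
  geo_accession = c("GSM0000001", "GSM0000002", "GSM0000003"),
  characteristics_ch1 = c("tissue: blood", "tissue: blood", "tissue: blood"),
  characteristics_ch1.1 = c("clinical course: early progression",
                            "clinical course: intermediate/late progression",
                            "clinical course: early progression"),
  stringsAsFactors = FALSE
)

# Pick out every characteristics column, whatever its numeric suffix.
char_cols <- grep("^characteristics_ch1", colnames(pdat), value = TRUE)

# Each entry looks like "key: value"; report the key(s) found in each column.
keys <- vapply(char_cols, function(col) {
  paste(unique(sub(":.*$", "", pdat[[col]])), collapse = ", ")
}, character(1))
print(keys)

# Build a compact table: GSM accession plus the value of the column of interest.
course_col <- names(keys)[keys == "clinical course"]
samples <- data.frame(
  geo_accession = pdat$geo_accession,
  clinical_course = sub("^[^:]*:[[:space:]]*", "", pdat[[course_col]]),
  stringsAsFactors = FALSE
)
print(samples)
```

The keys vector gives a quick overview of what each characteristics column holds, so you can see at a glance whether any of them mentions the treatment you are after.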


Thanks Robert. I'm fine accessing the phenotypic data using GEOquery, but since the author states that "raw data files include info on CD19 selection..." I want to check the CEL files, which, as I understand it, contain the raw data. I am hoping there would be an extra column in there with more phenotypic information, perhaps on the CD19 selection. Any insight would be valuable.

Thanks

Kurtis.


I already responded to this exact question, and told you that the CEL files won't contain anything useful. If you don't believe me, it's easy enough to check for yourself.

> library(affyio)
> z <-  read.celfile("GSM3608790_CLLCCAID2452S11131000002819182017.CEL.gz")
> names(z)
[1] "HEADER"    "INTENSITY"
[3] "MASKS"     "OUTLIERS" 
> lapply(z$INTENSITY, head)
$MEAN
[1] 9710  284 9168  186  159  176

$STDEV
[1] 1160.8   59.0 1631.8   24.7
[5]   11.3   20.7

$NPIXELS
[1] 9 9 9 9 9 9


> head(z$OUTLIERS)
       X Y
[1,]  38 0
[2,] 205 0
[3,] 302 0
[4,] 314 0
[5,] 325 0
[6,] 612 0
> head(z$MASKS)
     X Y
> z$HEADER
$cdfName
[1] "HuEx-1_0-st-v2"

$`CEL dimensions`
[1] 2560 2560

$GridCornerUL
[1] 515 555

$GridCornerUR
[1] 18796   465

$GridCornerLR
[1] 18891 18666

$GridCornerLL
[1]   610 18756

$DatHeader
[1] ""

$Algorithm
[1] "Feature Extraction Cell Generation"

$AlgorithmParameters
[1] "Percentile:75;CellMargin:4;OutlierHigh:1.500000;OutlierLow:1.004000;AlgVersion:;FixedCellSize:TRUE;FullFeatureWidth:7;FullFeatureHeight:7;IgnoreOutliersInShiftRows:;FeatureExtraction:TRUE;PoolWidthExtenstion:;PoolHeightExtension:;UseSubgrids:TRUE;RandomizePixels:FALSE;ErrorBasis:StdvMean;StdMult:1.000000;"

Thanks James. Your insight is appreciated.

Basti ▴ 780

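You can obtain it with GEOquery:

library(GEOquery)
# Download the series matrix; getGEO() returns a list of ExpressionSets
GSE126595 <- getGEO('GSE126595', GSEMatrix = TRUE)
# Sample annotation (one row per GSM) for the first ExpressionSet
pData(phenoData(GSE126595[[1]]))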

Or you can download the series matrix directly here: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE126nnn/GSE126595/matrix/GSE126595_series_matrix.txt.gz


Hi Basti

Thank you for your speedy response. I have already tried that, but it doesn't contain the metadata I am looking for. The authors say "Normalized data is stored with the assigned analysis ID, raw data files include info on CD19 selection...". I would like to know about the CD19 selection specifically, so I thought to look in the CEL files, as the info may be there?

Thank you!

Kurtis.


The CEL files almost surely won't contain any information on CD19 selection. Those are binary files from the Affy scanner that should only contain information about the chip itself, not any sample information. There may be something in the supplementary data, but it doesn't seem likely (see the filelist.txt here). You could hypothetically download that tar.gz file and see if there is anything else, but I would use a browser directly rather than getGEOSuppFiles because it's going to take a while to get.
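
If you just want to see which supplementary files exist before committing to the big download, a sketch like the following might do. It assumes a reasonably recent GEOquery in which getGEOSuppFiles accepts fetch_files = FALSE (in that mode it should return a data.frame of file names and URLs without downloading anything), and it needs network access. The variable name supp is just illustrative.

```r
library(GEOquery)

# Assumption: this GEOquery version supports fetch_files = FALSE, which
# lists the supplementary files for the series instead of downloading them.
supp <- getGEOSuppFiles("GSE126595", fetch_files = FALSE)

# Inspect the file names and URLs, then decide what (if anything) to fetch.
print(supp)
```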


Hi James,

Thanks for your response. I've never worked with the CEL files so wasn't sure exactly what they would contain. I also tried to access the tar.gz file with affy but then I got an error message which told me to use oligo. Still working on that...

Thanks


Yes, you won't be able to do anything with an Exon array using affy. You definitely want oligo, although there are well over 700 arrays, and if you are going to process all of them you will likely need to use aroma.affymetrix instead, which can handle that number of files. Or just use GEOquery to get the summarized data.
