Question: Processing two-colour array data from GEO using limma when no raw data are available
gravatar for Jason
24 months ago by
Jason20 wrote:

Hello all,

I am trying to work with GEO data obtained via getGEO and am running into difficulties. The data retrieval is no problem, but the issue is working with it when it arrives. This is a bit long, but I'm trying to be clear and thorough regarding the stumbling point. What I have thus far,

# install the GEO libraries

# retrieve series record. As simpler case, take a subset of knockdown data
gset <- getGEO("GSE6838")
gsetKD <- gset[[2]]

# only interested in four of these. Trim to records of interest
target_gsm <- c("GSM155604","GSM155605","GSM155602","GSM155603")
gsetKD <- gsetKD[,which(colnames(gsetKD) %in% target_gsm)]

# check the phenodata to make sure the desired entries returned
# (two different kinds of knockdowns for two different miRNAs)

# object returned is an ExpressionSet, with value log10 of Cy5/Cy3 ratio

From the attached notes, the researchers used Cy3 for all the mock (controls) and Cy5 for treatments. The ExpressionSet object that comes back from getGEO has intensities, which are apparently log10 of the ratio of INTENSITY2/INTENSITY1. Should I stop here and work with these log ratios? There is of course no p-value associated with any of these and since I am not interested in direct comparisons among the treatments, I don't see a model I can establish to compare treatment with the mock control since the object doesn't have channel data. I pull the first GSM record:

# manually pull the first GSM record
gsm <- getGEO("GSM155602")

To see that the value in the expression set from the GSE is the log10 of the ratio of INTENSITY2/INTENSITY1. Those intensities range from ~0.04 - 26. There are no raw data files attached to the record. Regarding the headers and processing,

#ID_REF = Rosetta generated unique probe identifier
#VALUE = same as UNF_VALUE but with flagged values removed
#LOGINTENSITY = Corrected average log intensity of channels
#INTENSITY1 = Cy3 intensity (CH1)
#INTENSITY2 = Cy5 intensity (CH2)
#PVALUE = P-value of LogRatio
#QUALITY = 1 - if good and non control, 0 - otherwise
#UNF_VALUE = Corrected Log10 Ratio of channels (CH2/CH1)

Data were processed using the Rosetta Resolver® system. Rosetta's custom feature extraction software performs error modeling before data are loaded into the Resolver system. The Resolver system performs a squeeze operation that combines replicates of the same sequence in an array while applying error weighting. The error weighting consists of adjusting for additive and multiplicative noise. A P-value is generated and propagated throughout the system. The P-value represents the probability that a gene is expressed. The Resolver system allows users to set thresholds, below which genes of a P-value are considered to be significantly expressed. The Resolver system also combines multiple arrays using a squeezing process. If multiple spots reference one sequence, summarization is performed using an error-weighted average as described in Roland Stoughton and Hongyue Dai, Statistical Combining of Cell Expression Profiles. US Patent #6,351,712, February 26, 2002.

so I assume the data have been background subtracted and have had some kind of normalization. I have gone through the limma users guide, but the workflows require data classes not readily creatable. For example, the complete flow starts from raw data, with uncorrected, not normalized Cy3, Cy5 with foreground and background data, which here is not available (to create an RG-type object). I would like to get to a form as described, for example, in section 8.3 of the limma guide (revision sep2015) with something like:

targKD <- data.frame(GSM = c("GSM155602", "GSM155603", "GSM155604","GSM155605"),
                     Cy3 = c("mock", "mock", "mock","mock"),
                     Cy5 = c("KD16", "KD16OME", "KD106b", "KD106bOME"))

design <- modelMatrix(targKD,ref="mock")

but I think I would need an MA type object. Although lmFit will take an ExpressionSet, there's the problem of the lack of channel data therein. The closest responses I have found to this problem are here Using limma to analyze GEO datasets/series from two-channel experiments and neither of which can fit here as there is no raw data. The only thing I can think of would be to brute force retrieve all the GSMs, extract the columns, rename them to the appropriate labels, try to manually create the RG.b object and try to pick up the workflow at the stage after background correction and within array normalization.Is this even worth it? These data are highly cited, but no one presents details of what they've done (eg. used these values as-is or some kind of processing). Any suggestions would be greatly appreciated.

R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
[5] LC_TIME=English_Canada.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] limma_3.26.5         GEOquery_2.36.0      Biobase_2.30.0      
[4] BiocGenerics_0.16.1  BiocInstaller_1.20.1

loaded via a namespace (and not attached):
[1] tools_3.2.3    RCurl_1.95-4.7 bitops_1.0-6   XML_3.98-1.3  


ADD COMMENTlink modified 24 months ago by Gordon Smyth32k • written 24 months ago by Jason20
gravatar for James W. MacDonald
24 months ago by
United States
James W. MacDonald45k wrote:

The default for getGEO() is to return the GSEMatrix, which by definition is just the log ratios. However, if you look on the page for that GEO entry, at the bottom it says that the raw data are included in the sample table. This means you need the soft format file. There is probably some slick way to do this, in which case Sean Davis will be along shortly to say how it's done. For now, my hackalicious methods...

> gset <- getGEO("GSE6838", GSEMatrix = FALSE)

> gsmlst <- GSMList(gset)

> gsmtab <- lapply(gsmlst, Table)

> head(gsmtab[[1]])
1 10019475365  0.0075      -0.6486     0.2230     0.2271 0.87095       1
2 10019481149 -0.0142      -0.2627     0.5552     0.5373 0.76468       1
3 10019495284  0.0442       0.6731     4.4974     4.9613 0.43645       1
4 10019687586  0.0042      -1.1251     0.0750     0.0756 0.92756       1
5 10019713746  0.0206      -0.7388     0.1782     0.1869 0.68370       1
6 10019799479  0.0078      -0.7730     0.1675     0.1704 0.87101       1
1    0.0075
2   -0.0142
3    0.0442
4    0.0042
5    0.0206
6    0.0078

EDIT: forgot this step, to subset to the arrays you care about.

> target_gsm <- c("GSM155604","GSM155605","GSM155602","GSM155603")
> gsmtab <- gsmtab[target_gsm]

So now we have a list, where each list item has what you need. Now we can hack into a MAList for use with limma. Note that the UNF_VALUE is the log ratio, and the LOGINTENSITY is the geometric mean.

> M <-, lapply(gsmtab, "[", 8))
> A <-, lapply(gsmtab, "[", 3))
## get annotation data
> gpls <- GPLList(gset)

> gns <- Table(gpls[[2]])
> dim(gns)
[1] 39302     7
## that's the same number of genes as our M and A, so I think that's the one we want
## now ensure they are in the same order - nobody loves it when
## annotation data is mismatched...
> gns <- gns[match(gsmtab[[1]][,1], gns$ID),1:3]
> dim(gns)
[1] 39014     3
> dim(M)
[1] 39014     4
## check it again
> all.equal(gns$ID, gsmtab[[1]][,1])
[1] TRUE
## OK, proceed

> malst <- new("MAList", list(M = M, A = A, genes = gns))
> malst
An object of class "MAList"
     GSM155604 GSM155605 GSM155602 GSM155603
[1,]   -0.0143   -0.0159    0.0111   -0.0067
[2,]   -0.0124   -0.0066   -0.0162   -0.0066
[3,]   -0.0054   -0.0851    0.1339   -0.0835
[4,]   -0.0205   -0.0182    0.0291   -0.0941
[5,]   -0.0111    0.0003    0.0360   -0.0075
39009 more rows ...

1      -0.6550      -0.6289      -0.6497      -0.6247
2      -0.3337      -0.3376      -0.3092      -0.2973
3       0.6413       0.6018       0.6843       0.5978
4      -1.0961      -1.1002      -1.0294      -1.1724
5      -0.8285      -0.6902      -0.6797      -0.6599
39009 more rows ...

               ID EntrezGeneID ORF
1167  10019475365                 
10874 10019481149                 
9537  10019495284                 
17100 10019687586                 
11685 10019713746                 
39009 more rows ...

If you do plotDensities(malst), then it appears these data are normalized, background corrected data. So you should be able to proceed from this step forward as if you have a background corrected, normalized MAList.

ADD COMMENTlink modified 24 months ago • written 24 months ago by James W. MacDonald45k

Thanks a lot - I appreciate all this effort. I was stuck at the MA object, as they otherwise seem to be automatically generated, and didn't know how to "hack" one.

ADD REPLYlink written 24 months ago by Jason20
gravatar for Gordon Smyth
24 months ago by
Gordon Smyth32k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth32k wrote:

Getting the raw data is no doubt better, but limma will conduct a complete differential expression analysis using just the matrix of log-ratios. The advantage of an MAList is to get information about absolute expression levels. While that is very useful, you do not need it to construct the design matrix.

ADD COMMENTlink modified 24 months ago • written 24 months ago by Gordon Smyth32k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 143 users visited in the last hour