Processing two-colour array data from GEO using limma when no raw data are available
2
1
Entering edit mode
Jason ▴ 20
@jason-9567
Last seen 6.1 years ago
Germany

Hello all,

I am trying to work with GEO data obtained via getGEO and am running into difficulties. The data retrieval is no problem, but the issue is working with it when it arrives. This is a bit long, but I'm trying to be clear and thorough regarding the stumbling point. What I have thus far,

source("http://bioconductor.org/biocLite.R")
# install the GEO libraries
biocLite("Biobase")
biocLite("GEOquery")
biocLite("limma")
library(Biobase)
library(GEOquery)
library(limma)

# retrieve series record. As simpler case, take a subset of knockdown data
gset <- getGEO("GSE6838")
gsetKD <- gset[[2]]

# only interested in four of these. Trim to records of interest
target_gsm <- c("GSM155604","GSM155605","GSM155602","GSM155603")
gsetKD <- gsetKD[,which(colnames(gsetKD) %in% target_gsm)]

# check the phenodata to make sure the desired entries returned
# (two different kinds of knockdowns for two different miRNAs)
pData(gsetKD)[,c("characteristics_ch1","characteristics_ch2")]

# object returned is an ExpressionSet, with value log10 of Cy5/Cy3 ratio
head(exprs(gsetKD))

From the attached notes, the researchers used Cy3 for all the mock (controls) and Cy5 for treatments. The ExpressionSet object that comes back from getGEO has intensities, which are apparently log10 of the ratio of INTENSITY2/INTENSITY1. Should I stop here and work with these log ratios? There is of course no p-value associated with any of these and since I am not interested in direct comparisons among the treatments, I don't see a model I can establish to compare treatment with the mock control since the object doesn't have channel data. I pull the first GSM record:

# manually pull the first GSM record
gsm <- getGEO("GSM155602")
head(Table(gsm))

To see that the value in the expression set from the GSE is the log10 of the ratio of INTENSITY2/INTENSITY1. Those intensities range from ~0.04 - 26. There are no raw data files attached to the record. Regarding the headers and processing,

#ID_REF = Rosetta generated unique probe identifier #VALUE = same as UNF_VALUE but with flagged values removed #LOGINTENSITY = Corrected average log intensity of channels #INTENSITY1 = Cy3 intensity (CH1) #INTENSITY2 = Cy5 intensity (CH2) #PVALUE = P-value of LogRatio #QUALITY = 1 - if good and non control, 0 - otherwise #UNF_VALUE = Corrected Log10 Ratio of channels (CH2/CH1)

Data were processed using the Rosetta Resolver® system. Rosetta's custom feature extraction software performs error modeling before data are loaded into the Resolver system. The Resolver system performs a squeeze operation that combines replicates of the same sequence in an array while applying error weighting. The error weighting consists of adjusting for additive and multiplicative noise. A P-value is generated and propagated throughout the system. The P-value represents the probability that a gene is expressed. The Resolver system allows users to set thresholds, below which genes of a P-value are considered to be significantly expressed. The Resolver system also combines multiple arrays using a squeezing process. If multiple spots reference one sequence, summarization is performed using an error-weighted average as described in Roland Stoughton and Hongyue Dai, Statistical Combining of Cell Expression Profiles. US Patent #6,351,712, February 26, 2002.

so I assume the data have been background subtracted and have had some kind of normalization. I have gone through the limma users guide, but the workflows require data classes not readily creatable. For example, the complete flow starts from raw data, with uncorrected, not normalized Cy3, Cy5 with foreground and background data, which here is not available (to create an RG-type object). I would like to get to a form as described, for example, in section 8.3 of the limma guide (revision sep2015) with something like:

targKD <- data.frame(GSM = c("GSM155602", "GSM155603", "GSM155604","GSM155605"),
Cy3 = c("mock", "mock", "mock","mock"),
Cy5 = c("KD16", "KD16OME", "KD106b", "KD106bOME"))

print(targKD)
design <- modelMatrix(targKD,ref="mock")
print(design)

but I think I would need an MA type object. Although lmFit will take an ExpressionSet, there's the problem of the lack of channel data therein. The closest responses I have found to this problem are here Using limma to analyze GEO datasets/series from two-channel experiments and https://www.biostars.org/p/9372/ neither of which can fit here as there is no raw data. The only thing I can think of would be to brute force retrieve all the GSMs, extract the columns, rename them to the appropriate labels, try to manually create the RG.b object and try to pick up the workflow at the stage after background correction and within array normalization.Is this even worth it? These data are highly cited, but no one presents details of what they've done (eg. used these values as-is or some kind of processing). Any suggestions would be greatly appreciated.

R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] limma_3.26.5         GEOquery_2.36.0      Biobase_2.30.0
[4] BiocGenerics_0.16.1  BiocInstaller_1.20.1

loaded via a namespace (and not attached):
[1] tools_3.2.3    RCurl_1.95-4.7 bitops_1.0-6   XML_3.98-1.3


limma geoquery • 2.9k views
4
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

The default for getGEO() is to return the GSEMatrix, which by definition is just the log ratios. However, if you look on the page for that GEO entry, at the bottom it says that the raw data are included in the sample table. This means you need the soft format file. There is probably some slick way to do this, in which case Sean Davis will be along shortly to say how it's done. For now, my hackalicious methods...

> gset <- getGEO("GSE6838", GSEMatrix = FALSE)

> gsmlst <- GSMList(gset)

> gsmtab <- lapply(gsmlst, Table)

ID_REF   VALUE LOGINTENSITY INTENSITY1 INTENSITY2  PVALUE QUALITY
1 10019475365  0.0075      -0.6486     0.2230     0.2271 0.87095       1
2 10019481149 -0.0142      -0.2627     0.5552     0.5373 0.76468       1
3 10019495284  0.0442       0.6731     4.4974     4.9613 0.43645       1
4 10019687586  0.0042      -1.1251     0.0750     0.0756 0.92756       1
5 10019713746  0.0206      -0.7388     0.1782     0.1869 0.68370       1
6 10019799479  0.0078      -0.7730     0.1675     0.1704 0.87101       1
UNF_VALUE
1    0.0075
2   -0.0142
3    0.0442
4    0.0042
5    0.0206
6    0.0078

EDIT: forgot this step, to subset to the arrays you care about.

> target_gsm <- c("GSM155604","GSM155605","GSM155602","GSM155603")
> gsmtab <- gsmtab[target_gsm]

So now we have a list, where each list item has what you need. Now we can hack into a MAList for use with limma. Note that the UNF_VALUE is the log ratio, and the LOGINTENSITY is the geometric mean.

> M <- do.call(cbind, lapply(gsmtab, "[", 8))
> A <- do.call(cbind, lapply(gsmtab, "[", 3))
## get annotation data
> gpls <- GPLList(gset)

> gns <- Table(gpls[[2]])
> dim(gns)
[1] 39302     7
## that's the same number of genes as our M and A, so I think that's the one we want
## now ensure they are in the same order - nobody loves it when
## annotation data is mismatched...
> gns <- gns[match(gsmtab[[1]][,1], gns$ID),1:3] > dim(gns) [1] 39014 3 > dim(M) [1] 39014 4 ## check it again > all.equal(gns$ID, gsmtab[[1]][,1])
[1] TRUE
## OK, proceed

> malst <- new("MAList", list(M = M, A = A, genes = gns))
> malst
An object of class "MAList"
$M GSM155604 GSM155605 GSM155602 GSM155603 [1,] -0.0143 -0.0159 0.0111 -0.0067 [2,] -0.0124 -0.0066 -0.0162 -0.0066 [3,] -0.0054 -0.0851 0.1339 -0.0835 [4,] -0.0205 -0.0182 0.0291 -0.0941 [5,] -0.0111 0.0003 0.0360 -0.0075 39009 more rows ...$A
1      -0.6550      -0.6289      -0.6497      -0.6247
2      -0.3337      -0.3376      -0.3092      -0.2973
3       0.6413       0.6018       0.6843       0.5978
4      -1.0961      -1.1002      -1.0294      -1.1724
5      -0.8285      -0.6902      -0.6797      -0.6599
39009 more rows ...

\$genes
ID EntrezGeneID ORF
1167  10019475365
10874 10019481149
9537  10019495284
17100 10019687586
11685 10019713746
39009 more rows ...

If you do plotDensities(malst), then it appears these data are normalized, background corrected data. So you should be able to proceed from this step forward as if you have a background corrected, normalized MAList.

0
Entering edit mode

Thanks a lot - I appreciate all this effort. I was stuck at the MA object, as they otherwise seem to be automatically generated, and didn't know how to "hack" one.

1
Entering edit mode
@gordon-smyth
Last seen just now
WEHI, Melbourne, Australia

Getting the raw data is no doubt better, but limma will conduct a complete differential expression analysis using just the matrix of log-ratios. The advantage of an MAList is to get information about absolute expression levels. While that is very useful, you do not need it to construct the design matrix.

0
Entering edit mode

Hi, I am very interested in your saying that limma can do DEG using matrix of log-ratios. I do have log10(ratio) data and don't know how limma should be performed. Any idea? If you have reference, it is even better.

0
Entering edit mode

It would be best to spend some time reading the documentation for limma. If you have specific questions after going through the limma vignette, consider asking a new question rather than commenting on old questions.