Hello all,
I am trying to work with GEO data obtained via getGEO and am running into difficulties. The data retrieval is no problem, but the issue is working with it when it arrives. This is a bit long, but I'm trying to be clear and thorough regarding the stumbling point. What I have thus far,
source("http://bioconductor.org/biocLite.R") # install the GEO libraries biocLite("Biobase") biocLite("GEOquery") biocLite("limma") library(Biobase) library(GEOquery) library(limma)
# retrieve series record. As simpler case, take a subset of knockdown data gset <- getGEO("GSE6838") gsetKD <- gset[[2]]
# only interested in four of these. Trim to records of interest target_gsm <- c("GSM155604","GSM155605","GSM155602","GSM155603") gsetKD <- gsetKD[,which(colnames(gsetKD) %in% target_gsm)]
# check the phenodata to make sure the desired entries returned # (two different kinds of knockdowns for two different miRNAs) pData(gsetKD)[,c("characteristics_ch1","characteristics_ch2")]
# object returned is an ExpressionSet, with value log10 of Cy5/Cy3 ratio head(exprs(gsetKD))
From the attached notes, the researchers used Cy3 for all the mock (controls) and Cy5 for treatments. The ExpressionSet object that comes back from getGEO has intensities, which are apparently log10 of the ratio of INTENSITY2/INTENSITY1. Should I stop here and work with these log ratios? There is of course no p-value associated with any of these and since I am not interested in direct comparisons among the treatments, I don't see a model I can establish to compare treatment with the mock control since the object doesn't have channel data. I pull the first GSM record:
# manually pull the first GSM record
gsm <- getGEO("GSM155602")
head(Table(gsm))
To see that the value in the expression set from the GSE is the log10 of the ratio of INTENSITY2/INTENSITY1. Those intensities range from ~0.04 - 26. There are no raw data files attached to the record. Regarding the headers and processing,
#ID_REF = Rosetta generated unique probe identifier
#VALUE = same as UNF_VALUE but with flagged values removed
#LOGINTENSITY = Corrected average log intensity of channels
#INTENSITY1 = Cy3 intensity (CH1)
#INTENSITY2 = Cy5 intensity (CH2)
#PVALUE = P-value of LogRatio
#QUALITY = 1 - if good and non control, 0 - otherwise
#UNF_VALUE = Corrected Log10 Ratio of channels (CH2/CH1)
Data were processed using the Rosetta Resolver® system. Rosetta's custom feature extraction software performs error modeling before data are loaded into the Resolver system. The Resolver system performs a squeeze operation that combines replicates of the same sequence in an array while applying error weighting. The error weighting consists of adjusting for additive and multiplicative noise. A P-value is generated and propagated throughout the system. The P-value represents the probability that a gene is expressed. The Resolver system allows users to set thresholds, below which genes of a P-value are considered to be significantly expressed. The Resolver system also combines multiple arrays using a squeezing process. If multiple spots reference one sequence, summarization is performed using an error-weighted average as described in Roland Stoughton and Hongyue Dai, Statistical Combining of Cell Expression Profiles. US Patent #6,351,712, February 26, 2002.
so I assume the data have been background subtracted and have had some kind of normalization. I have gone through the limma users guide, but the workflows require data classes not readily creatable. For example, the complete flow starts from raw data, with uncorrected, not normalized Cy3, Cy5 with foreground and background data, which here is not available (to create an RG-type object). I would like to get to a form as described, for example, in section 8.3 of the limma guide (revision sep2015) with something like:
targKD <- data.frame(GSM = c("GSM155602", "GSM155603", "GSM155604","GSM155605"), Cy3 = c("mock", "mock", "mock","mock"), Cy5 = c("KD16", "KD16OME", "KD106b", "KD106bOME")) print(targKD) design <- modelMatrix(targKD,ref="mock") print(design)
but I think I would need an MA type object. Although lmFit will take an ExpressionSet, there's the problem of the lack of channel data therein. The closest responses I have found to this problem are here Using limma to analyze GEO datasets/series from two-channel experiments and https://www.biostars.org/p/9372/ neither of which can fit here as there is no raw data. The only thing I can think of would be to brute force retrieve all the GSMs, extract the columns, rename them to the appropriate labels, try to manually create the RG.b object and try to pick up the workflow at the stage after background correction and within array normalization.Is this even worth it? These data are highly cited, but no one presents details of what they've done (eg. used these values as-is or some kind of processing). Any suggestions would be greatly appreciated.
R version 3.2.3 (2015-12-10) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) locale: [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C [5] LC_TIME=English_Canada.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.26.5 GEOquery_2.36.0 Biobase_2.30.0 [4] BiocGenerics_0.16.1 BiocInstaller_1.20.1 loaded via a namespace (and not attached): [1] tools_3.2.3 RCurl_1.95-4.7 bitops_1.0-6 XML_3.98-1.3 |
|
|
Thanks a lot - I appreciate all this effort. I was stuck at the MA object, as they otherwise seem to be automatically generated, and didn't know how to "hack" one.