Question

Getting annotated and normalized data.

jslow • 0

Hi,

I am trying to analyze microarray data with R and am stuck at the annotation step. I was wondering if anyone could help.

Here's the code I have so far:

library(affycoretools)

library(oligo)

OligoRaw <- read.celfiles(filenames = list.celfiles())
OligoEset <- rma(OligoRaw) # 35556 features, 24 samples
data.oligo <- exprs(OligoEset)

library(mogene10sttranscriptcluster.db)
library(pd.mogene.1.0.st.v1)

ID <- getMainProbes(OligoEset)
annot <- select(mogene10sttranscriptcluster.db, featureNames(ID), c("SYMBOL","GENENAME","ENTREZID")) # 36631

I am stuck trying to merge "annot" with "OligoEset". I would like to have the annotated and normalized data set as a data frame or a .txt/.xls file to analyze.

I'd very much appreciate any help.

Thanks,

Jun

microarray rstudio annotation • 1.1k views

updated 9.0 years ago by James W. MacDonald 65k • written 9.0 years ago by jslow • 0

score 0 · Answer 1 · 2015-04-13

You actually don't want the annotated and normalized data in any of those forms. If you are going to use Bioconductor to analyze your data, then you need to learn to use the tools that are supplied.

The ExpressionSet containing your data is a perfect input to, say, the limma package. So you now need to define what comparisons you want to make, and express that as a design matrix. See the limma User's Guide.

What you would tend to do is something like

design <- model.matrix(~<args go here>)

fit <- lmFit(ID, design) # ID is the main-probe ExpressionSet that annot was built from, so rows stay aligned

fit2 <- eBayes(fit)
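To make the design matrix placeholder more concrete, here is a minimal sketch in base R. It assumes a hypothetical layout of 12 control samples followed by 12 treated samples; the group labels and their order are made up for illustration and are not taken from the original post, so substitute your own sample information.

```r
# Hypothetical layout: 24 samples, 12 control followed by 12 treated.
# The group labels are an assumption for illustration only.
group <- factor(rep(c("control", "treated"), each = 12),
                levels = c("control", "treated"))

# Intercept = mean of the control group; second column = treated - control.
design <- model.matrix(~ group)

dim(design)        # 24 rows (samples), 2 columns (coefficients)
colnames(design)   # "(Intercept)" "grouptreated"
```

With a design like this, the second coefficient is the treated versus control difference, which is what coef = 2 would select in topTable().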

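You will have duplicates in your annot data.frame (some probesets map to more than one gene, which is why annot has more rows than there are features), so you have to deal with that. The most naive thing you could do is choose the first one: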

annot <- annot[!duplicated(annot[,1]),]

fit2$genes <- annot

Now your topTable() output will have annotations, as well as statistics.

topTable(fit2, coef = 2)
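The de-duplication step can be illustrated with toy data. The sketch below uses made-up probeset IDs, symbols, and expression values as stand-ins for featureNames(ID), the select() output, and the normalized data; it keeps the first annotation per probeset, checks that the annotation rows line up with the data rows, and binds the two into one table.

```r
# Toy stand-ins (made-up IDs and values) for featureNames(ID) and the select() output.
probe_ids <- c("101", "102", "103")
annot <- data.frame(
  PROBEID = c("101", "102", "102", "103"),
  SYMBOL  = c("GeneA", "GeneB", "GeneC", NA),
  stringsAsFactors = FALSE
)

# Keep the first annotation row for each probeset ID.
annot <- annot[!duplicated(annot$PROBEID), ]

# Reorder explicitly so row i of annot describes feature i of the data.
annot <- annot[match(probe_ids, annot$PROBEID), ]
stopifnot(identical(annot$PROBEID, probe_ids))

# Toy expression matrix with one row per probeset and two samples.
expr <- matrix(c(7.1, 8.3, 6.0, 7.4, 8.1, 6.2), nrow = 3,
               dimnames = list(probe_ids, c("s1", "s2")))

# One annotated table: annotation columns followed by expression values.
annotated <- cbind(annot, expr)
annotated
```

The explicit match() is a defensive step: it costs little and guards against the annotation and the data ending up in different row orders.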

score 0 · Answer 2 · 2015-04-14

Sure. If you are already using affycoretools, see ?writeFit.

I am not, in general, enthused about giving normalized data to 'laymen' so they can make their own analyses. In other words, generating summarized data from raw celfiles is not usually the part of the analysis that requires the most sophistication (although the QC part does take some base knowledge). Instead, fitting models to the data and ensuring that statistically unsophisticated collaborators understand what was done and why is the main deliverable in my line of work.

Because of that, I much prefer giving people either HTML or Excel spreadsheets that already contain the comparisons they wanted. The ReportingTools package makes it very easy to generate HTML tables that are easy to work with. The openxlsx package makes it easy to output Excel spreadsheets directly, which allows you to circumvent Excel's tendency to convert gene symbols that look like dates into actual dates, when people import data incorrectly (as an example, SEPT1 is helpfully converted to 9/1/2015, because obviously).
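As a rough sketch of the direct-to-Excel route, the following writes a small table with openxlsx and reads it back. The table contents and the output path are made up for illustration, and the code assumes openxlsx is installed (the write is skipped otherwise).

```r
# Toy results table (made-up values); SEPT1 is the kind of symbol Excel may turn into a date.
results <- data.frame(
  SYMBOL = c("SEPT1", "MARCH1", "Actb"),
  logFC  = c(1.2, -0.8, 0.1),
  stringsAsFactors = FALSE
)

out_file <- file.path(tempdir(), "results.xlsx")  # hypothetical output path

# openxlsx is optional here; the write is skipped if it is not installed.
if (requireNamespace("openxlsx", quietly = TRUE)) {
  openxlsx::write.xlsx(results, out_file)
  # Read the file back to check that the symbols survived as text.
  print(openxlsx::read.xlsx(out_file)$SYMBOL)
}
```

Because the character column is written as text cells, there is no text-import step during which Excel could reinterpret the symbols.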