Question

Doubts about manipulation and annotation of microarray files deposited at GEO

Michele Claire • 0

Portugal

Dear colleagues, I need help.

I have no experience with microarray data deposited in GEO, and I have some questions.

I don't want to do a differential expression analysis; I want to annotate the probes with gene names and obtain an expression value for each gene in each sample. I intend to plot a heatmap of all samples for a selection of genes. I already have a script for the plotting.

I did a "manual annotation" using the spreadsheet VLOOKUP function (PROCV in Portuguese) and observed that some genes are represented by more than one probe, with different expression values. How should I handle this type of data?

My other question is how to annotate datasets like GSE77930, in which the probe IDs in the file with expression values differ from the IDs in the GPL21289 platform annotation file.

Thanks in advance to anyone willing to help me. Best regards,

Michele Breton

MicroarrayData Annotation GEOdata • 1.1k views

Updated 2.9 years ago by James W. MacDonald 68k • written 2.9 years ago by Michele Claire • 0

score 2 · Answer 1 · 2023-02-08

Your first question is something that you will have to answer for yourself. There are any number of reasons that an array manufacturer will add multiple probes for the same gene. You could hypothetically do a deep dive on the array and inspect each of the duplicated probes and decide for yourself which one is to be preferred (or if they are equivalent) and use that information to decide which one(s) to retain. Or you could use the probes with the highest overall intensity (they bind better, so maybe they ARE better?). Or you could just average them. Or just randomly exclude one. Each has tradeoffs, and since you are doing the work, it's up to you to decide.
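To make two of those options concrete, here is a minimal sketch in plain Python that collapses probes to genes either by averaging or by keeping the probe with the highest overall intensity. The probe IDs, gene symbols, expression values, and function names are all made up for illustration; real data would come from the expression matrix and the platform annotation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: probe ID -> expression values across three samples.
expression = {
    "probe_1": [8.0, 9.0, 10.0],
    "probe_2": [4.0, 5.0, 6.0],
    "probe_3": [7.0, 7.0, 7.0],
}
# Hypothetical annotation: probe ID -> gene symbol.
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}


def group_probes(expression, probe_to_gene):
    """Group probe IDs by gene, skipping probes with no annotation."""
    groups = defaultdict(list)
    for probe in expression:
        gene = probe_to_gene.get(probe)
        if gene:
            groups[gene].append(probe)
    return groups


def collapse_by_mean(expression, probe_to_gene):
    """Average all probes of a gene, sample by sample."""
    result = {}
    for gene, probes in group_probes(expression, probe_to_gene).items():
        columns = zip(*(expression[p] for p in probes))
        result[gene] = [mean(col) for col in columns]
    return result


def collapse_by_top_probe(expression, probe_to_gene):
    """Keep only the probe with the highest mean intensity per gene."""
    result = {}
    for gene, probes in group_probes(expression, probe_to_gene).items():
        best = max(probes, key=lambda p: mean(expression[p]))
        result[gene] = list(expression[best])
    return result


averaged = collapse_by_mean(expression, probe_to_gene)
top_probe = collapse_by_top_probe(expression, probe_to_gene)
print(averaged)
print(top_probe)
```

Either result has one row per gene and can go straight into a heatmap script. Which rule is appropriate is still the judgment call described above; the code only shows the mechanics.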

For your second question, the experiment you link to used two different Agilent arrays. There are 320 total arrays, and some unknown (to me) number were run on one platform, and some on the other. It's unclear to me what might be in the series matrix file, particularly since one array has over 411K probes and the other has around 38K probes. That sounds like a situation where downloading the raw data and processing separately is the smart play. The limma package is your friend in that case.
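One quick way to see what a series matrix file actually contains is to check what fraction of its probe IDs appear in each platform's annotation table. The sketch below uses plain Python; the IDs and platform names are placeholders, not values taken from GSE77930 or its platforms.

```python
def id_overlap(matrix_ids, platform_ids):
    """Fraction of expression-matrix probe IDs found in a platform annotation."""
    matrix_ids = set(matrix_ids)
    if not matrix_ids:
        return 0.0
    return len(matrix_ids & set(platform_ids)) / len(matrix_ids)


# Hypothetical probe IDs read from the expression file.
matrix_ids = ["A_1", "A_2", "A_3", "X_9"]

# Hypothetical ID columns from two platform annotation tables.
platforms = {
    "platform_one": ["A_1", "A_2", "A_3", "A_4"],
    "platform_two": ["1", "2", "3"],
}

overlaps = {name: id_overlap(matrix_ids, ids) for name, ids in platforms.items()}
for name, fraction in overlaps.items():
    print(f"{name}: {fraction:.0%} of matrix IDs matched")
```

A low match against every ID column suggests the expression file is keyed on a different column of the annotation table (or on a different platform), which is worth ruling out before going back to the raw data.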