We wish to analyse an Exon Array dataset we obtained from a public source (unfortunately not GEO). The data we have is a matrix of RMA normalised expression values from some 400 Exon arrays summarized at the probeset level. We are only interested in the gene level and wondered if there is any way to summarize to the gene level from this starting point?
If I took the mean of the probesets for each gene, do you think the resulting values would be valid for downstream limma analysis?
I actually have no idea. It's certainly one thing you can do, and it might not be the worst idea in the world, but ideally you would do some conventional EDA (exploratory data analysis) first to see if it looks like taking means is a reasonable thing to do.
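To make that concrete, here is a minimal sketch of the kind of check I mean, in plain Python just for illustration. All of the data and names (`expr`, `probeset_to_gene`, `summarize_to_gene`, the probeset and gene IDs) are made up, and I'm assuming you have some probeset-to-gene annotation to hand. It collapses probesets to genes by mean or median, then correlates each probeset with its gene-level median across samples; probesets that track poorly are the ones that would make a plain mean questionable.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical toy data: probeset ID -> log2 RMA values across 4 samples.
expr = {
    "ps1": [7.0, 7.2, 8.9, 9.1],
    "ps2": [6.8, 7.0, 8.7, 9.3],
    "ps3": [3.1, 3.0, 3.2, 2.9],  # dim probeset, near background
    "ps4": [5.0, 5.4, 5.2, 5.0],
}
# Hypothetical annotation: probeset ID -> gene symbol.
probeset_to_gene = {"ps1": "GENE_A", "ps2": "GENE_A", "ps3": "GENE_A", "ps4": "GENE_B"}


def summarize_to_gene(expr, mapping, stat=mean):
    """Collapse probeset rows to one row per gene, sample by sample."""
    rows_by_gene = defaultdict(list)
    for ps, values in expr.items():
        gene = mapping.get(ps)
        if gene is not None:  # skip probesets with no gene annotation
            rows_by_gene[gene].append(values)
    return {
        gene: [stat(column) for column in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }


def pearson(x, y):
    """Pearson correlation; NaN if either vector has zero variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


gene_mean = summarize_to_gene(expr, probeset_to_gene, stat=mean)
gene_median = summarize_to_gene(expr, probeset_to_gene, stat=median)

# EDA: how well does each probeset track its gene-level median?
agreement = {
    ps: pearson(values, gene_median[probeset_to_gene[ps]])
    for ps, values in expr.items()
}
for ps, r in agreement.items():
    print(ps, probeset_to_gene[ps], round(r, 3))
```

In the toy data the dim probeset drags the mean for GENE_A down but leaves the median alone, which is one reason to look at both before committing to either.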
An alternative would be to make comparisons at the probeset (exon-ish) level, and look for consistent differences over the set of probesets for each gene. The downsides to that approach are that the probesets (usually) have only four probes each, and the Exon arrays are much dimmer than the old 3'-biased arrays, so you have to wonder about the signal-to-noise ratio with just four dim probes per probeset. You also increase the multiplicity burden quite a bit, which will not help things at all.
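A rough sketch of what "consistent differences over the set of probesets" could look like, again in plain Python with made-up data and made-up names (`group_diff`, `sign_consistency`, the two-group `labels`). It is not a substitute for a proper limma fit, just an illustration: it takes a per-probeset difference in group means, then asks what fraction of a gene's probesets move in the same direction. The last two lines use a Bonferroni threshold purely to show how the number of tests changes the per-test cutoff.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical design: two control and two treated samples.
labels = ["ctl", "ctl", "trt", "trt"]

# Hypothetical probeset-level log2 expression values.
expr = {
    "ps1": [7.0, 7.2, 8.9, 9.1],
    "ps2": [6.8, 7.0, 8.7, 9.3],
    "ps3": [3.1, 3.0, 2.9, 2.8],  # dim probeset moving the other way
    "ps4": [5.0, 5.4, 5.2, 5.0],
}
probeset_to_gene = {"ps1": "GENE_A", "ps2": "GENE_A", "ps3": "GENE_A", "ps4": "GENE_B"}


def group_diff(values, labels, a="ctl", b="trt"):
    """Difference in mean log2 expression, group b minus group a."""
    in_a = [v for v, g in zip(values, labels) if g == a]
    in_b = [v for v, g in zip(values, labels) if g == b]
    return mean(in_b) - mean(in_a)


def sign_consistency(expr, mapping, labels):
    """Per gene: probeset count, median difference, fraction agreeing in sign."""
    diffs_by_gene = defaultdict(list)
    for ps, values in expr.items():
        diffs_by_gene[mapping[ps]].append(group_diff(values, labels))
    summary = {}
    for gene, diffs in diffs_by_gene.items():
        n_up = sum(d > 0 for d in diffs)
        n_down = sum(d < 0 for d in diffs)
        summary[gene] = {
            "n_probesets": len(diffs),
            "median_diff": median(diffs),
            "frac_same_sign": max(n_up, n_down) / len(diffs),
        }
    return summary


result = sign_consistency(expr, probeset_to_gene, labels)

# Multiplicity: one test per probeset versus one test per gene.
alpha = 0.05
bonferroni_probeset = alpha / len(expr)
bonferroni_gene = alpha / len(result)
print(result)
print(bonferroni_probeset, bonferroni_gene)
```

On a real Exon array the probeset count is several times the gene count, so the probeset-level cutoff shrinks accordingly.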
Ideally you would go back to whoever submitted the data, and they would be oh so happy to supply you with the original CEL files. Is that in the cards?
I'm going to ask. But the data comes from a massive consortium, and has been around for some time without being uploaded to GEO or similar, or even published. They have lots of data sets I'd like to get my hands on, but they only make summarized versions of them available, which is annoying because I'd really like to study genes that they have excluded from their analyses.