Question

How to combine expression values of multiple probes for one gene

ayanava18 ▴ 10

@ayanava18-8418

Last seen 10.5 years ago

India

I am a bit new to R Bioconductor and microarray analysis.

I have loaded a GEO series matrix file (GSE2990) from the GEO database into R/Bioconductor. This dataset contains expression values for 22283 probes. I wish to get the expression values for the genes in the dataset. Since there are multiple probes for an individual gene in many cases, I would like to know if there is a package or R code that can combine the expression values of multiple probes for the same gene. Also, does oneChannelGUI have this feature?

probe convert geneexpression onechannelgui • 13k views

updated 10.5 years ago by svlachavas ▴ 840 • written 10.5 years ago by ayanava18 ▴ 10

score 0 · Answer 1 · 2015-07-19

Dear Ayanava,

First, which specific platform does your experiment use? Second, handling duplicate probesets (I assume you mean probesets after normalization), that is, probesets that map to the same gene, is not straightforward: there are many options, and deciding which one is the most appropriate remains challenging. For instance, you could use the average or the median value of the duplicated probesets.

On the other hand, in my opinion (and other people may well offer more useful or alternative suggestions), every probeset represents a gene only in a simplified way, since every probeset (with its associated probes) interrogates an expressed sequence. Some probesets may not be annotated to any gene, or may be associated with multiple potential target sequences. Most importantly, some genes are represented by several probesets, each of which may interrogate a different mRNA transcript of that gene. So I believe it is wiser not to average across the probesets mapping to the same gene: they may correspond to alternative transcripts or splice forms, whose expression need not correlate.

Personally, with the Affymetrix and Illumina oligonucleotide arrays I have worked with, I used the median absolute deviation (MAD), a measure of dispersion that is robust to outliers, and kept the probeset with the highest MAD for each gene. So you can use the following example:

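# Toy data: Value stands in for a per-probeset statistic such as the MAD
Probesets <- paste("a", 1:200, sep = "")        # "fake probesets"
Genes <- sample(letters, 200, replace = TRUE)   # gene symbols, with duplicates
Value <- rnorm(200)
X <- data.frame(Probesets, Genes, Value)
X <- X[order(X$Value, decreasing = TRUE), ]     # highest Value first
Y <- X[!duplicated(X$Genes), ]                  # keep the top probeset per gene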

So in the end Y is a data.frame with one row per unique gene symbol, each linked to the probeset that has the highest value (the MAD, in a real analysis) among its duplicates.
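In the toy example above, Value is just random numbers; with real data it would be the MAD of each probeset computed across samples. Here is a minimal sketch of that step, assuming a numeric matrix `expr` (probesets in rows, samples in columns) and a character vector `gene_of` holding the gene symbol for each row. Both names and the simulated data are hypothetical, not taken from GSE2990:

```r
set.seed(1)

# Hypothetical inputs: 200 probesets x 10 samples of simulated values
expr <- matrix(rnorm(200 * 10), nrow = 200,
               dimnames = list(paste0("a", 1:200), paste0("s", 1:10)))
gene_of <- sample(letters, 200, replace = TRUE)   # gene symbol per probeset

# MAD of each probeset across samples (stats::mad)
probe_mad <- apply(expr, 1, mad)

# Order probesets by decreasing MAD, then keep the first one seen per gene
ord  <- order(probe_mad, decreasing = TRUE)
keep <- ord[!duplicated(gene_of[ord])]

# One row per gene, labelled by gene symbol
expr_by_gene <- expr[keep, , drop = FALSE]
rownames(expr_by_gene) <- gene_of[keep]
```

With real annotation you would probably want to drop probesets that have no gene symbol (NA) before this step.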

To sum up, there is no single best option: the right choice is more complex than it looks and depends on the study and on the tools you use.
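For comparison, the average or median option mentioned earlier could be sketched in base R as follows. This again assumes a hypothetical matrix `expr` and a gene vector `gene_of` (illustrative names and simulated data), and it only makes sense if you believe the probesets for a gene measure the same transcript:

```r
set.seed(1)

# Hypothetical inputs: 6 probesets x 3 samples, mapping to 3 genes
expr <- matrix(rnorm(6 * 3), nrow = 6,
               dimnames = list(paste0("p", 1:6), paste0("s", 1:3)))
gene_of <- c("A", "A", "B", "C", "C", "C")

# Split row indices by gene, then take the per-sample median within each gene
idx <- split(seq_len(nrow(expr)), gene_of)
expr_median <- t(sapply(idx, function(i)
  apply(expr[i, , drop = FALSE], 2, median)))

# Same idea with the mean: rowsum() adds rows per gene, then divide by group size
expr_mean <- rowsum(expr, gene_of) / as.vector(table(gene_of))
```

Both results have one row per gene and one column per sample.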