How to combine expression values of multiple probes for one gene
ayanava18 ▴ 10
@ayanava18-8418
Last seen 6.2 years ago
India

I am fairly new to R/Bioconductor and microarray analysis.

I have loaded a GEO series matrix file (GSE2990) from the GEO database into R. This dataset contains expression values for 22283 probes, and I would like to get expression values per gene. Since many genes are represented by multiple probes, I would like to know whether there is a package or R code that can combine the expression values of multiple probes for the same gene. Also, does oneChannelGUI have this feature?

probe convert geneexpression onechannelgui • 8.4k views
svlachavas ▴ 780
@svlachavas-7225
Last seen 6 weeks ago
Germany/Heidelberg/German Cancer Resear…

Dear Ayanava,

First, which specific platform does your experiment use? Second, handling duplicate probesets (I believe you mean probesets after normalization), that is, probesets that map to the same gene, is a complex issue: there are many options, and which one is most appropriate remains an open question. For instance, you could use the average or the median value of the duplicated probesets.

On the other hand, in my opinion (and other people may offer more useful or alternative suggestions), every probeset can be seen as representing a gene in a simple way, since every probeset, with its associated probes, interrogates an expressed sequence. Some probes may not be annotated to any gene, or may match multiple potential target sequences. Most importantly, some genes are represented by several probesets, each of which may interrogate a different mRNA transcript of that gene. So I think it is wiser not to average across probesets mapping to the same gene: they may recognize alternative transcripts or splice forms, whose expression need not correlate.

Personally, with the Affymetrix and Illumina oligonucleotide arrays I have worked with, I used the median absolute deviation (MAD), a measure of dispersion that is robust to outliers, and kept the probeset with the highest MAD for each gene. So you can use the following example:

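Probesets=paste("a",1:200,sep="") # "fake probesets"
Genes=sample(letters,200,replace=TRUE) # fake gene symbols, with duplicates
Value=rnorm(200) # stands in for the MAD of each probeset
X=data.frame(Probesets,Genes,Value)
X=X[order(X$Value,decreasing=TRUE),] # highest value first
Y=X[which(!duplicated(X$Genes)),] # keep the first (highest) row per gene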

So in the end, Y is a data.frame with unique gene symbols, each linked to the probeset that has the highest MAD among its duplicates.
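As a rough sketch of the same idea applied to a whole expression matrix rather than a single column of values: the matrix `expr`, the mapping `genes`, and all other object names below are made up for illustration. With a real ExpressionSet, the matrix would come from `Biobase::exprs()` and the gene symbols from the platform annotation.

```r
set.seed(1)  # reproducible toy data

# Hypothetical expression matrix: 200 probesets (rows) x 10 samples (columns)
expr <- matrix(rnorm(200 * 10), nrow = 200,
               dimnames = list(paste0("a", 1:200), paste0("s", 1:10)))

# Hypothetical probeset-to-gene mapping (fake gene symbols, with duplicates)
genes <- sample(letters, 200, replace = TRUE)

# MAD of each probeset across the samples
mad_per_probe <- apply(expr, 1, mad)

# Order probesets by decreasing MAD, then keep the first one seen per gene
ord  <- order(mad_per_probe, decreasing = TRUE)
keep <- ord[!duplicated(genes[ord])]

# One row per gene, labelled by gene symbol
expr_by_gene <- expr[keep, , drop = FALSE]
rownames(expr_by_gene) <- genes[keep]
```

Probesets without a gene annotation would need to be dropped before this step, since `duplicated()` would otherwise treat all the missing symbols as one gene.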

To sum up, the best option is certainly more complex, and it also depends on the study and on the tools you use.
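For comparison, the average or median option mentioned above could look roughly like this in base R. This is only a sketch on toy data; `expr`, `genes`, and the result names are hypothetical. The limma package also provides `avereps()` for averaging rows that share an ID, which may be more convenient on real data.

```r
set.seed(2)  # reproducible toy data

# Hypothetical matrix: 6 probesets x 3 samples, mapping to 3 genes
expr <- matrix(rnorm(6 * 3), nrow = 6,
               dimnames = list(paste0("p", 1:6), paste0("s", 1:3)))
genes <- c("G1", "G1", "G2", "G3", "G3", "G3")

# Mean per gene: sum the rows within each gene, then divide by
# the number of probesets for that gene
sums      <- rowsum(expr, group = genes)
counts    <- as.vector(table(genes)[rownames(sums)])
gene_mean <- sums / counts

# Median per gene, computed sample by sample
gene_median <- apply(expr, 2, function(x) tapply(x, genes, median))
```

Both results have one row per gene and one column per sample. Keep in mind the caveat above: if the probesets target different transcripts, a mean or median may blur real differences.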


You can also check this post for more "formal options": A: eset annotation issues, plus generate heatmap with correct gene symbol as row la


I also forgot to mention that you can obtain the Value argument, that is, the MAD of each probeset, from your expression set like this: