2.0 years ago by
Greece/Athens/National Hellenic Research Foundation
First, which specific platform does your experiment use? Second, handling duplicate probesets (I assume you mean probesets after normalization), that is, probesets that map to the same gene, is a complex procedure: there are many options, and deciding which one is most appropriate remains challenging. For instance, you could use the average or the median value of the duplicated probesets.

On the other hand, in my opinion (and other people will probably offer more useful or alternative suggestions), put simply, every probeset represents a gene, since every probeset (with its associated probes) interrogates an expressed sequence. Some probes may not be annotated to any gene, or may match multiple potential target sequences, and, most importantly, some genes are represented by several probesets, each of which most likely interrogates a different mRNA transcript of that gene. Thus I think it is wiser not to average across different probesets that map to the same gene: they may recognize different transcripts (alternative transcripts or splice forms), whose expression need not correlate.

Personally, with the Affymetrix and Illumina oligonucleotide arrays I have worked with, I used the median absolute deviation (MAD), a measure of dispersion that is robust to outliers, and kept for each gene the probeset with the highest MAD. So you can use the following example:
Probesets=paste("a",1:200,sep="") # "fake probesets"
So in the end, Y is a data.frame containing unique gene symbols, each linked to the probeset that has the highest MAD among its duplicates.
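Since only the first line of the example appears above, here is a minimal, self-contained sketch of how such a MAD-based selection could look. The gene symbols, the expression matrix `Expr`, the sample names and the intermediate object `X` are made up for illustration and are not part of the original example; only `Probesets` and the final `Y` follow the names used in this answer.

```r
set.seed(1)  # reproducible fake data

# Fake data: 200 probesets mapped onto at most 50 gene symbols, so that
# several probesets share a gene (all names here are hypothetical).
Probesets <- paste("a", 1:200, sep = "")  # "fake probesets"
Genes <- sample(paste("gene", 1:50, sep = ""), 200, replace = TRUE)
Expr <- matrix(rnorm(200 * 6), nrow = 200,
               dimnames = list(Probesets, paste("s", 1:6, sep = "")))

# MAD of each probeset across samples (stats::mad)
Mad <- apply(Expr, 1, mad)

X <- data.frame(Probeset = Probesets, Gene = Genes, MAD = Mad,
                stringsAsFactors = FALSE)

# Sort by decreasing MAD, then keep the first row per gene,
# which is the probeset with the highest MAD for that gene
X <- X[order(X$MAD, decreasing = TRUE), ]
Y <- X[!duplicated(X$Gene), ]

# Expression matrix restricted to the selected probesets, one row per gene
ExprUnique <- Expr[Y$Probeset, , drop = FALSE]
rownames(ExprUnique) <- Y$Gene
```

With real data you would replace `Expr` with your normalized expression matrix and `Genes` with the symbols from your platform's annotation package, dropping probesets without a symbol before this step.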
To sum up, there is no single best option: the right choice is more complex than this and depends on the study and on the tools you use.