Question: Duplicate gene ID's returned from limma microarray analysis
0
2.9 years ago by
mat14940
mat14940 wrote:

Hello,

I am using the limma package to detect differentially expressed probesets between three groups of samples (knockdown, rescue, and control). When I call topTable, probesets with the same gene symbol are returned that also have (near) identical fold changes and p-values. I would like to remove these duplicate probesets so that each gene is represented by only one probeset. I am unsure how to proceed with the analysis: these probesets do not have different accession numbers, and keeping multiples does not seem informative. Can anyone provide a way to remove these "extra" probesets, or a reference that would help me solve this issue? The code I am using is attached below, as well as an example of the topTable results. Thanks for any help you can provide.

- Matt

library(limma)
# f (factor of sample groups) and data.fit (presumably the lmFit() result) are not shown in the post
design <- model.matrix(~ 0 + f)
colnames(design) <- c("control", "morphant", "rescue")
contrast.matrix <- makeContrasts(morphant-control, rescue-morphant, rescue-control, levels=design)
data.fit.con <- contrasts.fit(data.fit, contrast.matrix)
data.fit.eb <- eBayes(data.fit.con, trend=TRUE)

PROBEID   ID                  SYMBOL           GENENAME                                                 ENTREZID  logFC     AveExpr   t         P.Value   adj.P.Val  B
13217667  BC067708            aamp             angio-associated, migratory cell protein                 405874    0.81886   6.472058  5.203286  8.44E-05  0.003206   1.561555
13063164  NM_001044310        aars             alanyl-tRNA synthetase                                   324940    1.755493  8.443648  8.862132  1.33E-07  5.31E-05   7.861228
13282364  BC074030            abca4a           ATP-binding cassette, sub-family A (ABC1), member 4a     798993    -1.12315  5.229148  -6.05818  1.59E-05  0.001102   3.2049
13284182  BC074030            abca4a           ATP-binding cassette, sub-family A (ABC1), member 4a     798993    -1.12315  5.229148  -6.05818  1.59E-05  0.001102   3.2049
13079785  XM_678031           abca4b           ATP-binding cassette, sub-family A (ABC1), member 4b     555506    -1.30577  5.217277  -5.98837  1.82E-05  0.001187   3.074387
13156949  NM_001172647        abcc8            ATP-binding cassette, sub-family C (CFTR/MRP), member 8  553281    -0.88072  6.273975  -4.98922  0.00013   0.004351   1.135853
13075730  BC068351            abcf1            ATP-binding cassette, sub-family F (GCN20), member 1     406467    1.968939  7.719813  4.911665  0.000152  0.004842   0.980408
13018254  BC139542            abcg1            ATP-binding cassette, sub-family G (WHITE), member 1     556979    -0.74389  4.904658  -4.57459  0.000304  0.007725   0.298189
13161486  BC124444            abhd2b           abhydrolase domain containing 2b                         559290    -0.87335  5.137598  -4.58277  0.000299  0.007636   0.314847
13281306  ENSDART00000143986  ABI3BP (2 of 2)  ABI family, member 3 (NESH) binding protein              #N/A      -1.2779   5.590848  -5.43329  5.34E-05  0.002326   2.013004
13276814  ENSDART00000133367  ablim1b          actin binding LIM protein 1b                             541550    0.892601  8.408559  6.322132  9.69E-06  0.000794   3.692016
13276806  ENSDART00000133367  ablim1b          actin binding LIM protein 1b                             541550    0.839638  8.040964  4.526412  0.000336  0.008245   0.199901


Can you please explain what microarray platform this is and how it has been processed? For most microarray platforms it is virtually impossible to get identical results for two different probes, as you seem to have here, even if they relate to the same gene.

Also I note that the table of DE results you show cannot be the output from the topTable() call immediately above it, because the table is sorted alphabetically by symbol instead of by significance.

Answer: Duplicate gene ID's returned from limma microarray analysis
1
2.9 years ago by
Gordon Smyth
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth wrote:

When I want to remove duplicate probes for the same gene symbol, I usually just keep the one with the largest overall expression value. You can do this by:

o <- order(data.fit$Amean, decreasing=TRUE)
data.fit2 <- data.fit[o,]
d <- duplicated(data.fit2$genes$SYMBOL)
data.fit2 <- data.fit2[!d,]

Now you can continue with

data.fit2.con <- contrasts.fit(data.fit2, contrast.matrix)

etc.
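
To see what the ordering and duplicated() steps above do, here is a minimal base-R sketch on a toy table rather than a real limma fit object. The data frame `probes` and its column layout are hypothetical; the probeset IDs and symbols are borrowed from the question, and the `Amean` values are rounded AveExpr values from the posted table.

```r
# Toy stand-in for the fit's annotation plus average expression.
# 'probes' is a hypothetical object, not something limma creates.
probes <- data.frame(
  PROBEID = c("13282364", "13284182", "13276814", "13276806", "13063164"),
  SYMBOL  = c("abca4a", "abca4a", "ablim1b", "ablim1b", "aars"),
  Amean   = c(5.23, 5.23, 8.41, 8.04, 8.44),
  stringsAsFactors = FALSE
)

# Sort so the most highly expressed probeset for each symbol comes first
o <- order(probes$Amean, decreasing = TRUE)
sorted <- probes[o, ]

# duplicated() flags every occurrence after the first, so dropping the
# flagged rows keeps the top-expressed probeset per symbol
keep <- sorted[!duplicated(sorted$SYMBOL), ]
print(keep)
```

One thing to watch for: duplicated() treats repeated NA values as duplicates of each other, so if some probesets have no symbol, all but one of them would be dropped by this approach. You may want to handle unannotated probesets separately first.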

Thank you for your insight. It is an Affymetrix Zebrafish Gene 1.1 ST array strip. I wrote the topTable output to an .xlsx file and sorted it alphabetically, then copied and pasted the first few probesets just for illustration purposes.

edit:

I processed the dataset with RMA ("core") and derived annotations from 'affycoretools' for pd.zebgene.1.1.st. Gene symbols were mapped to the Entrez IDs available in org.Dr.eg.db. I can paste the full code if you would like.

1

You might also consider using getMainProbes. The two probesets with identical results (13282364 and 13284182) are made up of the same probes, which is why the results are identical. But both probesets are so-called 'rescue' probesets that are, um, used to rescue things and whatnot.

There are any number of probesets on the random primer based arrays like this (that are not 'main' probesets), and that have some mysterious purpose, and are often not annotated. I usually just get rid of them, which is why getMainProbes exists.

Answer: Duplicate gene ID's returned from limma microarray analysis
0
2.9 years ago by
mat14940
mat14940 wrote:

Thank you for your comment, James. I have implemented both getMainProbes and Gordon's suggested code into my analysis and it has really helped to "clean up" my dataset.

Problem solved!