Search
Question: Duplicate gene ID's returned from limma microarray analysis
0
gravatar for mat149
8 days ago by
mat14910
mat14910 wrote:

Hello,

 

I am using the limma package to detect differentially expressed probesets between three groups of samples (knockdown, rescue, and control). When I pass my topTable arguement, probesets with the same gene symbol identifier are returned which also have (near) identical fold changes + p.values.  I would like to remove multiplicates of these probesets such that these genes are represented by only one probeset.  I am unsure on how to proceed with the analysis -  these probesets do not have different accession numbers and keeping multiples does not seem informative.  Can anyone provide me with a means to remove these "extra" probesets or provide a reference to help me solve this issue?  The code I am using is attached below as well as an example of the topTable results.  Thanks for any help you can provide.

- Matt

 

library(limma)
design = model.matrix(~ 0 + f)
colnames(design)=c("control","morphant","rescue")
contrast.matrix <- makeContrasts(morphant-control,rescue-morphant,rescue-control,levels=design)
data.fit.con <- contrasts.fit(data.fit,contrast.matrix)
data.fit.eb <- eBayes(data.fit.con,trend=TRUE)

MOtabWT <- topTable(data.fit.eb,coef=1,number=Inf,adjust="BH",p.value=0.01,lfc=0.5)

 

 

PROBEID ID SYMBOL GENENAME ENTREZID logFC AveExpr t P.Value adj.P.Val B
13217667 BC067708 aamp angio-associated, migratory cell protein 405874 0.81886 6.472058 5.203286 8.44E-05 0.003206 1.561555
13063164 NM_001044310 aars alanyl-tRNA synthetase 324940 1.755493 8.443648 8.862132 1.33E-07 5.31E-05 7.861228
13282364 BC074030 abca4a ATP-binding cassette, sub-family A (ABC1), member 4a 798993 -1.12315 5.229148 -6.05818 1.59E-05 0.001102 3.2049
13284182 BC074030 abca4a ATP-binding cassette, sub-family A (ABC1), member 4a 798993 -1.12315 5.229148 -6.05818 1.59E-05 0.001102 3.2049
13079785 XM_678031 abca4b ATP-binding cassette, sub-family A (ABC1), member 4b 555506 -1.30577 5.217277 -5.98837 1.82E-05 0.001187 3.074387
13156949 NM_001172647 abcc8 ATP-binding cassette, sub-family C (CFTR/MRP), member 8 553281 -0.88072 6.273975 -4.98922 0.00013 0.004351 1.135853
13075730 BC068351 abcf1 ATP-binding cassette, sub-family F (GCN20), member 1 406467 1.968939 7.719813 4.911665 0.000152 0.004842 0.980408
13018254 BC139542 abcg1 ATP-binding cassette, sub-family G (WHITE), member 1 556979 -0.74389 4.904658 -4.57459 0.000304 0.007725 0.298189
13161486 BC124444 abhd2b abhydrolase domain containing 2b 559290 -0.87335 5.137598 -4.58277 0.000299 0.007636 0.314847
13281306 ENSDART00000143986 ABI3BP (2 of 2) ABI family, member 3 (NESH) binding protein #N/A -1.2779 5.590848 -5.43329 5.34E-05 0.002326 2.013004
13276814 ENSDART00000133367 ablim1b actin binding LIM protein 1b 541550 0.892601 8.408559 6.322132 9.69E-06 0.000794 3.692016
13276806 ENSDART00000133367 ablim1b actin binding LIM protein 1b 541550 0.839638 8.040964 4.526412 0.000336 0.008245 0.199901

 

 

 

ADD COMMENTlink modified 7 days ago • written 8 days ago by mat14910

Can you please explain what microarray platform this is and how it has been processed? For most microarray platforms it is virtually impossible to get identical results for two different probes, as you seem to have here, even if they relate to the same gene.

Also I note that the table of DE results you show cannot be the output from the topTable() call immediately above it, because the table is sorted alphabetically by symbol instead of by significance.

ADD REPLYlink modified 7 days ago • written 7 days ago by Gordon Smyth29k
1
gravatar for Gordon Smyth
7 days ago by
Gordon Smyth29k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth29k wrote:

When I want to remove duplicate probes for the same gene symbol, I usually just keep the one with the largest overall expression value. You can do this by:

o <- order(data.fit$Amean, decreasing=TRUE)
data.fit2 <- data.fit[o,]
d <- duplicated(data.fit2$genes$SYMBOL)
data.fit2 <- data.fit2[!d,]

Now you can continue with

data.fit2.con <- contasts.fit(data.fit2, contrast.matrix)

etc.

ADD COMMENTlink written 7 days ago by Gordon Smyth29k

Thank you for your insight. It is an Affymetrix 1.1st zebrafish gene array strip.  I wrote the toptable out to an .xlsx and sorted them alphabetically, then copied/pasted the first few probesets just for illustration purposes.

 

 

edit:

I processed the dataset with RMA ("core") and derived annotations from 'affycoretools' for pd.zebgene.1.1.st. Gene symbols were mapped to Entrez ID's available in org.Dr.eg.db.  I can paste the full code if you would like

 

 

ADD REPLYlink modified 7 days ago • written 7 days ago by mat14910
1

You might also consider using getMainProbes. The two probesets with identical results (13282364 and 13284182) are made up of the same probes, which is why the results are identical. But both probesets are so-called 'rescue' probesets that are, um, used to rescue things and whatnot.

There are any number of probesets on the random primer based arrays like this (that are not 'main' probesets), and that have some mysterious purpose, and are often not annotated. I usually just get rid of them, which is why getMainProbes exists.

ADD REPLYlink modified 7 days ago • written 7 days ago by James W. MacDonald41k
0
gravatar for mat149
7 days ago by
mat14910
mat14910 wrote:

Thank you for your comment, James. I have implemented both getMainProbes and Gordon's suggested code into my analysis and it has really helped to "clean up" my dataset.

Problem solved!

ADD COMMENTlink written 7 days ago by mat14910
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 104 users visited in the last hour