Duplicate gene ID's returned from limma microarray analysis
2
0
Entering edit mode
mat149 ▴ 70
@mat149-11450
Last seen 21 days ago
United States

Hello,

 

I am using the limma package to detect differentially expressed probesets between three groups of samples (knockdown, rescue, and control). When I pass my topTable arguement, probesets with the same gene symbol identifier are returned which also have (near) identical fold changes + p.values.  I would like to remove multiplicates of these probesets such that these genes are represented by only one probeset.  I am unsure on how to proceed with the analysis -  these probesets do not have different accession numbers and keeping multiples does not seem informative.  Can anyone provide me with a means to remove these "extra" probesets or provide a reference to help me solve this issue?  The code I am using is attached below as well as an example of the topTable results.  Thanks for any help you can provide.

- Matt

 

library(limma)
design = model.matrix(~ 0 + f)
colnames(design)=c("control","morphant","rescue")
contrast.matrix <- makeContrasts(morphant-control,rescue-morphant,rescue-control,levels=design)
data.fit.con <- contrasts.fit(data.fit,contrast.matrix)
data.fit.eb <- eBayes(data.fit.con,trend=TRUE)

MOtabWT <- topTable(data.fit.eb,coef=1,number=Inf,adjust="BH",p.value=0.01,lfc=0.5)

 

 

PROBEID ID SYMBOL GENENAME ENTREZID logFC AveExpr t P.Value adj.P.Val B
13217667 BC067708 aamp angio-associated, migratory cell protein 405874 0.81886 6.472058 5.203286 8.44E-05 0.003206 1.561555
13063164 NM_001044310 aars alanyl-tRNA synthetase 324940 1.755493 8.443648 8.862132 1.33E-07 5.31E-05 7.861228
13282364 BC074030 abca4a ATP-binding cassette, sub-family A (ABC1), member 4a 798993 -1.12315 5.229148 -6.05818 1.59E-05 0.001102 3.2049
13284182 BC074030 abca4a ATP-binding cassette, sub-family A (ABC1), member 4a 798993 -1.12315 5.229148 -6.05818 1.59E-05 0.001102 3.2049
13079785 XM_678031 abca4b ATP-binding cassette, sub-family A (ABC1), member 4b 555506 -1.30577 5.217277 -5.98837 1.82E-05 0.001187 3.074387
13156949 NM_001172647 abcc8 ATP-binding cassette, sub-family C (CFTR/MRP), member 8 553281 -0.88072 6.273975 -4.98922 0.00013 0.004351 1.135853
13075730 BC068351 abcf1 ATP-binding cassette, sub-family F (GCN20), member 1 406467 1.968939 7.719813 4.911665 0.000152 0.004842 0.980408
13018254 BC139542 abcg1 ATP-binding cassette, sub-family G (WHITE), member 1 556979 -0.74389 4.904658 -4.57459 0.000304 0.007725 0.298189
13161486 BC124444 abhd2b abhydrolase domain containing 2b 559290 -0.87335 5.137598 -4.58277 0.000299 0.007636 0.314847
13281306 ENSDART00000143986 ABI3BP (2 of 2) ABI family, member 3 (NESH) binding protein #N/A -1.2779 5.590848 -5.43329 5.34E-05 0.002326 2.013004
13276814 ENSDART00000133367 ablim1b actin binding LIM protein 1b 541550 0.892601 8.408559 6.322132 9.69E-06 0.000794 3.692016
13276806 ENSDART00000133367 ablim1b actin binding LIM protein 1b 541550 0.839638 8.040964 4.526412 0.000336 0.008245 0.199901

 

 

 

limma microarray • 4.9k views
ADD COMMENT
0
Entering edit mode

Can you please explain what microarray platform this is and how it has been processed? For most microarray platforms it is virtually impossible to get identical results for two different probes, as you seem to have here, even if they relate to the same gene.

Also I note that the table of DE results you show cannot be the output from the topTable() call immediately above it, because the table is sorted alphabetically by symbol instead of by significance.

ADD REPLY
2
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

When I want to remove duplicate probes for the same gene symbol, I usually just keep the one with the largest overall expression value. You can do this by:

o <- order(data.fit$Amean, decreasing=TRUE)
data.fit2 <- data.fit[o,]
d <- duplicated(data.fit2$genes$SYMBOL)
data.fit2 <- data.fit2[!d,]

Now you can continue with

data.fit2.con <- contasts.fit(data.fit2, contrast.matrix)

etc.

ADD COMMENT
0
Entering edit mode

Thank you for your insight. It is an Affymetrix 1.1st zebrafish gene array strip.  I wrote the toptable out to an .xlsx and sorted them alphabetically, then copied/pasted the first few probesets just for illustration purposes.

 

 

edit:

I processed the dataset with RMA ("core") and derived annotations from 'affycoretools' for pd.zebgene.1.1.st. Gene symbols were mapped to Entrez ID's available in org.Dr.eg.db.  I can paste the full code if you would like

 

 

ADD REPLY
1
Entering edit mode

You might also consider using getMainProbes. The two probesets with identical results (13282364 and 13284182) are made up of the same probes, which is why the results are identical. But both probesets are so-called 'rescue' probesets that are, um, used to rescue things and whatnot.

There are any number of probesets on the random primer based arrays like this (that are not 'main' probesets), and that have some mysterious purpose, and are often not annotated. I usually just get rid of them, which is why getMainProbes exists.

ADD REPLY
0
Entering edit mode

Thank you Gordon Smyth and James W. MacDonald for your replies

I know it's been a while since you wrote this answer, but I'm a bit new to analysing microarrays (It's my first microarray analysis). Following your steps, I'm stuck on the data.fit2@genes$SYMBOL part, because I don't have any genes$SYMBOL column. How can I add it?

I already did the gene annotation with:

eset <- annotateEset(norm_data, pd.hta.2.0) 
eset <- getMainProbes(eset)

In another guide I have read that I can use the Toptable function. What do you think?

res.a <- topTable(data.fit.eb, sort.by = "A")

In ?topTable said:

Sort by: character string specifying which statistic to rank the genes by. Possible values for topTable are "logFC", "AveExpr", "t", "P", "p", "B" or "none". (Permitted synonyms are "M" for "logFC", "A" or "Amean" for "AveExpr", "T" for "t" and "p" for "P".) Possible values for topTableF are "F" or "none". topTreat accepts the same values as topTable except for "B".

Thank you in advance.

ADD REPLY
1
Entering edit mode

You use topTable to get the top set of genes, which is a step you do after fitting the model (which comes after removing duplicate genes). So it's a thing you will do, but later in the process.

You also have the two steps switched. You should do

eset <- getMainProbes(norm_data)
eset <- annotateEset(eset, pd.hta.2.0)

And then you can do the rest of the analysis. But do note that your code for checking the MArrayLM object for the genes is wrong. You want to use data.fit2$genes$SYMBOL (two $ accessors), not data.fit2@genes$symbol (an @ accessor then a $ accessor) because the MArrayLM object is just a list, not an S4 object.

ADD REPLY
0
Entering edit mode

Thank you James for your response. I have already made the change you told me, but I still can't find the data.fit2$genes$SYMBOL in data.fit2

Following the instructions I found on the internet and in your post, the analysis should look like this (My analysis is a paired analysis. )

eset <- getMainProbes(norm_data)
eset <- annotateEset(eset, pd.hta.2.0)

Add patient data to Phenom Data

pData(eset)[,2]=rep(c("Before", "Treated"),63)
colnames(pData(eset))[2]="Treatment"

Add info from same patient:

pData(eset)[,3]=list_of_patients #(P1,P1,P2,P2,P3,P3...etc)
colnames(pData(eset))[3]="Patients"

Factor the variables we have created

fp= factor(pData(eset)[,3],levels = c(unique(pData(eset)[,3])))
ft=factor(pData(eset)[,2],levels = c("Before","Treated"))

Create the design

paired.design = model.matrix(~ fp + ft)
data.fit <- lmFit(exprs(eset), paired.design)

Up to here everything is fine. It is here where the created variable data.fit, is supposed to have the genes$SYMBOL but I haven't got it.

o <- order(data.fit$Amean, decreasing=TRUE)
data.fit2 <- data.fit[o,]
d <- duplicated(data.fit2$genes$SYMBOL) #Doens't Exist and D=logical (empty)

Do you know how I can add the SYMBOL information to that dataframe?

Thank you in advance.

ADD REPLY
1
Entering edit mode

You have removed all the annotation from the object by using exprs(eset). You are supposed to use simply:

data.fit <- lmFit(eset, paired.design)
ADD REPLY
0
Entering edit mode

YES! That definitely worked!. What an idiot I am...

Thank you both for the great work you do in helping students around the world. Greetings from Spain.

ADD REPLY
0
Entering edit mode

Glad it worked. You can see what annotation is in the object by typing

head(data.fit$genes)
ADD REPLY
0
Entering edit mode
mat149 ▴ 70
@mat149-11450
Last seen 21 days ago
United States

Thank you for your comment, James. I have implemented both getMainProbes and Gordon's suggested code into my analysis and it has really helped to "clean up" my dataset.

Problem solved!

ADD COMMENT

Login before adding your answer.

Traffic: 937 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6