Question

Duplicate gene ID's returned from limma microarray analysis

0

Entering edit mode

mat149 ▴ 80

@mat149-11450

Last seen 24 days ago

United States

Hello,

I am using the limma package to detect differentially expressed probesets between three groups of samples (knockdown, rescue, and control). When I pass my topTable arguement, probesets with the same gene symbol identifier are returned which also have (near) identical fold changes + p.values. I would like to remove multiplicates of these probesets such that these genes are represented by only one probeset. I am unsure on how to proceed with the analysis - these probesets do not have different accession numbers and keeping multiples does not seem informative. Can anyone provide me with a means to remove these "extra" probesets or provide a reference to help me solve this issue? The code I am using is attached below as well as an example of the topTable results. Thanks for any help you can provide.

- Matt

library(limma)
design = model.matrix(~ 0 + f)
colnames(design)=c("control","morphant","rescue")
contrast.matrix <- makeContrasts(morphant-control,rescue-morphant,rescue-control,levels=design)
data.fit.con <- contrasts.fit(data.fit,contrast.matrix)
data.fit.eb <- eBayes(data.fit.con,trend=TRUE)

MOtabWT <- topTable(data.fit.eb,coef=1,number=Inf,adjust="BH",p.value=0.01,lfc=0.5)

PROBEID	ID	SYMBOL	GENENAME	ENTREZID	logFC	AveExpr	t	P.Value	adj.P.Val	B
13217667	BC067708	aamp	angio-associated, migratory cell protein	405874	0.81886	6.472058	5.203286	8.44E-05	0.003206	1.561555
13063164	NM_001044310	aars	alanyl-tRNA synthetase	324940	1.755493	8.443648	8.862132	1.33E-07	5.31E-05	7.861228
13282364	BC074030	abca4a	ATP-binding cassette, sub-family A (ABC1), member 4a	798993	-1.12315	5.229148	-6.05818	1.59E-05	0.001102	3.2049
13284182	BC074030	abca4a	ATP-binding cassette, sub-family A (ABC1), member 4a	798993	-1.12315	5.229148	-6.05818	1.59E-05	0.001102	3.2049
13079785	XM_678031	abca4b	ATP-binding cassette, sub-family A (ABC1), member 4b	555506	-1.30577	5.217277	-5.98837	1.82E-05	0.001187	3.074387
13156949	NM_001172647	abcc8	ATP-binding cassette, sub-family C (CFTR/MRP), member 8	553281	-0.88072	6.273975	-4.98922	0.00013	0.004351	1.135853
13075730	BC068351	abcf1	ATP-binding cassette, sub-family F (GCN20), member 1	406467	1.968939	7.719813	4.911665	0.000152	0.004842	0.980408
13018254	BC139542	abcg1	ATP-binding cassette, sub-family G (WHITE), member 1	556979	-0.74389	4.904658	-4.57459	0.000304	0.007725	0.298189
13161486	BC124444	abhd2b	abhydrolase domain containing 2b	559290	-0.87335	5.137598	-4.58277	0.000299	0.007636	0.314847
13281306	ENSDART00000143986	ABI3BP (2 of 2)	ABI family, member 3 (NESH) binding protein	#N/A	-1.2779	5.590848	-5.43329	5.34E-05	0.002326	2.013004
13276814	ENSDART00000133367	ablim1b	actin binding LIM protein 1b	541550	0.892601	8.408559	6.322132	9.69E-06	0.000794	3.692016
13276806	ENSDART00000133367	ablim1b	actin binding LIM protein 1b	541550	0.839638	8.040964	4.526412	0.000336	0.008245	0.199901

limma microarray • 5.4k views

ADD COMMENT • link updated 2.6 years ago by Gordon Smyth 51k • written 7.9 years ago by mat149 ▴ 80

0

Entering edit mode

Can you please explain what microarray platform this is and how it has been processed? For most microarray platforms it is virtually impossible to get identical results for two different probes, as you seem to have here, even if they relate to the same gene.

Also I note that the table of DE results you show cannot be the output from the topTable() call immediately above it, because the table is sorted alphabetically by symbol instead of by significance.

ADD REPLY • link 7.9 years ago Gordon Smyth 51k

score 2 · Answer 1 · 2017-01-10

2

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 12 minutes ago

WEHI, Melbourne, Australia

When I want to remove duplicate probes for the same gene symbol, I usually just keep the one with the largest overall expression value. You can do this by:

o <- order(data.fit$Amean, decreasing=TRUE)
data.fit2 <- data.fit[o,]
d <- duplicated(data.fit2$genes$SYMBOL)
data.fit2 <- data.fit2[!d,]

Now you can continue with

data.fit2.con <- contasts.fit(data.fit2, contrast.matrix)

etc.

ADD COMMENT • link 7.9 years ago Gordon Smyth 51k

0

Entering edit mode

Thank you for your insight. It is an Affymetrix 1.1st zebrafish gene array strip. I wrote the toptable out to an .xlsx and sorted them alphabetically, then copied/pasted the first few probesets just for illustration purposes.

edit:

I processed the dataset with RMA ("core") and derived annotations from 'affycoretools' for pd.zebgene.1.1.st. Gene symbols were mapped to Entrez ID's available in org.Dr.eg.db. I can paste the full code if you would like

ADD REPLY • link 7.9 years ago mat149 ▴ 80

1

Entering edit mode

You might also consider using getMainProbes. The two probesets with identical results (13282364 and 13284182) are made up of the same probes, which is why the results are identical. But both probesets are so-called 'rescue' probesets that are, um, used to rescue things and whatnot.

There are any number of probesets on the random primer based arrays like this (that are not 'main' probesets), and that have some mysterious purpose, and are often not annotated. I usually just get rid of them, which is why getMainProbes exists.

ADD REPLY • link 7.9 years ago James W. MacDonald 67k

0

Entering edit mode

Thank you Gordon Smyth and James W. MacDonald for your replies

I know it's been a while since you wrote this answer, but I'm a bit new to analysing microarrays (It's my first microarray analysis). Following your steps, I'm stuck on the data.fit2@genes$SYMBOL part, because I don't have any genes$SYMBOL column. How can I add it?

I already did the gene annotation with:

eset <- annotateEset(norm_data, pd.hta.2.0) 
eset <- getMainProbes(eset)

In another guide I have read that I can use the Toptable function. What do you think?

res.a <- topTable(data.fit.eb, sort.by = "A")

In ?topTable said:

Sort by: character string specifying which statistic to rank the genes by. Possible values for topTable are "logFC", "AveExpr", "t", "P", "p", "B" or "none". (Permitted synonyms are "M" for "logFC", "A" or "Amean" for "AveExpr", "T" for "t" and "p" for "P".) Possible values for topTableF are "F" or "none". topTreat accepts the same values as topTable except for "B".

Thank you in advance.

ADD REPLY • link 2.6 years ago yaxade2481 • 0

1

Entering edit mode

You use topTable to get the top set of genes, which is a step you do after fitting the model (which comes after removing duplicate genes). So it's a thing you will do, but later in the process.

You also have the two steps switched. You should do

eset <- getMainProbes(norm_data)
eset <- annotateEset(eset, pd.hta.2.0)

And then you can do the rest of the analysis. But do note that your code for checking the MArrayLM object for the genes is wrong. You want to use data.fit2$genes$SYMBOL (two $ accessors), not data.fit2@genes$symbol (an @ accessor then a $ accessor) because the MArrayLM object is just a list, not an S4 object.

ADD REPLY • link 2.6 years ago James W. MacDonald 67k

0

Entering edit mode

Thank you James for your response. I have already made the change you told me, but I still can't find the data.fit2$genes$SYMBOL in data.fit2

Following the instructions I found on the internet and in your post, the analysis should look like this (My analysis is a paired analysis. )

eset <- getMainProbes(norm_data)
eset <- annotateEset(eset, pd.hta.2.0)

Add patient data to Phenom Data

pData(eset)[,2]=rep(c("Before", "Treated"),63)
colnames(pData(eset))[2]="Treatment"

Add info from same patient:

pData(eset)[,3]=list_of_patients #(P1,P1,P2,P2,P3,P3...etc)
colnames(pData(eset))[3]="Patients"

Factor the variables we have created

fp= factor(pData(eset)[,3],levels = c(unique(pData(eset)[,3])))
ft=factor(pData(eset)[,2],levels = c("Before","Treated"))

Create the design

paired.design = model.matrix(~ fp + ft)
data.fit <- lmFit(exprs(eset), paired.design)

Up to here everything is fine. It is here where the created variable data.fit, is supposed to have the genes$SYMBOL but I haven't got it.

o <- order(data.fit$Amean, decreasing=TRUE)
data.fit2 <- data.fit[o,]
d <- duplicated(data.fit2$genes$SYMBOL) #Doens't Exist and D=logical (empty)

Do you know how I can add the SYMBOL information to that dataframe?

Thank you in advance.

ADD REPLY • link 2.6 years ago yaxade2481 • 0

1

Entering edit mode

You have removed all the annotation from the object by using exprs(eset). You are supposed to use simply:

data.fit <- lmFit(eset, paired.design)

ADD REPLY • link 2.6 years ago Gordon Smyth 51k

0

Entering edit mode

YES! That definitely worked!. What an idiot I am...

Thank you both for the great work you do in helping students around the world. Greetings from Spain.

ADD REPLY • link 2.6 years ago yaxade2481 • 0

0

Entering edit mode

Glad it worked. You can see what annotation is in the object by typing

head(data.fit$genes)

ADD REPLY • link 2.6 years ago Gordon Smyth 51k

score 0 · Answer 2 · 2017-01-10

0

Entering edit mode

mat149 ▴ 80

@mat149-11450

Last seen 24 days ago

United States

Thank you for your comment, James. I have implemented both getMainProbes and Gordon's suggested code into my analysis and it has really helped to "clean up" my dataset.

Problem solved!

ADD COMMENT • link 7.9 years ago mat149 ▴ 80