Question

Map between Manufacturer Identifiers and Gene Symbol

maedakus ▴ 10

@maedakus-9484

Hi all,

I am very new to Bioconductor and am now analysing RNA data based on Illumina TruSeq DNA.

The data set I am analysing is GSE36924.

When mapping Illumina IDs (e.g. ILMN_1651199) to gene symbols, many gene symbols come back as NA.

In other words, the mapping fails for those probes, as shown below.

ILMN_3311170                 SKCG-1
ILMN_3311175                   <NA>
ILMN_3311180                   <NA>
ILMN_3311185                   <NA>
ILMN_3311190              LINC00173

The script is as follows.

library(illuminaHumanv4.db)

genesymbol <- data.frame(Gene = unlist(mget(x = rownames(data), envir = illuminaHumanv4SYMBOL)))

If you know how to get a complete mapping, could you show me?

Thanks in advance.

illuminaHumanv4.db • 4.2k views

ADD COMMENT • link 9.3 years ago • updated 9.2 years ago maedakus ▴ 10

score 1 · Answer 1 · 2016-04-13

James W. MacDonald 68k

@james-w-macdonald-5106

United States

I wouldn't use the mget interface, as that is old tech that doesn't necessarily work as you might expect. Instead I would use mapIds.

symbols <- mapIds(illuminaHumanv4.db, rownames(data), "SYMBOL", "PROBEID")

which will give you the first instance of any multiple mappings. If you want all the gene symbols you can do

symbols <- mapIds(illuminaHumanv4.db, rownames(data), "SYMBOL", "PROBEID", multiVals = "list")

But do note that getting an NA when you use mget can mean one of two things. First, if there are multiple symbols, then mget will return NA, whereas mapIds will not. Second, if the manufacturer doesn't specify a mapping for a given probe to an Entrez Gene ID, then you will get an NA as well. We are just passing on what the manufacturer says, so if they don't map the probe, then we won't either.
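
To see which of these two cases a given NA falls under, one option is to count how many distinct symbols each probe has in a long-format mapping table. The following is only a sketch under assumptions: `classify_probes` is a hypothetical helper (not part of any package), and the toy table, with made-up probe IDs and symbols, stands in for what `select(illuminaHumanv4.db, keys = rownames(data), columns = "SYMBOL", keytype = "PROBEID")` would return.

```r
# Hypothetical helper: classify probes by how many distinct symbols they map to.
# `map` is a long-format data.frame with columns PROBEID and SYMBOL,
# one row per probe/symbol pair, and SYMBOL = NA for probes with no mapping.
classify_probes <- function(map) {
  n_symbols <- vapply(
    split(map$SYMBOL, map$PROBEID),
    function(s) length(unique(s[!is.na(s)])),
    integer(1)
  )
  # ifelse() keeps the probe IDs as names of the result
  ifelse(n_symbols == 0L, "unmapped",
         ifelse(n_symbols == 1L, "unique", "multiple"))
}

# Toy stand-in for the select() result (IDs and symbols are made up)
toy_map <- data.frame(
  PROBEID = c("probe_A", "probe_B", "probe_B", "probe_C"),
  SYMBOL  = c("GENE1", "GENE2", "GENE3", NA),
  stringsAsFactors = FALSE
)

status <- classify_probes(toy_map)
print(table(status))
```

Probes flagged "multiple" would be the ones where mget gives NA but mapIds still returns a symbol; probes flagged "unmapped" have no symbol in the annotation package at all.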

ADD COMMENT • link 9.2 years ago James W. MacDonald 68k

Hi James,

Nice to meet you, and thank you so much.

I would like you to know that I am analysing the public NCBI GEO data set GSE36924.

The probe data were measured with TruSeq RNA total.

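library(GEOquery)
library(illuminaHumanv4.db)

# Read the series matrix and extract the expression values
a7 <- getGEO(filename = "D:/NCBI GEO/......./GSE36924_series_matrix.txt.gz")
data <- exprs(a7)

# Map probe IDs to gene symbols with both interfaces and count the NAs
genesymbol <- data.frame(Gene = unlist(mget(x = rownames(data), envir = illuminaHumanv4SYMBOL)))
symbols <- mapIds(illuminaHumanv4.db, rownames(data), "SYMBOL", "PROBEID")
sum(is.na(genesymbol))
sum(is.na(symbols))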

> sum(is.na(genesymbol))
[1] 13163
> sum(is.na(symbols))
[1] 11437

ADD REPLY • link 9.2 years ago maedakus ▴ 10

score 0 · Answer 2 · 2016-04-14

maedakus ▴ 10

@maedakus-9484

symbols <- mapIds(illuminaHumanv4.db, rownames(data), "SYMBOL", "PROBEID")

Thank you so much for teaching me a new Bioconductor function.

This code works really well and reduces the number of NA mappings, but I still find approximately 10,000 NA probes among 40,000 probes.

Does this really mean that those Illumina IDs are not mapped to any gene? If so, can I delete the data for all of the NA probes?

What do you usually do in such a case?

Thanks.

ADD COMMENT • link 9.2 years ago maedakus ▴ 10

When commenting, please use ADD COMMENT, rather than the Add your answer dialog box, because you aren't adding an answer.

How you deal with un-annotated data is up to you. I never remove a probe because it's un-annotated, only if it is a control. The manufacturer isn't in the business of putting random probes on the array, so they must have had some reason to add it. And if an un-annotated probe comes up as appearing to be really different, I want to know that and may then try to figure out what it is measuring.

As an example, if I download the annotation file for the array you are measuring (from GEO - it's GPL10558) and grep the first NA in the table you show (ILMN_3311175), I get this:

ILMN_3311175    Homo sapiens    RefSeq  XR_079076.1     ILMN_381617     ESP33XR_079076.1        XR_079076.1             100188917       239755206       XR_079076.1     ESP33           ILMN_3311175    6350441 I       1181    ATGTCCCTAAGCCAACACGCTGCTCATCAACTCCGCATTCAAGAAGAGAC                                      "PREDICTED: Homo sapiens hypothetical locus ESP33 (ESP33), miscRNA."            XR_079076.1

So the illuminaHumanv4.db package says it's nothing, but the GEO file says maybe it is something. So it's not as simple as just deleting the NA probes and calling it good.
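
As a rough illustration of that lookup, the probes that come back NA can be matched against the platform table instead of being dropped. This sketch makes several assumptions: the column names `ID`, `Symbol` and `Definition` are assumed to match the GPL10558 table (which could be loaded with GEOquery, e.g. `Table(getGEO("GPL10558"))`), and the one-row `gpl_table` below is a stand-in built from the line shown above, not the real file.

```r
# Symbols as returned for the five probes shown in the question
symbols <- c(
  ILMN_3311170 = "SKCG-1",
  ILMN_3311175 = NA,
  ILMN_3311180 = NA,
  ILMN_3311185 = NA,
  ILMN_3311190 = "LINC00173"
)

# Stand-in for the GEO platform table (assumed column names, one known row)
gpl_table <- data.frame(
  ID         = "ILMN_3311175",
  Symbol     = "ESP33",
  Definition = "PREDICTED: Homo sapiens hypothetical locus ESP33 (ESP33), miscRNA.",
  stringsAsFactors = FALSE
)

# Look up only the probes that the annotation package left as NA
na_ids <- names(symbols)[is.na(symbols)]
hits <- gpl_table[match(na_ids, gpl_table$ID), c("Symbol", "Definition")]
fallback <- data.frame(PROBEID = na_ids, hits,
                       row.names = NULL, stringsAsFactors = FALSE)
print(fallback)

# Fill in what the platform table knows; probes absent from it stay NA
filled <- symbols
filled[na_ids] <- fallback$Symbol
print(filled)
```

Probes that are still NA after this step are kept in the data rather than removed, so that an interesting un-annotated probe can be investigated later.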

ADD REPLY • link 9.2 years ago James W. MacDonald 68k

Dear James,

I am very grateful for your kind support.

I successfully found the gene information for all of the NA probe IDs.

ADD REPLY • link 9.2 years ago maedakus ▴ 10

Hello Maedakus,

Could you tell me how you found the gene information for all of the NA probe IDs?

ADD REPLY • link 8.6 years ago gamal.elkomy • 0