Question

annotation conversion from old microarray to latest Entrez and Ensembl ID

0

Entering edit mode

Didi ▴ 10

@didi-10905

Last seen 2.2 years ago

Spain

Hi,

I'm trying to convert the annotation of an old microarray results to the latest Entrez and Ensembl IDs.

I'm using biomaRt for that.

library(biomaRt)
library(tidyverse)
d=as.character(data[ ,1])
mart.current <- useMart(biomart="ensembl",dataset="mmusculus_gene_ensembl")
att=listAttributes(mart.current)
current.results <- getBM(attributes=c("affy_mogene_1_0_st_v1", "entrezgene_id","ensembl_gene_id"),filter="affy_mogene_1_0_st_v1",values=d, mart=mart.current)
current.results

I'm getting many NAs in both entrezgeneid and ensemblgene_id.

Is it the correct way of doing it? Is there any other way? Can we use annotationHub package? Why am I getting many NAs?

Thanks.

D.

biomaRt annotation Ensembl annotationHub • 2.5k views

ADD COMMENT • link updated 3.6 years ago by Kevin Blighe ★ 3.9k • written 3.6 years ago by Didi ▴ 10

score 0 · Answer 1 · 2020-09-01

0

Entering edit mode

Kevin Blighe ★ 3.9k

@kevin

Last seen 12 days ago

Republic of Ireland

To help, can you share some of the values contained in d?

ADD COMMENT • link 3.6 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

This is the header of d : 10338002 10338008 10338011 10338016 10338018 10338019 10338024 10338027 10338028 10338033 10338034 10338040 10338046 10338051 10338054 10338055 10338061 10338162 10338214 10338244 10338336 10338396 10338446 10338654 10338760 10338890 10338969 10338979 10338988 10338994 10339070

ADD REPLY • link 3.6 years ago Didi ▴ 10

0

Entering edit mode

I see. I am not sure at which level are these IDs (probe or transcript cluster IDs).

You could try either of the following commands to see if you get a better mapping

ids <- c('10338002', '10338008', '10338011', '10338016', '10338018', '10338019', '10338024', '10338027', '10338028', '10338033', '10338034', '10338040', '10338046', '10338051', '10338054', '10338055', '10338061', '10338162', '10338214', '10338244', '10338336', '10338396', '10338446', '10338654', '10338760' '10338890', '10338969', '10338979', '10338988', '10338994', '10339070')

library(mogene10stprobeset.db)
select(mogene10stprobeset.db, keys = ids, columns = c('PROBEID', 'ENSEMBL', 'SYMBOL'))

library(mogene10sttranscriptcluster.db)
select(mogene10sttranscriptcluster.db, keys = ids, columns = c('PROBEID', 'ENSEMBL', 'SYMBOL'))

ADD REPLY • link 3.6 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

Thanks a lot. I'm still having many NAs.

PROBEID            ENSEMBL        SYMBOL

1 10338002 <na> <na> 2 10338008 <na> <na> 3 10338011 <na> <na> 4 10338016 <na> <na> 5 10338018 <na> <na> 6 10338019 <na> <na> 7 10338024 <na> <na> 8 10338027 <na> <na>

ADD REPLY • link 3.6 years ago Didi ▴ 10

0

Entering edit mode

Thanks a lot. I'm still having many NAs.

PROBEID            ENSEMBL        SYMBOL

1 10338002 <na> <na> 2 10338008 <na> <na> 3 10338011 <na> <na> 4 10338016 <na> <na> 5 10338018 <na> <na> 6 10338019 <na> <na> 7 10338024 <na> <na> 8 10338027 <na> <na>

ADD REPLY • link 3.6 years ago Didi ▴ 10

1

Entering edit mode

There's a post from a 2008 at https://stat.ethz.ch/pipermail/bioconductor/2008-September/024167.html which seems to mention most of the transcript cluster IDs you're reporting as NA, along with a note from Affymetrix saying:

These tend to be very short, non-biologically interesting sequences and were excluded from the PGF with the intent that they should not be analyzed by users. So the advice is that you can safely ignore them.

It's been a long time since I worked with Affy ST arrays, so I'd suggest reading that post yourself, but maybe you don't need to worry about missing annotation for those transcript clusters if that was Affymetrix's opinion at the time.

ADD REPLY • link 3.6 years ago Mike Smith ★ 6.5k

2

Entering edit mode

We can go straight to Affy for more info as well.

## See the vignette for AffyCompatible if you care to understand this part
> library(AffyCompatible)
> rsrc <- NetAffxResource("jmacdon@med.umich.edu", pwd)
> grep("MoGene", names(rsrc), value = TRUE)
[1] "MoGene-1_0-st-v1" "MoGene-1_1-st-v1" "MoGene-2_0-st-v1" "MoGene-2_1-st-v1"
> annodf <- readAnnotation(rsrc, annotation = rsrc[["MoGene-1_0-st-v1", 4]], comment.char = "#")
## get your ids
> ids <- scan("clipboard")
Read 31 items
> ids
 [1] 10338002 10338008 10338011 10338016 10338018 10338019 10338024 10338027
 [9] 10338028 10338033 10338034 10338040 10338046 10338051 10338054 10338055
[17] 10338061 10338162 10338214 10338244 10338336 10338396 10338446 10338654
[25] 10338760 10338890 10338969 10338979 10338988 10338994 10339070

> names(annodf)
 [1] "transcript_cluster_id" "probeset_id"           "seqname"              
 [4] "strand"                "start"                 "stop"                 
 [7] "total_probes"          "gene_assignment"       "mrna_assignment"      
[10] "swissprot"             "unigene"               "GO_biological_process"
[13] "GO_cellular_component" "GO_molecular_function" "pathway"              
[16] "protein_domains"       "crosshyb_type"         "category"           
## NOTE THAT the mrna_assignment column contains the, um, mRNA assignments
> annodf[annodf[,1] %in% ids,9]
 [1] "---"                                           
 [2] "---"                                           
 [3] "---"                                           
 [4] "---"                                           
 [5] "---"                                           
 [6] "---"                                           
 [7] "---"                                           
 [8] "---"                                           
 [9] "---"                                           
[10] "---"                                           
[11] "---"                                           
[12] "---"                                           
[13] "---"                                           
[14] "---"                                           
[15] "---"                                           
[16] "---"                                           
[17] "---"                                           
[18] "--- // pos_control // --- // --- // --- // ---"
[19] "--- // pos_control // --- // --- // --- // ---"
[20] "--- // pos_control // --- // --- // --- // ---"
[21] "--- // pos_control // --- // --- // --- // ---"
[22] "--- // pos_control // --- // --- // --- // ---"
[23] "--- // pos_control // --- // --- // --- // ---"
[24] "--- // pos_control // --- // --- // --- // ---"
[25] "--- // pos_control // --- // --- // --- // ---"
[26] "--- // neg_control // --- // --- // --- // ---"
[27] "--- // neg_control // --- // --- // --- // ---"
[28] "--- // neg_control // --- // --- // --- // ---"
[29] "--- // neg_control // --- // --- // --- // ---"
[30] "--- // neg_control // --- // --- // --- // ---"
[31] "--- // neg_control // --- // --- // --- // ---"
## And these should be long boring strings, starting with some sort of transcript ID:
> substr(annodf[1:5,9], 1, 60)
[1] "XR_877151 // RefSeq // PREDICTED: Mus musculus RIKEN cDNA G4"
[2] "ENSMUST00000082908 // ENSEMBL // predicted gene, 26206 [gene"
[3] "KnowTID_00000002 // Luo lincRNA // Non-coding transcript ide"
[4] "ENSMUST00000193658 // ENSEMBL // predicted gene 6123 [gene_b"
[5] "NM_008866 // RefSeq // Mus musculus lysophospholipase 1 (Lyp"
## How many start with --- (e.g., have no transcript)?
> length(grep("^---", annodf[,11]))/nrow(annodf)
[1]  0.3638205
## So about 36% of the probesets on this array have no mRNA transcript assigned!
## which +/- lines up with what we get from the ChipDb for this array
> sumis.na(mapIds(mogene10sttranscriptcluster.db, keys(mogene10sttranscriptcluster.db), "ENTREZID", "PROBEID"))/length(keys(mogene10sttranscriptcluster.db)))
'select()' returned 1:many mapping between keys and columns
[1] 0.3146586

ADD REPLY • link 3.6 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks a lot. It's quite clear now. How to use the AffyCompatible library to get the ensembl annotation? I didn't know about this package. I tried

strsplit(annodf[annodf[,1] %in% ids,9], "//")

It didn't work.

ADD REPLY • link 3.6 years ago Didi ▴ 10

0

Entering edit mode

It's a bit difficult to work with the Affy data directly, so I am not sure I would recommend doing that. The column with the mRNA targets is formatted as ID // Source // Name // Chr // Other stuff /// Repeat. So as a random example,

> annodf[444,9]
[1] "NM_001077425 // RefSeq // Mus musculus contactin associated protein-like 5A (Cntnap5a), mRNA. // chr1 // 100 // 100 // 47 // 47 // 0 /// XM_006529773 // RefSeq // PREDICTED: Mus musculus contactin associated protein-like 5A (Cntnap5a), transcript variant X1, mRNA. // chr1 // 100 // 100 // 47 // 47 // 0 /// XM_006529774 // RefSeq // PREDICTED: Mus musculus contactin associated protein-like 5A (Cntnap5a), transcript variant X2, mRNA. // chr1 // 100 // 100 // 47 // 47 // 0 /// XM_006529775 // RefSeq // PREDICTED: Mus musculus contactin associated protein-like 5A (Cntnap5a), transcript variant X3, mRNA. // chr1 // 100 // 70 // 33 // 33 // 0 /// ENSMUST00000043725 // ENSEMBL // contactin associated protein-like 5A [gene_biotype:protein_coding transcript_biotype:protein_coding] // chr1 // 100 // 100 // 47 // 47 // 0 /// OTTMUST00000052468 // Havana transcript // contactin associated protein-like 5A[gene_biotype:protein_coding transcript_biotype:protein_coding] // chr1 // 100 // 100 // 47 // 47 // 0 /// GENSCAN00000035677 // ENSEMBL // cdna:genscan chromosome:GRCm38:1:115900590:115949429:1 transcript_biotype:protein_coding // chr1 // 100 // 9 // 4 // 4 // 0 /// GENSCAN00000035681 // ENSEMBL // cdna:genscan chromosome:GRCm38:1:116032988:116580674:1 transcript_biotype:protein_coding // chr1 // 100 // 83 // 39 // 39 // 0 /// uc007cia.1 // UCSC Genes // --- // chr1 // 100 // 100 // 47 // 47 // 0"

Where there are four RefSeq IDs, then an Ensembl ID, then a Havana transcript ID, then two GenCode IDs, and finally a UCSC transcript.

And if I pick another random one, it's different:

> annodf[4444,9]
[1] "NM_019409 // RefSeq // Mus musculus oligodendrocyte myelin glycoprotein (Omg), mRNA. // chr11 // 100 // 100 // 26 // 26 // 0 /// ENSMUST00000164465 // ENSEMBL // oligodendrocyte myelin glycoprotein [gene_biotype:protein_coding transcript_biotype:protein_coding] // chr11 // 100 // 100 // 26 // 26 // 0 /// BC024757 // GenBank // Mus musculus oligodendrocyte myelin glycoprotein, mRNA (cDNA clone MGC:35932 IMAGE:5369312), complete cds. // chr11 // 100 // 96 // 25 // 25 // 0 /// OTTMUST00000000577 // Havana transcript // oligodendrocyte myelin glycoprotein[gene_biotype:protein_coding transcript_biotype:protein_coding] // chr11 // 100 // 100 // 26 // 26 // 0 /// GENSCAN00000056512 // ENSEMBL // cdna:genscan chromosome:GRCm38:11:79501699:79502712:-1 transcript_biotype:protein_coding // chr11 // 100 // 46 // 12 // 12 // 0 /// uc007kkn.2 // UCSC Genes // --- // chr11 // 100 // 100 // 26 // 26 // 0"

So you could parse that and get the Ensembl or GenCode IDs, but that sounds boring and hard and pointless when you could just use the mogene10sttranscriptcluster.db package instead.

OR you could realize that Affy haven't even updated their data for years now, so it's unlikely whatever you will get is materially different from what you already have (I <del>make</del> made these annotation packages for every release up until a few years ago, and now they just get re-processed with a new version since they are essentially static now).

ADD REPLY • link 3.6 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks a lot.

So what's the best strategy you advice to get the latest Ensembl annotation from an old affy annotation coming from an old microarray analysis?

We want to compare those results with new RNA-seq results.

Thanks again.

ADD REPLY • link 3.6 years ago Didi ▴ 10

0

Entering edit mode

I think I already answered that question. Use the mogene10sttranscriptcluster.db package.

ADD REPLY • link 3.6 years ago James W. MacDonald 65k

0

Entering edit mode

Thank you so much for your help.

ADD REPLY • link 3.6 years ago Didi ▴ 10

0

Entering edit mode

One of our collaborators suggested to use the updated annotation from the BRAINARRAY project. Is this annotation included in the mogene10sttranscriptcluster.db package?

If not is there a package that takes into account this annotation.

Thanks a lot.

ADD REPLY • link 3.6 years ago Didi ▴ 10

1

Entering edit mode

The MBNI groups makes the annotation available for entrez gene-based remappings (only); see the columns labelled 'A' on their website with downloads. Note that the remapped probeset IDs simply consist of (in this case) the entrez ID with an '_at' suffix, If you strip these off, you can also use/query the org.Mm.eg.db package; I usually use this approach.

If you would like to use ENSEMBL-based remappings, I recommend to use the EnsDb packages that can be retrieved from the AnnotationHub. To get you started with this, see for example here and here. Note that you need to strip off the '_at' suffix from these probesets IDs as well!

ADD REPLY • link 3.6 years ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

Just specifically on the BrainArray part: I recently re-analysed a study that used a custom BrainArray. For annotation, I downloaded the *.db package from the following location, and install locally: http://brainarray.mbni.med.umich.edu/bioc/src/contrib/