illuminaHumanv4.db : annotation source, and mapping to transcript names
VSM ▴ 120
@vsm-6019
United Kingdom

Hi there

I have been using the R package illuminaHumanv4.db to annotate our HT12 v4 array probes, and I have two questions:

* The description of the package says the data is assembled from public repositories. However, the reference manual notes that extensive reannotation has been carried out for the Illumina probes. Am I right in thinking that the reannotation (i.e. genomic location, EnsemblReannotated IDs, etc.) is from the paper:
A re-annotation pipeline for Illumina BeadArrays: improving the interpretation of gene expression data
http://nar.oxfordjournals.org/content/38/3/e17/F1.expansion.html
If not, can someone shed some light as to where this reannotation is coming from / citation of how it was redone?

* I am looking to map the probe IDs to Ensembl transcript IDs, not just to genes. The package doesn't seem to have this information (only Ensembl gene IDs). Could I obtain this somewhere? Perhaps information on the first point might help. I know Ensembl has these mappings, but if the package's annotation differs from Ensembl's, I can't go this route.

Many thanks!
Vicky

 

microarray annotation illumina human ht-12 v4 • 2.5k views
svlachavas ▴ 830
@svlachavas-7225
Germany/Heidelberg/German Cancer Resear…

Dear Victoria,

Regarding the second part of your question: I have also recently used this Illumina platform to annotate my probes with gene symbols and Entrez Gene IDs. I think the information you are looking for may already be present in the current annotation package:

columns(illuminaHumanv4.db)
 [1] "PROBEID"      "ENTREZID"     "PFAM"         "IPI"          "PROSITE"     
 [6] "ACCNUM"       "ALIAS"        "CHR"          "CHRLOC"       "CHRLOCEND"   
[11] "ENZYME"       "MAP"          "PATH"         "PMID"         "REFSEQ"      
[16] "SYMBOL"       "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
[21] "GENENAME"     "UNIPROT"      "GO"           "EVIDENCE"     "ONTOLOGY"    
[26] "GOALL"        "EVIDENCEALL"  "ONTOLOGYALL"  "OMIM"         "UCSCKG"    

Unless you are looking for something different and I misunderstood your question.
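For what it's worth, a minimal sketch of pulling the ENSEMBLTRANS column for a few probes could look like the following. The three probe IDs are only examples, and the call is guarded so the snippet does nothing if the package is not installed. As far as I understand, these standard columns are linked to the probes through the gene-level annotation rather than by aligning probe sequences, so one probe will typically return every transcript of its gene.

```r
# Example probe IDs (illustrative only; substitute your own).
probes <- c("ILMN_1709092", "ILMN_2321292", "ILMN_1662334")

if (requireNamespace("illuminaHumanv4.db", quietly = TRUE)) {
  library(illuminaHumanv4.db)

  # One row per probe/transcript combination; expect 1:many mappings.
  res <- AnnotationDbi::select(
    illuminaHumanv4.db,
    keys    = probes,
    keytype = "PROBEID",
    columns = c("SYMBOL", "ENSEMBL", "ENSEMBLTRANS")
  )
  print(head(res))

  # Number of rows (transcripts) returned per probe.
  print(table(res$PROBEID))
}
```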

Best,

Efstathios


Thanks Efstathios, 

I didn't notice that; I mostly looked at the documentation. Would you happen to know what the difference is between ENSEMBL and EnsemblReannotated (illuminaHumanv4ENSEMBLREANNOTATED)?

I would assume that the former is derived directly from the Ensembl annotation, while the latter comes from a custom reannotation (for which I cannot find transcript IDs). There are quite a few discrepancies between the two sets of Ensembl gene IDs.

From a quick glance at the top few, the ENSEMBL IDs are not exactly the same as the ones returned by BioMart, either.
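To put a number on those discrepancies, something along these lines might help. This is only a sketch: `mappings_agree` is a hypothetical helper written for illustration, and the package part assumes that both `illuminaHumanv4ENSEMBL` and `illuminaHumanv4ENSEMBLREANNOTATED` behave as ordinary Bimap objects (so `mappedkeys()` and `mget()` work) and that each entry holds plain Ensembl gene IDs.

```r
# For each probe present in both mappings, report whether the two
# mappings share at least one gene ID (NA if either side is unmapped).
# `a` and `b` are named lists of character vectors keyed by probe ID.
mappings_agree <- function(a, b) {
  ids <- intersect(names(a), names(b))
  vapply(ids, function(id) {
    x <- a[[id]][!is.na(a[[id]])]
    y <- b[[id]][!is.na(b[[id]])]
    if (length(x) == 0 || length(y) == 0) NA else length(intersect(x, y)) > 0
  }, logical(1))
}

# Toy data showing the three possible outcomes (made-up IDs).
toy_a <- list(p1 = "ENSG1", p2 = c("ENSG2", "ENSG3"), p3 = NA_character_)
toy_b <- list(p1 = "ENSG1", p2 = "ENSG9", p3 = "ENSG4")
print(mappings_agree(toy_a, toy_b))

# Same comparison on the real package, if it is installed.
if (requireNamespace("illuminaHumanv4.db", quietly = TRUE)) {
  library(illuminaHumanv4.db)
  ids <- head(mappedkeys(illuminaHumanv4ENSEMBLREANNOTATED), 1000)
  std <- mget(ids, illuminaHumanv4ENSEMBL, ifnotfound = NA)
  rea <- mget(ids, illuminaHumanv4ENSEMBLREANNOTATED, ifnotfound = NA)
  print(table(mappings_agree(std, rea), useNA = "ifany"))
}
```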

Thanks, Vicky

 


You mean you tried to use different annotations from the "columns" options above? Well, I'm not sure about your assumption, as in my case I mostly used gene symbols and Entrez IDs (and in my naive opinion, I believe these are enough). On the other hand, if your specific experimental design requires Ensembl annotations in particular, that is another matter.


Hi Efstathios

New HT12 re-annotations are being published all the time, so the challenge is to find the most reliable one. For the illuminaHumanv4.db package, I am trying to identify what this reannotation is, and how it differs from something like standard Ensembl. I am unclear which columns correspond to which reannotation (and where each comes from), so I can't come to any conclusions.


Have you considered looking at the help pages? Does ?illuminaHumanv4ENSEMBLREANNOTATED not answer your questions?


Hi James, 

Yes, that is where I obtained the information in my original post (point 1) above. I hadn't noticed the ENSEMBL-only annotation (mentioned by Efstathios), which clarifies my assumption.

I still cannot find transcript IDs for the re-annotation pipeline (EnsemblReannotated, etc.), though. Maybe it's plainly obvious and I am just not seeing it?


I don't think there are any transcript IDs annotated in that package, and given that the probes are 50-mers, I sort of doubt many of them can be inferred to be transcript-specific anyway. But do note that the package does give the re-mapped genomic locations.

> z <- illuminaHumanv4fullReannotation()
> z[5000:5010,]
       IlluminaID ArrayAddress               NuID ProbeQuality      CodingZone
5000 ILMN_1824016      2190088 rXqf5ofj79Tcp.4Xu0   Perfect*** Transcriptomic?
5001 ILMN_1709092      3990441 61RbnpngeHtZ5QqC4Q      Perfect  Transcriptomic
5002 ILMN_2321292      1410075 ZoF6LQAgefS1WexZeU          Bad  Transcriptomic
5003 ILMN_2324998      5910138 EiJOOub_JtLCYiAoaY      Perfect  Transcriptomic
5004 ILMN_1662334      6560445 Tl3nrJO2S.U7v0o32o      Perfect  Transcriptomic
5005 ILMN_1715417      4810468 Z6OkUinkgpQg0JyClE      Perfect  Transcriptomic
5006 ILMN_1795218      1340193 Nnhs.TpLQtQ6SbfbWo      Perfect  Transcriptomic
5007 ILMN_2289093      7400743 KVb81O7U_GlNzvn32k          Bad  Transcriptomic
5008 ILMN_1821127      7550600 l1JAYAnj_uQYfoOVUo          Bad Transcriptomic?
5009 ILMN_1739751      7040647 oRXtR_rjT1IVdVATkw      Perfect  Transcriptomic
5010 ILMN_1674650      1980180 rdXdSomTIxJInRLi0g      Perfect  Transcriptomic
                                          ProbeSequence SecondMatches
5000 CCTGGGCTTTGCGGACTTGATTGTTTCCATCTAGGCTTTTGACCTGTGTC          <NA>
5001 TCCCACCGTGCTGGCGCTGAACTGACTGTCCGCTGCCAAGGGAAGTGACA          <NA>
5002 GGAACCTGGAGTCAAAAAGAACTGCTTCAGTCCCCGCTGTACCGCCTGCC          <NA>
5003 GAGAGCATGATGGTGCGTTTGAGCGTCAGTAAGCGAGAGAAAGGACGGCG          <NA>
5004 GCCTCTGCTGGTAGCATGTCGCAGTTTCCATGTGTTTCAGGATCTTCGGG          <NA>
5005 TGGATGGCACCAGAGGCTGCAGAAGGCCAAGAATCAAGCTAGAAGGCCAC          <NA>
5006 GCTGACGTATTTCATGGCAGTCAAGTCCAATGGCAGCGTCTTCGTCCGGG          <NA>
5007 CCCCGTTTATCCATGTGTCCATTGACGGCCATCTATGTTGCTTCTTCGGC          <NA>
5008 TCCAGCAAACGAAAAGCTGATTTGGTGCAACGACTTGGAATGCCCCCAGG          <NA>
5009 CACCCTGTCCACTTGGGTGATCATTCCAGACCCCTCCCCAAACATGCATA          <NA>
5010 CTCCCTCTCCAGGGAGCGCATAGATACAGCAGAGCTCACAGTGAGTCAGA          <NA>
     OtherGenomicMatches       RepeatMask          OverlappingSNP
5000                <NA>             <NA>                    <NA>
5001                <NA>             <NA>                    <NA>
5002                <NA> MIRb_SINE_MIR:50             rs114937162
5003                <NA>             <NA>                    <NA>
5004                <NA>             <NA>                    <NA>
5005                <NA>             <NA>             rs111784512
5006                <NA>             <NA>                    <NA>
5007                <NA> L1MB7_LINE_L1:50              rs79267010
5008                <NA>             <NA>              rs12902628
5009                <NA>             <NA>             rs116742961
5010                <NA>             <NA> rs111428370 rs117366822
     EntrezReannotated            GenomicLocation SymbolReannotated
5000              <NA>   chrX:48365282:48365331:-          BG119374
5001            150000  chr21:15646319:15646368:+            ABCC13
5002             26100     chr7:5273232:5273281:+             WIPI2
5003             25983  chr14:23946422:23946471:+              NGDN
5004              9093    chr16:4506559:4506608:+            DNAJA3
5005              6403 chr1:169558180:169558229:-              SELP
5006             22907   chr3:47891147:47891196:+             DHX30
5007             57674  chr17:78295191:78295240:+            RNF213
5008              <NA>  chr15:59392374:59392423:+          CK905457
5009            284129  chr17:78227104:78227153:+          SLC26A11
5010             54981   chr9:77676212:77676261:-           C9orf95
     ReporterGroupName ReporterGroupID EnsemblReannotated
5000              <NA>            <NA>    ENSG00000224292
5001              <NA>            <NA>    ENSG00000243064
5002              <NA>            <NA>    ENSG00000157954
5003              <NA>            <NA>    ENSG00000129460
5004              <NA>            <NA>    ENSG00000103423
5005              <NA>            <NA>    ENSG00000174175
5006              <NA>            <NA>    ENSG00000132153
5007              <NA>            <NA>    ENSG00000173821
5008              <NA>            <NA>               <NA>
5009              <NA>            <NA>    ENSG00000181045
5010              <NA>            <NA>    ENSG00000106733

And you could pretty easily create a GRanges with those data, and then use findOverlaps() on the transcripts() from a TxDb that you could get by running makeTxDbFromBiomart(), to decide which transcript(s) a given probe will bind to.
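A rough sketch of that idea, under a few assumptions: `parse_location` is a hypothetical helper, each GenomicLocation entry is assumed to hold a single non-missing `chr:start:end:strand` value (drop NAs and handle any multi-location entries first), and the `tx` object here is a made-up stand-in for real transcript ranges so the example runs without a network connection.

```r
# Split "chr:start:end:strand" strings into a data frame.
# Assumes one location per entry and no NA values.
parse_location <- function(loc) {
  parts <- do.call(rbind, strsplit(unname(loc), ":", fixed = TRUE))
  data.frame(
    chr    = parts[, 1],
    start  = as.integer(parts[, 2]),
    end    = as.integer(parts[, 3]),
    strand = parts[, 4],
    stringsAsFactors = FALSE
  )
}

# Two locations copied from the table above, named by probe ID.
locs <- c(ILMN_1709092 = "chr21:15646319:15646368:+",
          ILMN_1715417 = "chr1:169558180:169558229:-")
df <- parse_location(locs)
print(df)

if (requireNamespace("GenomicRanges", quietly = TRUE)) {
  library(GenomicRanges)

  probes_gr <- GRanges(seqnames = df$chr,
                       ranges   = IRanges(start = df$start, end = df$end),
                       strand   = df$strand)
  names(probes_gr) <- names(locs)

  # Toy stand-in for transcripts(txdb); coordinates and name are invented.
  tx <- GRanges("chr21", IRanges(15646000, 15647000), strand = "+",
                tx_name = "toy_tx_1")

  # Strand-aware overlap between probes and transcripts.
  hits <- findOverlaps(probes_gr, tx)
  print(data.frame(probe      = names(probes_gr)[queryHits(hits)],
                   transcript = tx$tx_name[subjectHits(hits)]))
}
```

With real data you would replace `tx` with `transcripts(txdb)`, or with `exonsBy(txdb, by = "tx", use.names = TRUE)` if you want to ignore probes that fall in introns. Two things to check before trusting the overlaps: that the TxDb is on the same genome build as the probe coordinates, and that the chromosome naming matches (Ensembl uses "21" rather than "chr21"; `seqlevelsStyle()` can convert).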


Hi James, 

Thank you for the tips. I thought that might be too time-consuming, but your suggestions should get me there faster!
