Question

Can't match on some, but not all, Illumina probe IDs using biomaRt

charkerrhodes • 0

United States

I'm using biomaRt to find the Ensembl transcripts detected by the Illumina ht-12 v4 probes. For roughly 1/4 of the probes the probe ID is not being recognized. In the example below, the first probe is recognized, but the second is not.

Any suggestions?

> library(biomaRt)
> ensembl = useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", host='www.ensembl.org')
>
> getBM(attributes=c("illumina_humanht_12_v4", "hgnc_symbol", "ensembl_transcript_id"),
+ filters = "illumina_humanht_12_v4",
+ values = "ILMN_1789991",
+ mart = ensembl)
illumina_humanht_12_v4 hgnc_symbol ensembl_transcript_id
1 ILMN_1789991 MARCH4 ENST00000273067
>
> getBM(attributes=c("illumina_humanht_12_v4", "hgnc_symbol", "ensembl_transcript_id"),
+ filters = "illumina_humanht_12_v4",
+ values = "ILMN_1735038",
+ mart = ensembl)
[1] illumina_humanht_12_v4 hgnc_symbol ensembl_transcript_id
<0 rows> (or 0-length row.names)
>
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] biomaRt_2.26.1

loaded via a namespace (and not attached):
[1] IRanges_2.4.4 parallel_3.2.2 DBI_0.3.1
[4] RCurl_1.95-4.7 Biobase_2.30.0 AnnotationDbi_1.32.0
[7] RSQLite_1.0.0 S4Vectors_0.8.3 BiocGenerics_0.16.1
[10] stats4_3.2.2 bitops_1.0-6 XML_3.98-1.3

ADD COMMENT • link updated 8.4 years ago by Thomas Maurel ▴ 800 • written 8.4 years ago by charkerrhodes • 0

score 0 · Answer 1 · 2015-12-14

Thomas Maurel ▴ 800

United Kingdom

Hello,

The probe "ILMN_1735038" is not returned by BioMart because it has not mapped to any of our Ensembl transcripts. The Ensembl mart and website only return probes that have been successfully mapped to Ensembl transcripts. You can find more information about our probe mapping on the following page: http://www.ensembl.org/info/genome/microarray_probe_set_mapping.html
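Because unmapped probes are simply absent from the result, they can be listed by comparing the IDs that were queried with the IDs that came back. A minimal sketch in base R, assuming a hand-typed data frame that stands in for the getBM() result shown in the question (the names `queried`, `bm_result` and `unmapped` are only illustrative):

```r
# Probe IDs that were sent to BioMart.
queried <- c("ILMN_1789991", "ILMN_1735038")

# Stand-in for the getBM() result; typed by hand from the question's output,
# not fetched from the mart.
bm_result <- data.frame(
  illumina_humanht_12_v4 = "ILMN_1789991",
  hgnc_symbol            = "MARCH4",
  ensembl_transcript_id  = "ENST00000273067",
  stringsAsFactors       = FALSE
)

# Probes that were asked for but are missing from the result have no
# transcript mapping in the mart.
unmapped <- setdiff(queried, bm_result$illumina_humanht_12_v4)
print(unmapped)
```

With a real query, `bm_result` would be the data frame returned by getBM() and `queried` the vector passed as `values`.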

You can still get probes that have not mapped to our transcripts in either of two ways:

- Using our MySQL databases. The following query will return ILMN_1735038 with the reason why it hasn't been mapped:
  mysql --host ensembldb.ensembl.org --port 5306 --user anonymous homo_sapiens_funcgen_83_38 -e "select * from probe join unmapped_object on (probe.probe_id=unmapped_object.ensembl_id and unmapped_object.ensembl_object_type=\"Probe\") join unmapped_reason using(unmapped_reason_id) where probe.name = \"ILMN_1735038\""
- Using the Ensembl Perl API: http://www.ensembl.org/info/docs/api/funcgen/regulation_tutorial.html#microarray
  Using "get_all_UnmappedObjects" should work:
  http://www.ensembl.org/info/docs/Doxygen/funcgen-api/classBio_1_1EnsEMBL_1_1Funcgen_1_1Storable.html#a887e9c8cd583cac43d50aec21ae42ab0
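To look up several probes, the same SQL can be generated for any probe name. The sketch below only builds the query string in base R; the helper name `build_unmapped_query` is hypothetical, and actually running the query would need a MySQL client (the mysql command line or, as an assumption, an R database driver), which is not shown here.

```r
# Hypothetical helper: builds the unmapped-probe query for one probe name.
# The table and column names are copied from the mysql command above.
build_unmapped_query <- function(probe_name) {
  # Accept only simple identifiers so the name can be pasted into SQL safely.
  stopifnot(length(probe_name) == 1,
            grepl("^[A-Za-z0-9_]+$", probe_name))
  sprintf(
    paste(
      "select * from probe",
      "join unmapped_object on (probe.probe_id = unmapped_object.ensembl_id",
      "and unmapped_object.ensembl_object_type = 'Probe')",
      "join unmapped_reason using (unmapped_reason_id)",
      "where probe.name = '%s'"
    ),
    probe_name
  )
}

query <- build_unmapped_query("ILMN_1735038")
cat(query, "\n")
```

The resulting string can be passed to mysql with -e in place of the literal query.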

Hope this helps,

Regards,

Thomas

ADD COMMENT • link 8.4 years ago Thomas Maurel ▴ 800

Dear Thomas,

Thank you for your prompt reply.

Is the result of my biomaRt query, in which only 35,319 of the 47,231 Illumina ht-12 v4 probes mapped to an Ensembl transcript, a reasonable one? Or does the fact that over a quarter of them failed to map mean that I've done something wrong?

Harker

ADD REPLY • link 8.4 years ago charkerrhodes • 0

Dear Harker,

I've checked with my colleague doing the probe mapping in Ensembl and this number seems reasonable to them. Please find the most probable explanation for it below:

Illumina sometimes uses probes with one base mismatch to gauge the difference in intensity:

"The mismatch probe differs from the perfect match probe by a single base substitution at the center base position, disturbing the binding of the target gene transcript. This helps to determine the background and nonspecific hybridization that contributes to the signal measured for the perfect match oligo." (https://medicine.yale.edu/keck/ycga/microarrays/affymetrix/)

The Illumina ht-12 v4 is a bead chip:

"HumanHT-12 v4 Expression BeadChip Kit": (http://www.illumina.com/products/humanht_12_expression_beadchip_kits_v4.html)

They don't provide information on the controls used on that page, but after digging a bit I found https://www.dkfz.de/gpcf/illumina_beadchips.html, which mentions that such controls are present:

"Low Stringency Hyb Control: [..] In this case, each probe has two mismatch bases distributed in its sequence. If stringency is adequate, these controls yield very low signal. If stringency is too low, they yield signal approaching that of their perfect match counterparts"

This somewhat contradicts the Yale web page on whether the controls carry one or two mismatched bases, but it would explain why we would get a large number of mismatches.

Hope this helps,

Regards,

Thomas

ADD REPLY • link 8.4 years ago Thomas Maurel ▴ 800

Hi Thomas,

The first link you supply is for Affymetrix arrays, not Illumina. While there are some low stringency controls on the array, those are a vanishingly small proportion of the total probes, and wouldn't come close to the number of 'missing' probes the OP is talking about.

In addition, the probe the OP is asking about is a regular probe that is intended to measure MARCH3. The annotation file from Illumina has this to say about that probe:

ILMN_1735038    Homo sapiens    RefSeq  NM_178450.2     ILMN_25008      MARCH3   NM_178450.2     NM_178450.2             115123  31341961        NM_178450.2

So by all rights one would expect that this probe should be mapped to the five Ensembl transcripts that exist for MARCH3:

> library(illuminaHumanv4.db)
> select(illuminaHumanv4.db, "ILMN_1735038", c("ENSEMBL", "ENSEMBLTRANS"))
'select()' returned 1:many mapping between keys and columns
       PROBEID         ENSEMBL    ENSEMBLTRANS
1 ILMN_1735038 ENSG00000173926 ENST00000308660
2 ILMN_1735038 ENSG00000173926 ENST00000506088
3 ILMN_1735038 ENSG00000173926 ENST00000504239
4 ILMN_1735038 ENSG00000173926 ENST00000515241
5 ILMN_1735038 ENSG00000173926 ENST00000502289

However, this does bring up the fact that annotating things is a non-trivial task, and depending on who is doing the annotation, you will inevitably get differences.

> z <- mapIds(illuminaHumanv4.db, keys(illuminaHumanv4.db), "ENSEMBLTRANS", "PROBEID")
> length(z)
[1] 48107
> sum(!is.na(z))
[1] 32554

> mart <- useMart("ENSEMBL_MART_ENSEMBL","hsapiens_gene_ensembl", host = "www.ensembl.org")
> zz <- getBM(c("illumina_humanht_12_v4", "ensembl_gene_id"), "illumina_humanht_12_v4",keys(illuminaHumanv4.db), mart)

> zz <- split(zz, zz[,1])
> length(zz)
[1] 35354

> sum(names(z)[!is.na(z)] %in% names(zz))
[1] 28429

So the annotation package that is generated by Mark Dunning (based I presume on the annotation package supplied by Illumina) says there are 32,554 Illumina -> Ensembl transcript mappings, whereas the Ensembl Biomart says there are 35,354 such mappings.

And there are only 28,429 Illumina probes where both agree that there are any Illumina -> Ensembl transcript mappings at all, not to mention agreeing on what those mappings might be!
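The bookkeeping behind these counts can be sketched with plain set operations. The probe and transcript IDs below are made up for illustration; only the three counts at the end are taken from the output above.

```r
# Toy stand-in for the mapIds() result: a named vector, NA where the
# annotation package has no transcript for the probe. IDs are invented.
pkg_map <- c(p1 = "T1", p2 = NA, p3 = "T3", p4 = "T4")

# Toy stand-in for the probes that BioMart returned at least one row for.
mart_ids <- c("p1", "p2", "p4")

pkg_ids   <- names(pkg_map)[!is.na(pkg_map)]
both      <- intersect(pkg_ids, mart_ids)   # mapped by both sources
only_pkg  <- setdiff(pkg_ids, mart_ids)     # mapped only by the package
only_mart <- setdiff(mart_ids, pkg_ids)     # mapped only by BioMart

# The same subtraction applied to the counts reported above.
n_pkg  <- 32554
n_mart <- 35354
n_both <- 28429
disagree <- c(only_pkg = n_pkg - n_both, only_mart = n_mart - n_both)
print(disagree)
```

On the reported counts this leaves 4,125 probes that only the annotation package maps and 6,925 that only BioMart maps.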

ADD REPLY • link 8.4 years ago James W. MacDonald 65k

Dear James,

"The first link you supply is for Affymetrix arrays, not Illumina. While there are some low stringency controls on the array, those are a vanishingly small proportion of the total probes, and wouldn't come close to the number of 'missing' probes the OP is talking about."

You are very right, thanks for correcting me.

"So the annotation package that is generated by Mark Dunning (based I presume on the annotation package supplied by Illumina) says there are 32,554 Illumina -> Ensembl transcript mappings, whereas the Ensembl Biomart says there are 35,354 such mappings.

And there are only 28,429 Illumina probes where both agree that there are any Illumina -> Ensembl transcript mappings at all, not to mention agreeing on what those mappings might be!"

Thanks a lot for bringing this to our attention. As you said, probe mapping is a complex task, and different cutoffs and settings can generate different results. I had a look at the Illumina files (https://support.illumina.com/array/array_kits/humanht-12_v4_expression_beadchip_kit/downloads.html) and noticed that the date listed for the text version is "05/23/2013", but when you open the file you get a date of 15/4/2010. This goes back to Ensembl release 58 (May 2010), when the human assembly was still GRCh37.p7, or release 72 (June 2012), when the human assembly was GRCh37.p11. Ensembl release 83 on ensembl.org is on the human assembly GRCh38.p5, so as you can imagine the sequences and gene set have changed quite a lot, and I hope that we are comparing the same thing. I also don't know how the illuminaHumanv4.db package maps from the Illumina file to Ensembl: is it mapping using RefSeq? In Ensembl we take the probe sequences from Illumina and run Exonerate to compare and align them to our genome (more information on the following page: http://www.ensembl.org/info/genome/microarray_probe_set_mapping.html).

I will contact Mark Dunning and investigate this issue further.

Thanks again for your feedback.

Regards,

Thomas

ADD REPLY • link 8.4 years ago Thomas Maurel ▴ 800

Mark does some extra annotation where he re-aligns the probes. These data are in the 'ExtraInfo' table:

> con <- illuminaHumanv4_dbconn()
> dbGetQuery(con, "select * from ExtraInfo limit 10;")
     IlluminaID ArrayAddress               NuID ProbeQuality CodingZone
1  ILMN_3166687      5270161 fVO7UPeDN.UnC595UU     No match       <NA>
2  ILMN_3165565      4230037 iEi8IfI0mJK7GVaF_o     No match       <NA>
3  ILMN_3164808        60372 xhGZ_EYeDUnwiV5BmM     No match       <NA>
4  ILMN_3165363      5260356 ieQ5TwQXyRrP1LJ64k     No match       <NA>
5  ILMN_3166504      6060692 cnbRQdtGY2AyfocO1o     No match       <NA>
6  ILMN_3164750      6370471 9PgL_oqHPRLMjivvko     No match       <NA>
7  ILMN_3166430      1710435 HUSTVLbg8LMCQdRQ70     No match       <NA>
8  ILMN_3165745      1400612 ZdnuDceSBzq.nJmWs0     No match       <NA>
9  ILMN_3164915      5130189 rqBYPEQbChsXnXFVgU     No match       <NA>
10 ILMN_3165415        70278 rVszqdr4Xc_pvvhWD0     No match       <NA>
                                        ProbeSequence SecondMatches
1  CCCATGTGTCCAATTCTGAATATCTTTCCAGCTAAGTGCTTCTGCCCACC          <NA>
2  ACAGAGTTAAGACTTAGATCAGCGAGCAGGTGTACGCCCCGGACCTTGGG          <NA>
3  GACACGCGCTTGACACGACTGAATCCAGCTTAAGAGCCCTGCAACGCGAT          <NA>
4  CTGCAATGCCATTAACAACCTTAGCACGGTATTTCCAGTAGCTGGTGAGC          <NA>
5  GCTCGTCACCAACTCGTCACGCGATCGAAATAGCTTGGACTAATGTCCGG          <NA>
6  ATTGAAAGTTTGGGAGGGACTATTCACAGTATAGATGAGGTTGTTGCAGG          <NA>
7  CCACAGCATCCCAGTCGTGAATTAAGTATAAAGCAACTCCACCAATGTTC          <NA>
8  CTCGCTGTGAATCTACTGCAGAACTATGGGTTTGCTAGCGCGCCGGTATC          <NA>
9  GGGAACCGAATTACACAACGTAAGGACGTACCTGCTCCTACCCCCGAACC          <NA>
10 CCCGTATATGGGCTCGGTTGACCTCTATTGGGCGTTGTTGACCCGAATTC          <NA>
   OtherGenomicMatches RepeatMask OverlappingSNP EntrezReannotated
1                 <NA>       <NA>           <NA>              <NA>
2                 <NA>       <NA>           <NA>              <NA>
3                 <NA>       <NA>           <NA>              <NA>
4                 <NA>       <NA>           <NA>              <NA>
5                 <NA>       <NA>           <NA>              <NA>
6                 <NA>       <NA>           <NA>              <NA>
7                 <NA>       <NA>           <NA>              <NA>
8                 <NA>       <NA>           <NA>              <NA>
9                 <NA>       <NA>           <NA>              <NA>
10                <NA>       <NA>           <NA>              <NA>
   GenomicLocation SymbolReannotated ReporterGroupName ReporterGroupID
1             <NA>              <NA>        ERCC-00162      ERCC-00162
2             <NA>              <NA>        ERCC-00071      ERCC-00071
3             <NA>              <NA>        ERCC-00009      ERCC-00009
4             <NA>              <NA>        ERCC-00053      ERCC-00053
5             <NA>              <NA>        ERCC-00144      ERCC-00144
6             <NA>              <NA>        ERCC-00003      ERCC-00003
7             <NA>              <NA>        ERCC-00138      ERCC-00138
8             <NA>              <NA>        ERCC-00084      ERCC-00084
9             <NA>              <NA>        ERCC-00017      ERCC-00017
10            <NA>              <NA>        ERCC-00057      ERCC-00057
   EnsemblReannotated
1                <NA>
2                <NA>
3                <NA>
4                <NA>
5                <NA>
6                <NA>
7                <NA>
8                <NA>
9                <NA>
10               <NA>

So I think he does something much more sophisticated than using the Illumina mappings straight away. However, for this and all other ChipDb packages, there is only one mapping, in the end, that matters, and that is from Illumina ID -> Entrez Gene ID. The ChipDb packages may contain other tables, but when you use them to do a given mapping, the AnnotationDbi package will attach the 'probes' table of the ChipDb package to the required table in the respective OrgDb package (org.Hs.eg.db in this case) and then do a SQL query using the two tables.

So for instance if you want the Ensembl Trans ID for an Illumina probe, you will end up attaching the 'ensembl_trans' table from org.Hs.eg.db and then doing an inner join between the probes table in the illuminaHumanv4.db package and the ensembl_trans table from org.Hs.eg.db, using the Entrez Gene ID (the table index) to do the join. So we are dependent on what NCBI (and probably Ensembl) think are the correct mappings between Entrez Gene and Ensembl trans, as we just use what we can get from NCBI to set up the org.Hs.eg.db package.
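As a rough picture of that join, here is a base R sketch using merge() on two toy tables. The column names (`probe_id`, `entrez_id`, `trans_id`) and the second probe are simplified, hypothetical stand-ins rather than the real schema of either package; the Entrez Gene ID and transcript IDs for ILMN_1735038 are copied from earlier in this thread.

```r
# Toy version of the 'probes' table in the ChipDb package.
# "ILMN_TOY_0001" is an invented probe with no Entrez Gene ID.
probes <- data.frame(
  probe_id  = c("ILMN_1735038", "ILMN_TOY_0001"),
  entrez_id = c("115123", NA),
  stringsAsFactors = FALSE
)

# Toy version of the 'ensembl_trans' table in org.Hs.eg.db (two rows only).
ensembl_trans <- data.frame(
  entrez_id = c("115123", "115123"),
  trans_id  = c("ENST00000308660", "ENST00000506088"),
  stringsAsFactors = FALSE
)

# Inner join on the Entrez Gene ID: a probe without a gene ID, or a gene
# without transcripts in the second table, drops out of the result.
joined <- merge(probes, ensembl_trans, by = "entrez_id")
print(joined[, c("probe_id", "trans_id")])
```

The probe without an Entrez Gene ID never reaches the output, which is why the probe -> Entrez Gene mapping decides everything downstream.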

ADD REPLY • link 8.4 years ago James W. MacDonald 65k