I'm using biomaRt to find the Ensembl transcripts detected by the Illumina ht-12 v4 probes. For roughly 1/4 of the probes the probe ID is not being recognized. In the example below, the first probe is recognized, but the second is not.
Any suggestions?
> library(biomaRt)
> ensembl = useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", host='www.ensembl.org')
>
> getBM(attributes=c("illumina_humanht_12_v4", "hgnc_symbol", "ensembl_transcript_id"),
+ filters = "illumina_humanht_12_v4",
+ values = "ILMN_1789991",
+ mart = ensembl)
illumina_humanht_12_v4 hgnc_symbol ensembl_transcript_id
1 ILMN_1789991 MARCH4 ENST00000273067
>
> getBM(attributes=c("illumina_humanht_12_v4", "hgnc_symbol", "ensembl_transcript_id"),
+ filters = "illumina_humanht_12_v4",
+ values = "ILMN_1735038",
+ mart = ensembl)
[1] illumina_humanht_12_v4 hgnc_symbol ensembl_transcript_id
<0 rows> (or 0-length row.names)
>
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.26.1
loaded via a namespace (and not attached):
[1] IRanges_2.4.4 parallel_3.2.2 DBI_0.3.1
[4] RCurl_1.95-4.7 Biobase_2.30.0 AnnotationDbi_1.32.0
[7] RSQLite_1.0.0 S4Vectors_0.8.3 BiocGenerics_0.16.1
[10] stats4_3.2.2 bitops_1.0-6 XML_3.98-1.3
Dear Thomas,
Thank you for your prompt reply.
Is the result of my biomaRt query, in which only 35,319 of the 47,231 Illumina HT-12 v4 probes mapped to an Ensembl transcript, a reasonable one? Or does the fact that over a quarter of them failed to map mean that I've done something wrong?
Harker
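One way to check that proportion is to compare the full probe list against the probe IDs that getBM actually returned. The following is only a minimal base R sketch: the helper name `mapping_summary` is hypothetical (it is not part of biomaRt), and a toy data frame built from the two probes in the original post stands in for a real query result.

```r
# Summarise how many queried probes came back with at least one transcript.
# `queried` is the character vector of probe IDs passed to getBM(values = ...);
# `result` is the data frame that getBM() returned.
# NOTE: mapping_summary() is an illustrative helper, not a biomaRt function.
mapping_summary <- function(queried, result,
                            id_col = "illumina_humanht_12_v4") {
  queried  <- unique(queried)
  mapped   <- unique(result[[id_col]])
  unmapped <- setdiff(queried, mapped)
  list(
    n_queried         = length(queried),
    n_mapped          = length(intersect(queried, mapped)),
    n_unmapped        = length(unmapped),
    fraction_unmapped = length(unmapped) / length(queried),
    unmapped          = unmapped
  )
}

# Toy stand-in for a getBM() result: one probe maps, the other does not.
toy_result <- data.frame(
  illumina_humanht_12_v4 = "ILMN_1789991",
  hgnc_symbol            = "MARCH4",
  ensembl_transcript_id  = "ENST00000273067",
  stringsAsFactors       = FALSE
)
s <- mapping_summary(c("ILMN_1789991", "ILMN_1735038"), toy_result)
s$unmapped

# Fraction unmapped for the counts reported above (35,319 of 47,231 mapped).
frac_unmapped <- 1 - 35319 / 47231
frac_unmapped
```

Passing the complete probe vector and the full getBM result to the same helper would give the list of unmapped probe IDs, which can then be inspected individually.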
Dear Harker,
I've checked with my colleague doing the probe mapping in Ensembl and this number seems reasonable to them. Please find the most probable explanation for it below:
Hope this helps,
Regards,
Thomas
Hi Thomas,
The first link you supply is for Affymetrix arrays, not Illumina. While there are some low stringency controls on the array, those are a vanishingly small proportion of the total probes, and wouldn't come close to the number of 'missing' probes the OP is talking about.
In addition, the probe the OP is asking about is a regular probe that is intended to measure MARCH3. The annotation file from Illumina has this to say about that probe:
So by all rights one would expect that this probe should be mapped to the five Ensembl transcripts that exist for MARCH3:
However, this does bring up the fact that annotating things is a non-trivial task, and depending on who is doing the annotation, you will inevitably get differences.
So the annotation package that is generated by Mark Dunning (based I presume on the annotation package supplied by Illumina) says there are 32,554 Illumina -> Ensembl transcript mappings, whereas the Ensembl Biomart says there are 35,354 such mappings.
And there are only 28,429 Illumina probes where both agree that there are any Illumina -> Ensembl transcript mappings at all, not to mention agreeing on what those mappings might be!
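A comparison of this kind can be sketched in base R. The two data frames below are toy stand-ins for the annotation package and the Biomart results, and every probe and transcript ID in them is made up for illustration; this is not the code that produced the numbers above.

```r
# Toy probe -> transcript tables standing in for the two annotation sources.
# All probe and transcript IDs here are invented for illustration.
pkg <- data.frame(
  probe      = c("P1", "P2", "P3", "P3"),
  transcript = c("T1", "T2", "T3", "T4"),
  stringsAsFactors = FALSE
)
mart <- data.frame(
  probe      = c("P2", "P3", "P4"),
  transcript = c("T2", "T5", "T6"),
  stringsAsFactors = FALSE
)

# Probes with at least one mapping in each source, and in both.
probes_pkg  <- unique(pkg$probe)
probes_mart <- unique(mart$probe)
both        <- intersect(probes_pkg, probes_mart)

# Among the shared probes, check whether the transcript sets agree exactly.
same_set <- vapply(both, function(p) {
  setequal(pkg$transcript[pkg$probe == p], mart$transcript[mart$probe == p])
}, logical(1))

c(pkg = length(probes_pkg), mart = length(probes_mart),
  both = length(both), agree = sum(same_set))
```

Separating "both sources map the probe to something" from "both sources map it to the same transcripts" matters here, because the second number can be much smaller than the first.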
Dear James,
"The first link you supply is for Affymetrix arrays, not Illumina. While there are some low stringency controls on the array, those are a vanishingly small proportion of the total probes, and wouldn't come close to the number of 'missing' probes the OP is talking about."
You are very right, thanks for correcting me.
"So the annotation package that is generated by Mark Dunning (based I presume on the annotation package supplied by Illumina) says there are 32,554 Illumina -> Ensembl transcript mappings, whereas the Ensembl Biomart says there are 35,354 such mappings.
And there are only 28,429 Illumina probes where both agree that there are any Illumina -> Ensembl transcript mappings at all, not to mention agreeing on what those mappings might be!"
Thanks a lot for bringing this to our attention. As you said, probe mapping is a complex task, and different cutoffs and settings can generate different results.
I had a look at the Illumina files (https://support.illumina.com/array/array_kits/humanht-12_v4_expression_beadchip_kit/downloads.html) and noticed that the date listed for the text version is “05/23/2013”, but when you open the file the date inside is 15/4/2010. That takes us back to either Ensembl release 58 (May 2010), when the human assembly was still GRCh37.p7, or release 72 (June 2013), when the human assembly was GRCh37.p11. Ensembl.org, currently on release 83, uses the human assembly GRCh38.p5, so as you can imagine the sequences and gene set have changed quite a lot, and I hope that we are comparing the same thing.
I also don't know how the illuminaHumanv4.db package maps from the Illumina file to Ensembl; is it mapping via RefSeq? In Ensembl we take the probe sequences from Illumina and run Exonerate to align them to our genome (more information on the following page: http://www.ensembl.org/info/genome/microarray_probe_set_mapping.html).
I will contact Mark Dunning and investigate this issue further.
Thanks again for your feedback.
Regards,
Thomas
Mark does some extra annotation where he re-aligns the probes. These data are in the 'ExtraInfo' table:
So I think he does something much more sophisticated than using the Illumina mappings straight away. However, for this and all other ChipDb packages, there is only one mapping, in the end, that matters, and that is from Illumina ID -> Entrez Gene ID. The ChipDb packages may contain other tables, but when you use them to do a given mapping, the AnnotationDbi package will attach the 'probes' table of the ChipDb package to the required table in the respective OrgDb package (org.Hs.eg.db in this case) and then do a SQL query using the two tables.
So for instance if you want the Ensembl Trans ID for an Illumina probe, you will end up attaching the 'ensembl_trans' table from org.Hs.eg.db and then doing an inner join between the probes table in the illuminaHumanv4.db package and the ensembl_trans table from org.Hs.eg.db, using the Entrez Gene ID (the table index) to do the join. So we are dependent on what NCBI (and probably Ensembl) think are the correct mappings between Entrez Gene and Ensembl trans, as we just use what we can get from NCBI to set up the org.Hs.eg.db package.
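The join described above can be mimicked in base R. This is only a sketch under simplifying assumptions: the two data frames are toy versions of the 'probes' and 'ensembl_trans' tables, the column names are simplified rather than the real SQLite schema, and all IDs are invented.

```r
# Toy versions of the two tables; column names and IDs are illustrative only.
# ChipDb 'probes' table: Illumina probe ID -> Entrez Gene ID.
probes <- data.frame(
  probe_id = c("ILMN_A", "ILMN_B", "ILMN_C"),
  gene_id  = c("101", "102", NA),   # ILMN_C has no Entrez Gene ID at all
  stringsAsFactors = FALSE
)
# OrgDb 'ensembl_trans' table: Entrez Gene ID -> Ensembl transcript ID.
ensembl_trans <- data.frame(
  gene_id  = c("101", "101", "103"),
  trans_id = c("ENST_X1", "ENST_X2", "ENST_X3"),
  stringsAsFactors = FALSE
)

# Inner join on the Entrez Gene ID: a probe survives only if it has a gene ID
# and that gene has at least one transcript in the OrgDb table.
joined <- merge(probes[!is.na(probes$gene_id), ], ensembl_trans,
                by = "gene_id")
joined[, c("probe_id", "gene_id", "trans_id")]
```

In this toy case only ILMN_A comes through, once per transcript of its gene; ILMN_B is dropped because its gene has no transcript entry, and ILMN_C because it has no Entrez Gene ID. With the real packages installed, the same lookup would normally go through AnnotationDbi, with something like `select(illuminaHumanv4.db, keys = "ILMN_1735038", columns = "ENSEMBLTRANS", keytype = "PROBEID")`.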