Hi,
Anyone anyone have a suggestion on how to know if a given probe in a list is either 5' UTR, coding or 3' UTR using data from illuminaHumanv4.db?
Thanks in advance
Sylvain
Hi,
Anyone anyone have a suggestion on how to know if a given probe in a list is either 5' UTR, coding or 3' UTR using data from illuminaHumanv4.db?
Thanks in advance
Sylvain
Hi Sylvain,
Not quite; the best you can do is to tell whether it is inronic, intergenic or coding
library(illuminaHumanv4.db)
prbs <- mappedkeys(illuminaHumanv4CODINGZONE)
table(unlist(mget(prbs, illuminaHumanv4CODINGZONE))) Intergenic Intronic Transcriptomic Transcriptomic? 2019 1057 37611 6290
Hope this helps,
Mark
Hi Sylvain,
Here is one way to go that makes use of the gene model stored in the TxDb object for hg19. It requires a little bit of coding. But first we need to get the genomic ranges of your probes. We'll use the CHRLOC and CHRLOCEND maps from the illuminaHumanv4.db package for that. Note that a given probe can be mapped to more than 1 genomic location:
library(GenomicRanges) probesToGRanges <- function(chipdb, probe_ids) { pkgname <- map_prefix <- get("packageName", chipdb@.xData) nc <- nchar(map_prefix) if (substr(map_prefix, nc - 2L, nc) == ".db") map_prefix <- substr(map_prefix, 1, nc - 3L) CHRLOC_map_name <- paste0(map_prefix, "CHRLOC") CHRLOCEND_map_name <- paste0(map_prefix, "CHRLOCEND") CHRLOC_map <- get(CHRLOC_map_name, envir=asNamespace(pkgname)) CHRLOCEND_map <- get(CHRLOCEND_map_name, envir=asNamespace(pkgname)) chrloc <- IntegerList(mget(probe_ids, CHRLOC_map)) chrlocend <- IntegerList(mget(probe_ids, CHRLOCEND_map)) chrloc_eltlens <- elementLengths(chrloc) chrlocend_eltlens <- elementLengths(chrlocend) ## Sanity check. stopifnot(identical(chrloc_eltlens, chrlocend_eltlens)) chrloc <- unlist(chrloc, use.names=FALSE) chrlocend <- unlist(chrlocend, use.names=FALSE) probe_id <- rep.int(names(chrloc_eltlens), chrloc_eltlens) is_na <- is.na(chrloc) ## Sanity check. stopifnot(identical(is_na, is.na(chrlocend))) if (any(is_na)) { keep_idx <- which(!is_na) chrloc <- chrloc[keep_idx] chrlocend <- chrlocend[keep_idx] probe_id <- probe_id[keep_idx] } is_neg <- chrloc < 0L ## Sanity check. stopifnot(identical(is_neg, chrlocend < 0L)) ans_seqnames <- names(chrloc) ans_ranges <- IRanges(abs(chrloc), abs(chrlocend)) ans_strand <- strand(is_neg) GRanges(ans_seqnames, ans_ranges, ans_strand, probe_id) } library(illuminaHumanv4.db) my_probes <- sample(keys(illuminaHumanv4.db), 10) my_probe_ranges <- probesToGRanges(illuminaHumanv4.db, my_probes) my_probe_ranges # GRanges object with 13 ranges and 1 metadata column: # seqnames ranges strand | probe_id # <Rle> <IRanges> <Rle> | <character> # [1] 9 [128509617, 128729655] + | ILMN_1810100 # [2] 9 [128510478, 128729655] + | ILMN_1810100 # [3] 7 [ 40172342, 40174251] - | ILMN_1745119 # [4] 10 [ 63166401, 63213208] - | ILMN_1809639 # [5] X [ 10413350, 10645779] - | ILMN_2371433 # ... ... ... ... ... ... # [9] X [ 10437305, 10535643] - | ILMN_2371433 # [10] X [ 10473381, 10535643] - | ILMN_2371433 # [11] 7 [140396481, 140406446] + | ILMN_2117330 # [12] 8 [ 2792875, 4852328] - | ILMN_1746945 # [13] 2 [110300374, 110371783] - | ILMN_2258958 # ------- # seqinfo: 6 sequences from an unspecified genome; no seqlengths
Then get the CDS, 5' and 3' UTRs for hg19:
library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene cds <- cdsBy(txdb, by="tx") five_utrs <- fiveUTRsByTranscript(txdb) three_utrs <- threeUTRsByTranscript(txdb)
Finally find where your probes are landing:
seqlevels(my_probe_ranges) <- paste0("chr", seqlevels(my_probe_ranges)) seqinfo(my_probe_ranges) <- merge(seqinfo(my_probe_ranges), seqinfo(txdb)) mcols(my_probe_ranges)$is_coding <- my_probe_ranges %within% cds mcols(my_probe_ranges)$is_in_5utr <- my_probe_ranges %within% five_utrs mcols(my_probe_ranges)$is_in_3utr <- my_probe_ranges %within% three_utrs my_probe_ranges # GRanges object with 13 ranges and 4 metadata columns: # seqnames ranges strand | probe_id is_coding # <Rle> <IRanges> <Rle> | <character> <logical> # [1] chr9 [128509617, 128729655] + | ILMN_1810100 FALSE # [2] chr9 [128510478, 128729655] + | ILMN_1810100 FALSE # [3] chr7 [ 40172342, 40174251] - | ILMN_1745119 FALSE # [4] chr10 [ 63166401, 63213208] - | ILMN_1809639 FALSE # [5] chrX [ 10413350, 10645779] - | ILMN_2371433 FALSE # ... ... ... ... ... ... ... # [9] chrX [ 10437305, 10535643] - | ILMN_2371433 FALSE # [10] chrX [ 10473381, 10535643] - | ILMN_2371433 FALSE # [11] chr7 [140396481, 140406446] + | ILMN_2117330 FALSE # [12] chr8 [ 2792875, 4852328] - | ILMN_1746945 FALSE # [13] chr2 [110300374, 110371783] - | ILMN_2258958 FALSE # is_in_5utr is_in_3utr # <logical> <logical> # [1] FALSE FALSE # [2] FALSE FALSE # [3] FALSE FALSE # [4] FALSE FALSE # [5] FALSE FALSE # ... ... ... # [9] FALSE FALSE # [10] FALSE FALSE # [11] FALSE FALSE # [12] FALSE FALSE # [13] FALSE FALSE # ------- # seqinfo: 93 sequences (1 circular) from hg19 genome
You can use %over%
instead of %within%
, in which case a TRUE is returned when
a probe range has an overlap with one of the ranges in the GRangesList on the
right (with %within%
the probe range has to be fully contained in one of the
ranges on the right).
Cheers,
H.
SHi to all,
Thanks for the pointers ;-) Like Mike said, using GENOMICLOCATION is a better way since it gives actual position of the probe on the genome; every once in a while, a probe will span two axons. And as Mike said, Illumina does not use the concept of probesets: a probe is a unique 50bp oligo on the bead. A few genes have 2 or 3 probes but they are usually scattered across the exons.
Hervé: I'll try your suggestion with Mike's inputs ASAP and I'll provide the results ;-)
Thanks for the help!
S
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
But let's step back a little bit. I didn't realize immediately that the CHRLOC and CHRLOCEND maps provide the locations of the *gene* that a probe id is mapped to (via the ENSEMBL map), and not the location of the probe itself (which is not a well defined concept anyway since a probe id is actually a probe set id and generally consists of more than 1 probe).
So going from probe ids to genomic ranges could actually be performed in a different way: first go from probe ids to Ensembl gene ids (via the ENSEMBL map), then use the TxDb object for hg19 (but this time the one obtained from the Ensembl Genes track):
Note that my_probe_ranges2 is similar to my_probe_ranges but not exactly the same. This could be due to different versions of Ensembl that were used to make the illuminaHumanv4.db and the Ensembl Genes track.
Anyway, the point is that since these ranges are gene ranges, it doesn't really make sense to find out where they land with respect to their CDS, 5' or 3' UTR regions.
Cheers,
H.
I think your original approach would have worked if you used the GENOMICLOCATION map, which provides the actual genomic location of the probes, rather than the borders of gene they map within. Again, probes can map to more than one location if there are several equally good matches.
Also, for Illumina arrays it probably is fair to consider individual probes separately. I don't think they follow the Affy paradigm of probes and probesets.