Question

Getting CDS positioning for IlluminaHumanv4 annotations

0

Entering edit mode

Sylvain Foisy ▴ 70

@sylvain-foisy-5539

Last seen 4.6 years ago

Canada

Hi,

Anyone anyone have a suggestion on how to know if a given probe in a list is either 5' UTR, coding or 3' UTR using data from illuminaHumanv4.db?

Thanks in advance

Sylvain

illumina annotation • 1.3k views

ADD COMMENT • link 9.3 years ago Sylvain Foisy ▴ 70

score 0 · Answer 1 · 2014-12-17

Hi Sylvain,

Not quite; the best you can do is to tell whether it is inronic, intergenic or coding

library(illuminaHumanv4.db)

prbs <- mappedkeys(illuminaHumanv4CODINGZONE)

table(unlist(mget(prbs, illuminaHumanv4CODINGZONE)))

     Intergenic        Intronic  Transcriptomic Transcriptomic? 
           2019            1057           37611            6290

Hope this helps,

Mark

score 0 · Answer 2 · 2014-12-17

Hi Sylvain,

Here is one way to go that makes use of the gene model stored in the TxDb object for hg19. It requires a little bit of coding. But first we need to get the genomic ranges of your probes. We'll use the CHRLOC and CHRLOCEND maps from the illuminaHumanv4.db package for that. Note that a given probe can be mapped to more than 1 genomic location:

library(GenomicRanges)
probesToGRanges <- function(chipdb, probe_ids)
{
  pkgname <- map_prefix <- get("packageName", chipdb@.xData)
  nc <- nchar(map_prefix)
  if (substr(map_prefix, nc - 2L, nc) == ".db")
    map_prefix <- substr(map_prefix, 1, nc - 3L)
  CHRLOC_map_name <- paste0(map_prefix, "CHRLOC")
  CHRLOCEND_map_name <- paste0(map_prefix, "CHRLOCEND")
  CHRLOC_map <- get(CHRLOC_map_name, envir=asNamespace(pkgname))
  CHRLOCEND_map <- get(CHRLOCEND_map_name, envir=asNamespace(pkgname))
  chrloc <- IntegerList(mget(probe_ids, CHRLOC_map))
  chrlocend <- IntegerList(mget(probe_ids, CHRLOCEND_map))
  chrloc_eltlens <- elementLengths(chrloc)
  chrlocend_eltlens <- elementLengths(chrlocend)
  ## Sanity check.
  stopifnot(identical(chrloc_eltlens, chrlocend_eltlens))
  chrloc <- unlist(chrloc, use.names=FALSE)
  chrlocend <- unlist(chrlocend, use.names=FALSE)
  probe_id <- rep.int(names(chrloc_eltlens), chrloc_eltlens)
  is_na <- is.na(chrloc)
  ## Sanity check.
  stopifnot(identical(is_na, is.na(chrlocend)))
  if (any(is_na)) {
    keep_idx <- which(!is_na)
    chrloc <- chrloc[keep_idx]
    chrlocend <- chrlocend[keep_idx]
    probe_id <- probe_id[keep_idx]
  }
  is_neg <- chrloc < 0L
  ## Sanity check.
  stopifnot(identical(is_neg, chrlocend < 0L))
  ans_seqnames <- names(chrloc)
  ans_ranges <- IRanges(abs(chrloc), abs(chrlocend))
  ans_strand <- strand(is_neg)
  GRanges(ans_seqnames, ans_ranges, ans_strand, probe_id)
}

library(illuminaHumanv4.db)
my_probes <- sample(keys(illuminaHumanv4.db), 10)
my_probe_ranges <- probesToGRanges(illuminaHumanv4.db, my_probes)
my_probe_ranges
# GRanges object with 13 ranges and 1 metadata column:
#       seqnames                 ranges strand   |     probe_id
#          <Rle>              <IRanges>  <Rle>   |  <character>
#   [1]        9 [128509617, 128729655]      +   | ILMN_1810100
#   [2]        9 [128510478, 128729655]      +   | ILMN_1810100
#   [3]        7 [ 40172342,  40174251]      -   | ILMN_1745119
#   [4]       10 [ 63166401,  63213208]      -   | ILMN_1809639
#   [5]        X [ 10413350,  10645779]      -   | ILMN_2371433
#   ...      ...                    ...    ... ...          ...
#   [9]        X [ 10437305,  10535643]      -   | ILMN_2371433
#  [10]        X [ 10473381,  10535643]      -   | ILMN_2371433
#  [11]        7 [140396481, 140406446]      +   | ILMN_2117330
#  [12]        8 [  2792875,   4852328]      -   | ILMN_1746945
#  [13]        2 [110300374, 110371783]      -   | ILMN_2258958
#  -------
#  seqinfo: 6 sequences from an unspecified genome; no seqlengths

Then get the CDS, 5' and 3' UTRs for hg19:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
cds <- cdsBy(txdb, by="tx")
five_utrs <- fiveUTRsByTranscript(txdb)
three_utrs <- threeUTRsByTranscript(txdb)

Finally find where your probes are landing:

seqlevels(my_probe_ranges) <- paste0("chr", seqlevels(my_probe_ranges))
seqinfo(my_probe_ranges) <- merge(seqinfo(my_probe_ranges), seqinfo(txdb))
mcols(my_probe_ranges)$is_coding <- my_probe_ranges %within% cds
mcols(my_probe_ranges)$is_in_5utr <- my_probe_ranges %within% five_utrs
mcols(my_probe_ranges)$is_in_3utr <- my_probe_ranges %within% three_utrs
my_probe_ranges
# GRanges object with 13 ranges and 4 metadata columns:
#        seqnames                 ranges strand   |     probe_id is_coding
#           <Rle>              <IRanges>  <Rle>   |  <character> <logical>
#    [1]     chr9 [128509617, 128729655]      +   | ILMN_1810100     FALSE
#    [2]     chr9 [128510478, 128729655]      +   | ILMN_1810100     FALSE
#    [3]     chr7 [ 40172342,  40174251]      -   | ILMN_1745119     FALSE
#    [4]    chr10 [ 63166401,  63213208]      -   | ILMN_1809639     FALSE
#    [5]     chrX [ 10413350,  10645779]      -   | ILMN_2371433     FALSE
#    ...      ...                    ...    ... ...          ...       ...
#    [9]     chrX [ 10437305,  10535643]      -   | ILMN_2371433     FALSE
#   [10]     chrX [ 10473381,  10535643]      -   | ILMN_2371433     FALSE
#   [11]     chr7 [140396481, 140406446]      +   | ILMN_2117330     FALSE
#   [12]     chr8 [  2792875,   4852328]      -   | ILMN_1746945     FALSE
#   [13]     chr2 [110300374, 110371783]      -   | ILMN_2258958     FALSE
#        is_in_5utr is_in_3utr
#         <logical>  <logical>
#    [1]      FALSE      FALSE
#    [2]      FALSE      FALSE
#    [3]      FALSE      FALSE
#    [4]      FALSE      FALSE
#    [5]      FALSE      FALSE
#    ...        ...        ...
#    [9]      FALSE      FALSE
#   [10]      FALSE      FALSE
#   [11]      FALSE      FALSE
#   [12]      FALSE      FALSE
#   [13]      FALSE      FALSE
#   -------
#   seqinfo: 93 sequences (1 circular) from hg19 genome

You can use %over% instead of %within%, in which case a TRUE is returned when
a probe range has an overlap with one of the ranges in the GRangesList on the
right (with %within% the probe range has to be fully contained in one of the
ranges on the right).

Cheers,
H.

score 0 · Answer 3 · 2014-12-19

0

Entering edit mode

Sylvain Foisy ▴ 70

@sylvain-foisy-5539

Last seen 4.6 years ago

Canada

SHi to all,

Thanks for the pointers ;-) Like Mike said, using GENOMICLOCATION is a better way since it gives actual position of the probe on the genome; every once in a while, a probe will span two axons. And as Mike said, Illumina does not use the concept of probesets: a probe is a unique 50bp oligo on the bead. A few genes have 2 or 3 probes but they are usually scattered across the exons.

Hervé: I'll try your suggestion with Mike's inputs ASAP and I'll provide the results ;-)

Thanks for the help!

S

ADD COMMENT • link 9.3 years ago Sylvain Foisy ▴ 70

0

Entering edit mode

You're welcome. Hopefully probesToGRanges() code is not too hard to adapt to work with the GENOMICLOCATION map.

H.

PS: Please use the ADD COMMENT link so your comment appears like a comment and not like an answer to your own question.

ADD REPLY • link 9.3 years ago Hervé Pagès 16k