How to pick Illumina EPIC Array probe IDs within a specific genomic region?
1
0
Entering edit mode
@poojithastemcell-11859
Last seen 3.1 years ago
United Kingdom

I have a list of genes for which I need to pull out related DNA methylation data from my EPIC array dataset. I would like to visualise this information in a genomic browser if possible. I want to pick CpGs from different genomic regions in and around the specific genes i.e. gene body, 5'UTR, 3'UTR, TSS, promoter, etc. Can anyone help me with this? Is there any online tool I can use to create a track of my data in a genome browser?

IlluminaHumanMethylationEPICmanifest • 2.1k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 hour ago
United States

You are asking a pretty general question, which is difficult to answer. Here is a specific answer, which may be helpful in orienting you.

# example data
> library(minfiData)
> data(MsetEx)
## need to map to the genome - if you use minfi it usually happens as part of the preprocessing
> MsetEx <- mapToGenome(MsetEx)
> MsetEx
class: GenomicMethylSet 
dim: 485512 6 
metadata(0):
assays(2): Meth Unmeth
rownames(485512): cg13869341 cg14008030 ... cg08265308 cg14273923
rowData names(0):
colnames(6): 5723646052_R02C02 5723646052_R04C01 ... 5723646053_R05C02
  5723646053_R06C02
colData names(13): Sample_Name Sample_Well ... Basename filenames
Annotation
  array: IlluminaHumanMethylation450k
  annotation: ilmn12.hg19
Preprocessing
  Method: Raw (no normalization or bg correction)
  minfi version: 1.21.2
  Manifest version: 0.4.0

## now we can extract positional data using a GRanges object to say what positions we care about I'll just fake something up

> fakeGR <- GRanges(c("chr1", "chr2"), IRanges(c(15000, 15000), c(30000, 30000)))

> ms <- subsetByOverlaps(MsetEx, fakeGR)
> ms
class: GenomicMethylSet 
dim: 5 6 
metadata(0):
assays(2): Meth Unmeth
rownames(5): cg13869341 cg14008030 cg12045430 cg20826792 cg00381604
rowData names(0):
colnames(6): 5723646052_R02C02 5723646052_R04C01 ... 5723646053_R05C02
  5723646053_R06C02
colData names(13): Sample_Name Sample_Well ... Basename filenames
Annotation
  array: IlluminaHumanMethylation450k
  annotation: ilmn12.hg19
Preprocessing
  Method: Raw (no normalization or bg correction)
  minfi version: 1.21.2
  Manifest version: 0.4.0

## See? It's smaller now

## and now we could convert the data into a GRanges
> z <- rowRanges(ms)

> mcols(z) <- cbind(getMeth(ms), getUnmeth(ms))
> z
GRanges object with 5 ranges and 12 metadata columns:
             seqnames    ranges strand | 5723646052_R02C02 5723646052_R04C01
                <Rle> <IRanges>  <Rle> |         <numeric>         <numeric>
  cg13869341     chr1     15865      * |             37782             43291
  cg14008030     chr1     18827      * |             13119             21718
  cg12045430     chr1     29407      * |              1221              1759
  cg20826792     chr1     29425      * |              2708              2967
  cg00381604     chr1     29435      * |               587               555
             5723646052_R05C02 5723646053_R04C02 5723646053_R05C02
                     <numeric>         <numeric>         <numeric>
  cg13869341             46988             36686             40965
  cg14008030             21113             19502             23834
  cg12045430              1162              3836              1543
  cg20826792              2180              3906              3035
  cg00381604               506               873               686
             5723646053_R06C02 5723646052_R02C02 5723646052_R04C01
                     <numeric>         <numeric>         <numeric>
  cg13869341             44036              7270              2778
  cg14008030             22968              7122             13020
  cg12045430              1381             16802             20690
  cg20826792              2807             17744             18621
  cg00381604               945             16059             17723
             5723646052_R05C02 5723646053_R04C02 5723646053_R05C02
                     <numeric>         <numeric>         <numeric>
  cg13869341              4909              7187              8529
  cg14008030              8597              8185             11111
  cg12045430             17308             17867             19840
  cg20826792             17640             17201             19277
  cg00381604             14534             15602             17430
             5723646053_R06C02
                     <numeric>
  cg13869341              1899
  cg14008030              9067
  cg12045430             19557
  cg20826792             18413
  cg00381604             15559
  -------
  seqinfo: 24 sequences from hg19 genome; no seqlengths

And you could use e.g., Gviz to make plots of that.

ADD COMMENT
0
Entering edit mode

Thank you James. That is very helpful. I will give it a try.

ADD REPLY
0
Entering edit mode

Hi James! I have similar question with the original post. But after I got the mapped probes, how should I calculate the mean methylation value of a specific gene from the probes mapped to the gene location? Seems there's no information on which probes mapped to which genes?

ADD REPLY
1
Entering edit mode

Here's an example. Your homework is to decipher what I did.

> library(minfi)
> library(minfiData)
> z <- preprocessFunnorm(RGsetEx)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> pp <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene)
> pp
GRanges object with 82960 ranges and 2 metadata columns:
                   seqnames        ranges strand |     tx_id     tx_name
                      <Rle>     <IRanges>  <Rle> | <integer> <character>
  uc001aaa.3           chr1    9874-12073      + |         1  uc001aaa.3
  uc010nxq.1           chr1    9874-12073      + |         2  uc010nxq.1
  uc010nxr.1           chr1    9874-12073      + |         3  uc010nxr.1
  uc001aal.1           chr1   67091-69290      + |         4  uc001aal.1
  uc001aaq.2           chr1 319084-321283      + |         5  uc001aaq.2

## Now I already said that you need to use subsetByOverlaps, and I just got all the promoter regions
## you might want other genomic regions, but this is just an example so you have to figure out how to do what you want if it's something different

## Now get the mean M-value, per sample, per region, for the first 20
> fakeO <- do.call(rbind, lapply(1:20, function(x) colMeans(getM(subsetByOverlaps(z, pp[x])))))
> fakeO
      5723646052_R02C02 5723646052_R04C01 5723646052_R05C02 5723646053_R04C02
 [1,]               NaN               NaN               NaN               NaN
 [2,]               NaN               NaN               NaN               NaN
 [3,]               NaN               NaN               NaN               NaN
 [4,]         0.4886808         0.9086010        -0.2618946        -0.1448504
 [5,]               NaN               NaN               NaN               NaN
 [6,]               NaN               NaN               NaN               NaN
 [7,]               NaN               NaN               NaN               NaN
 [8,]               NaN               NaN               NaN               NaN
 [9,]               NaN               NaN               NaN               NaN
[10,]               NaN               NaN               NaN               NaN
[11,]               NaN               NaN               NaN               NaN
[12,]               NaN               NaN               NaN               NaN
[13,]        -2.0552704        -1.9338784        -2.4166071        -2.2670802
[14,]        -5.0269002        -4.8170834        -4.9346891        -4.5480297
[15,]        -5.0269002        -4.8170834        -4.9346891        -4.5480297
[16,]        -5.0269002        -4.8170834        -4.9346891        -4.5480297
[17,]        -5.0269002        -4.8170834        -4.9346891        -4.5480297
[18,]        -5.0269002        -4.8170834        -4.9346891        -4.5480297
[19,]        -5.0269002        -4.8170834        -4.9346891        -4.5480297
[20,]        -0.2155665        -0.2172631        -0.8050240        -0.2081479
      5723646053_R05C02 5723646053_R06C02
 [1,]               NaN               NaN
 [2,]               NaN               NaN
 [3,]               NaN               NaN
 [4,]        0.42406292        -0.5063752
 [5,]               NaN               NaN
 [6,]               NaN               NaN
 [7,]               NaN               NaN
 [8,]               NaN               NaN
 [9,]               NaN               NaN
[10,]               NaN               NaN
[11,]               NaN               NaN
[12,]               NaN               NaN
[13,]       -1.99018087        -2.5155456
[14,]       -4.72432917        -4.9795912
[15,]       -4.72432917        -4.9795912
[16,]       -4.72432917        -4.9795912
[17,]       -4.72432917        -4.9795912
[18,]       -4.72432917        -4.9795912
[19,]       -4.72432917        -4.9795912
[20,]       -0.03463982        -1.0646269

Having shown how to do what you specifically asked about, I will now caution you against doing that sort of thing. I mean it might be a legit thing to do, but it's non-standard. Usually you would look for regions that look like peaks of differential methylation and then try to determine if they are different. The general consensus is that these regions should look like peaks, and there are different ways of detecting them. minfi has the bumphunter method, DMRcate uses a Gaussian kernel, etc. Those are more sophisticated than a simple mean over a region.

ADD REPLY
0
Entering edit mode

Hi James! Thank you very much for your detailed example and kind suggestions. I'll look into the bumphunter method.

ADD REPLY

Login before adding your answer.

Traffic: 1126 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6