I'm not really sure why Kasper rearranged
the IlluminaHumanMethylation450kmanifest package; it's more of a pain
to
use now. In the first release of minfi, there was a nice little
example of
how to cast the source sequences, strand, and interrogated site from
the
HM450k manifest into a GRanges object and go from there. Perhaps
another
project stole all of his time. Anyways, the following should be
useful for
people using either 'minfi' or 'methylumi'; I would suggest that, for
visualization and annotation purposes, a SummarizedExperiment is the
logical terminus for processed DNA methylation array data, as it
segues
more naturally into BSseq, ChIPseq, and RNAseq integration.
In order to level the playing ground, I will construct a Ranges object
using the annotations at GEO which the TCGA ADF now references.
Since I'm refreshing both the 450k.db and probe packages (the latter
has a
bug in the example script; don't use it), I will put something like
the
following into the documentation. The bottleneck for refreshing the
.db
package is the slowness of build checks with db0-type chip-style
schemas
and my slow learning of how to use a FeatureDb instead. I strongly
believe
that GRanges are the appropriate data structure for this type of
annotation
data. Personally I've been combining 450k results with RNAseq and
ChIPseq
data to look at some (IMHO) rather interesting phenomena, and having
the
flexibility of a GRanges-based object (a SummarizedExperiment, to be
specific) is enormously useful. On the other hand, I haven't found
that
segmentation-based approaches seem to be very useful for the 450k
data, as
some fine-scale changes get lost when you compare to BS-seq data in
similar
cell types. In my experience, it makes more sense to define genomic
regions and sweep up changes in those, rather than segmenting DNA
methylation, but that's just one data point. So, here goes.
require(GEOquery)
require(GenomicRanges)
require(GenomicFeatures)
require(BSgenome.Hsapiens.UCSC.hg19)
hm450k <- getGEO('GPL13534')
## Here's a gotcha that you might like to avoid:
## (one of the reasons I'm bothering with a 450k.db update is that I
have
resolved it)
##
anyMissing(as.numeric(hm450k@dataTable@table$MAPINFO))
## TRUE
##
## exclude these (SNP probes)
noMap <- whichis.na(as.numeric(hm450k@dataTable@table$MAPINFO)))
hm450 <- with( hm450k@dataTable@table[ -noMap, ],
GRanges( paste0('chr', CHR),
IRanges(as.numeric(MAPINFO), width=1),
probe=Name,
probeStrand=Strand,
sourceSeq=DNAStringSet(SourceSeq),
)
)
names(hm450) <- values(hm450)$probe
values(hm450)$probeType = Rle(ifelse(substr(names(hm450),1,2)=='cg',
'CpG',
'CpH'))
Now we have something resembling annotation (but not really, for
reasons
I'll demonstrate later).
## inspect it:
head(hm450, 1)
GRanges with 1 range and 4 elementMetadata cols:
seqnames ranges strand | probe
probeStrand
sourceSeq probeType
<rle> <iranges> <rle> | <character>
<character>
<dnastringset> <rle>
cg00035864 chrY [8553009, 8553009] * | cg00035864
F
AGACACTAGCAGTCTTGTCCACATAGACCCTTGAATTTATCTCAAATTCG CpG
I propose that you always sanity check your work at each step. Here's
one
way I typically do it:
hm450.subset <- hm450[ sample(names(hm450), 10) ]
values(hm450.subset)$siteSeq <- getSeq(Hsapiens, resize(hm450.subset,
2))
show(hm450.subset)
## GRanges with 10 ranges and 5 elementMetadata cols:
## seqnames ranges strand | probe
probeStrand ## sourceSeq
probeType
siteSeq
## <rle> <iranges> <rle> | <character>
<character> ## <dnastringset>
<rle>
<dnastringset>
## ch.X.665616F chrX [ 41078352, 41078352] * |
ch.X.665616F
F CAGGCTTATGACAACTTAAGCTTGAGTGATCACTTACTAAGAGCAGTACT CpH
CA
## cg19794706 chr14 [ 75643232, 75643232] * |
cg19794706
R GGAGATGGCAAGGACCAATCTGGGGCCGAGCAGGAACAAAAGCAGCAACG CpG
CG
## cg15391651 chr1 [165132273, 165132273] * |
cg15391651
F CGGAGCCCTGAGTGTGCACAAAGCACCACTATGCCAGAGTGATGTTATCA CpG
CG
Suppose you wanted to mask probes whose interrogated site overlaps
with
repeats in the genome. This is easy:
library(parallel)
REPEATS <- do.call(c, mclapply(seqlevels(HM450.HG19), function(ch) {
GRanges(ch, union(masks(Hsapiens[[ch]])$RM,
masks(Hsapiens[[ch]])$TRF))
}))
rpts.450 <- subsetByOverlaps(hm450, REPEATS)
summary(rpts.450)
## Length Class Mode
## 78164 GRanges S4
The example gets to the point where the Ranges/FeatureDb-based version
is;
I'll post another in a while, using the distanceToNearest
functionality to
see how far a given probe is from a TSS, either UCSC or FANTOM-
annotated,
and assorted other things that use a TxDb object to compute quickly. I
am
never going to put up a PDF with a "protocol" (the #$%@?!) that I'm
not
1000% sure of ever again...
I strongly suggest that you take whatever container your data is in,
and
turn it into a SummarizedExperiment with the hm450 object as the
rowData.
I believe you will thank me later on, to the point that I've written
a
generic to do this for MethyLumiSet objects, and could do the same for
MethyLumiM and MethylSet objects if there is demand. I would like to
patch
Gviz to handle SummarizedExperiments soon, too.
One thing I have not done here is to mask SNPs, because without MAF
information (or, better, SNP information on a subject), I feel like it
may
be swatting a fly with a Buick to mask all CpG/CpH overlaps with dbSNP
regardless of minor allele frequency. You can use the Probe_SNPs
and Probe_SNPs_10 columns in the manifest if you want to mask in that
fashion against dbSNP build 131, but there are other things that I
find
wanting in that approach (i.e. you should widen the interrogated
sequences
appropriately). Anyways...
Hope this helps. Basically, if you want to look at the probes on the
450k
platform which fall into a given feature or class of features, you
assemble
a GRanges object matching those data (perhaps ChIPseq peaks, perhaps
UCSC
tracks, perhaps FeatureDb output) and use subsetByOverlaps with the
appropriate arguments to isolate the desired probes in those regions.
The
trick is finding feature data.
--t
On Fri, May 4, 2012 at 7:15 AM, Ed Siefker <ebs15242@gmail.com> wrote:
> I've run some arrays through minfi and have a table of
differentially
> methylated positions. I can get the probe sequence from
> IlluminaHumanMethylation450kmanifest, but it's not clear to me how
to
> translate that into genes, promoters, chromosomes or anything
> biologically informative.
>
> What do people usually do at this point?
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
>
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
*A model is a lie that helps you see the truth.*
*
*
Howard
Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">
[[alternative HTML version deleted]]