minfi -- now what?

0

Entering edit mode

Ed Siefker ▴ 230

@ed-siefker-5136

Last seen 6 months ago

United States

I've run some arrays through minfi and have a table of differentially methylated positions. I can get the probe sequence from IlluminaHumanMethylation450kmanifest, but it's not clear to me how to translate that into genes, promoters, chromosomes or anything biologically informative. What do people usually do at this point?

probe minfi probe minfi • 1.6k views

ADD COMMENT • link updated 12.0 years ago by Tim Triche ★ 4.2k • written 12.0 years ago by Ed Siefker ▴ 230

0

Entering edit mode

Tim Triche ★ 4.2k

@tim-triche-3561

Last seen 3.7 years ago

United States

I'm not really sure why Kasper rearranged the IlluminaHumanMethylation450kmanifest package; it's more of a pain to use now. In the first release of minfi, there was a nice little example of how to cast the source sequences, strand, and interrogated site from the HM450k manifest into a GRanges object and go from there. Perhaps another project stole all of his time. Anyways, the following should be useful for people using either 'minfi' or 'methylumi'; I would suggest that, for visualization and annotation purposes, a SummarizedExperiment is the logical terminus for processed DNA methylation array data, as it segues more naturally into BSseq, ChIPseq, and RNAseq integration. In order to level the playing ground, I will construct a Ranges object using the annotations at GEO which the TCGA ADF now references. Since I'm refreshing both the 450k.db and probe packages (the latter has a bug in the example script; don't use it), I will put something like the following into the documentation. The bottleneck for refreshing the .db package is the slowness of build checks with db0-type chip-style schemas and my slow learning of how to use a FeatureDb instead. I strongly believe that GRanges are the appropriate data structure for this type of annotation data. Personally I've been combining 450k results with RNAseq and ChIPseq data to look at some (IMHO) rather interesting phenomena, and having the flexibility of a GRanges-based object (a SummarizedExperiment, to be specific) is enormously useful. On the other hand, I haven't found that segmentation-based approaches seem to be very useful for the 450k data, as some fine-scale changes get lost when you compare to BS-seq data in similar cell types. In my experience, it makes more sense to define genomic regions and sweep up changes in those, rather than segmenting DNA methylation, but that's just one data point. So, here goes. require(GEOquery) require(GenomicRanges) require(GenomicFeatures) require(BSgenome.Hsapiens.UCSC.hg19) hm450k <- getGEO('GPL13534') ## Here's a gotcha that you might like to avoid: ## (one of the reasons I'm bothering with a 450k.db update is that I have resolved it) ## anyMissing(as.numeric(hm450k@dataTable@table$MAPINFO)) ## TRUE ## ## exclude these (SNP probes) noMap <- whichis.na(as.numeric(hm450k@dataTable@table$MAPINFO))) hm450 <- with( hm450k@dataTable@table[ -noMap, ], GRanges( paste0('chr', CHR), IRanges(as.numeric(MAPINFO), width=1), probe=Name, probeStrand=Strand, sourceSeq=DNAStringSet(SourceSeq), ) ) names(hm450) <- values(hm450)$probe values(hm450)$probeType = Rle(ifelse(substr(names(hm450),1,2)=='cg', 'CpG', 'CpH')) Now we have something resembling annotation (but not really, for reasons I'll demonstrate later). ## inspect it: head(hm450, 1) GRanges with 1 range and 4 elementMetadata cols: seqnames ranges strand | probe probeStrand sourceSeq probeType <rle> <iranges> <rle> | <character> <character> <dnastringset> <rle> cg00035864 chrY [8553009, 8553009] * | cg00035864 F AGACACTAGCAGTCTTGTCCACATAGACCCTTGAATTTATCTCAAATTCG CpG I propose that you always sanity check your work at each step. Here's one way I typically do it: hm450.subset <- hm450[ sample(names(hm450), 10) ] values(hm450.subset)$siteSeq <- getSeq(Hsapiens, resize(hm450.subset, 2)) show(hm450.subset) ## GRanges with 10 ranges and 5 elementMetadata cols: ## seqnames ranges strand | probe probeStrand ## sourceSeq probeType siteSeq ## <rle> <iranges> <rle> | <character> <character> ## <dnastringset> <rle> <dnastringset> ## ch.X.665616F chrX [ 41078352, 41078352] * | ch.X.665616F F CAGGCTTATGACAACTTAAGCTTGAGTGATCACTTACTAAGAGCAGTACT CpH CA ## cg19794706 chr14 [ 75643232, 75643232] * | cg19794706 R GGAGATGGCAAGGACCAATCTGGGGCCGAGCAGGAACAAAAGCAGCAACG CpG CG ## cg15391651 chr1 [165132273, 165132273] * | cg15391651 F CGGAGCCCTGAGTGTGCACAAAGCACCACTATGCCAGAGTGATGTTATCA CpG CG Suppose you wanted to mask probes whose interrogated site overlaps with repeats in the genome. This is easy: library(parallel) REPEATS <- do.call(c, mclapply(seqlevels(HM450.HG19), function(ch) { GRanges(ch, union(masks(Hsapiens[[ch]])$RM, masks(Hsapiens[[ch]])$TRF)) })) rpts.450 <- subsetByOverlaps(hm450, REPEATS) summary(rpts.450) ## Length Class Mode ## 78164 GRanges S4 The example gets to the point where the Ranges/FeatureDb-based version is; I'll post another in a while, using the distanceToNearest functionality to see how far a given probe is from a TSS, either UCSC or FANTOM- annotated, and assorted other things that use a TxDb object to compute quickly. I am never going to put up a PDF with a "protocol" (the #$%@?!) that I'm not 1000% sure of ever again... I strongly suggest that you take whatever container your data is in, and turn it into a SummarizedExperiment with the hm450 object as the rowData. I believe you will thank me later on, to the point that I've written a generic to do this for MethyLumiSet objects, and could do the same for MethyLumiM and MethylSet objects if there is demand. I would like to patch Gviz to handle SummarizedExperiments soon, too. One thing I have not done here is to mask SNPs, because without MAF information (or, better, SNP information on a subject), I feel like it may be swatting a fly with a Buick to mask all CpG/CpH overlaps with dbSNP regardless of minor allele frequency. You can use the Probe_SNPs and Probe_SNPs_10 columns in the manifest if you want to mask in that fashion against dbSNP build 131, but there are other things that I find wanting in that approach (i.e. you should widen the interrogated sequences appropriately). Anyways... Hope this helps. Basically, if you want to look at the probes on the 450k platform which fall into a given feature or class of features, you assemble a GRanges object matching those data (perhaps ChIPseq peaks, perhaps UCSC tracks, perhaps FeatureDb output) and use subsetByOverlaps with the appropriate arguments to isolate the desired probes in those regions. The trick is finding feature data. --t On Fri, May 4, 2012 at 7:15 AM, Ed Siefker <ebs15242@gmail.com> wrote: > I've run some arrays through minfi and have a table of differentially > methylated positions. I can get the probe sequence from > IlluminaHumanMethylation450kmanifest, but it's not clear to me how to > translate that into genes, promoters, chromosomes or anything > biologically informative. > > What do people usually do at this point? > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD COMMENT • link 12.0 years ago Tim Triche ★ 4.2k

0

Entering edit mode

In the previous example, the code to pull REPEATS will return an error (I had the data structure from the 450k.db rebuild in RAM). The fix is trivial, and anyone following this conversation may have figured it out, but the solution is to use hm450 in place of HM450.HG19. ## was: REPEATS <- do.call(c, mclapply(seqlevels(HM450.HG19), function(ch) { ## ## should instead be the following: ## REPEATS <- do.call(c, mclapply(seqlevels(hm450), function(ch) { GRanges(ch, union(masks(Hsapiens[[ch]])$RM, masks(Hsapiens[[ch]])$TRF)) })) ## ## this step will now work properly: ## rpts.450 <- subsetByOverlaps(hm450, REPEATS) summary(rpts.450) ## ## Length Class Mode ## 78164 GRanges S4 ## A followup: a colleague suggested using the UCSC common SNPs track to mask on a MAF of 1% or greater. This gives me an excuse to use a FeatureDb from UCSC and thereby segue directly into what I have started with the rebuild of the 450k.db package. Unfortunately, it looks like I'm not going to be able to use the handy makeFeatureDbFromUCSC() function to do it, so I'll check back in later with a somewhat clunkier example accomplishing the same thing. Similarly for distance to TSS (annotated by UCSC or otherwise), blah blah. Apologies for the earlier bug. --t On Fri, May 4, 2012 at 1:55 PM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > I'm not really sure why Kasper rearranged > the IlluminaHumanMethylation450kmanifest package; it's more of a pain to > use now. In the first release of minfi, there was a nice little example of > how to cast the source sequences, strand, and interrogated site from the > HM450k manifest into a GRanges object and go from there. Perhaps another > project stole all of his time. Anyways, the following should be useful for > people using either 'minfi' or 'methylumi'; I would suggest that, for > visualization and annotation purposes, a SummarizedExperiment is the > logical terminus for processed DNA methylation array data, as it segues > more naturally into BSseq, ChIPseq, and RNAseq integration. > > In order to level the playing ground, I will construct a Ranges object > using the annotations at GEO which the TCGA ADF now references. > > Since I'm refreshing both the 450k.db and probe packages (the latter has a > bug in the example script; don't use it), I will put something like the > following into the documentation. The bottleneck for refreshing the .db > package is the slowness of build checks with db0-type chip-style schemas > and my slow learning of how to use a FeatureDb instead. I strongly believe > that GRanges are the appropriate data structure for this type of annotation > data. Personally I've been combining 450k results with RNAseq and ChIPseq > data to look at some (IMHO) rather interesting phenomena, and having the > flexibility of a GRanges-based object (a SummarizedExperiment, to be > specific) is enormously useful. On the other hand, I haven't found that > segmentation-based approaches seem to be very useful for the 450k data, as > some fine-scale changes get lost when you compare to BS-seq data in similar > cell types. In my experience, it makes more sense to define genomic > regions and sweep up changes in those, rather than segmenting DNA > methylation, but that's just one data point. So, here goes. > > require(GEOquery) > require(GenomicRanges) > require(GenomicFeatures) > require(BSgenome.Hsapiens.UCSC.hg19) > > hm450k <- getGEO('GPL13534') > > ## Here's a gotcha that you might like to avoid: > ## (one of the reasons I'm bothering with a 450k.db update is that I have > resolved it) > ## > anyMissing(as.numeric(hm450k@dataTable@table$MAPINFO)) > ## TRUE > ## > ## exclude these (SNP probes) > noMap <- whichis.na(as.numeric(hm450k@dataTable@table$MAPINFO))) > hm450 <- with( hm450k@dataTable@table[ -noMap, ], > GRanges( paste0('chr', CHR), > IRanges(as.numeric(MAPINFO), width=1), > probe=Name, > probeStrand=Strand, > sourceSeq=DNAStringSet(SourceSeq), > ) > ) > names(hm450) <- values(hm450)$probe > values(hm450)$probeType = Rle(ifelse(substr(names(hm450),1,2)=='cg', > 'CpG', 'CpH')) > > Now we have something resembling annotation (but not really, for reasons > I'll demonstrate later). > > ## inspect it: > head(hm450, 1) > GRanges with 1 range and 4 elementMetadata cols: > seqnames ranges strand | probe probeStrand > sourceSeq probeType > <rle> <iranges> <rle> | <character> <character> > <dnastringset> <rle> > cg00035864 chrY [8553009, 8553009] * | cg00035864 F > AGACACTAGCAGTCTTGTCCACATAGACCCTTGAATTTATCTCAAATTCG CpG > > I propose that you always sanity check your work at each step. Here's one > way I typically do it: > > hm450.subset <- hm450[ sample(names(hm450), 10) ] > values(hm450.subset)$siteSeq <- getSeq(Hsapiens, resize(hm450.subset, 2)) > show(hm450.subset) > ## GRanges with 10 ranges and 5 elementMetadata cols: > ## seqnames ranges strand | probe > probeStrand ## sourceSeq probeType > siteSeq > ## <rle> <iranges> <rle> | <character> > <character> ## <dnastringset> <rle> > <dnastringset> > ## ch.X.665616F chrX [ 41078352, 41078352] * | ch.X.665616F > F CAGGCTTATGACAACTTAAGCTTGAGTGATCACTTACTAAGAGCAGTACT CpH > CA > ## cg19794706 chr14 [ 75643232, 75643232] * | cg19794706 > R GGAGATGGCAAGGACCAATCTGGGGCCGAGCAGGAACAAAAGCAGCAACG CpG > CG > ## cg15391651 chr1 [165132273, 165132273] * | cg15391651 > F CGGAGCCCTGAGTGTGCACAAAGCACCACTATGCCAGAGTGATGTTATCA CpG > CG > > > Suppose you wanted to mask probes whose interrogated site overlaps with > repeats in the genome. This is easy: > > library(parallel) > REPEATS <- do.call(c, mclapply(seqlevels(HM450.HG19), function(ch) { > GRanges(ch, union(masks(Hsapiens[[ch]])$RM, masks(Hsapiens[[ch]])$TRF)) > })) > rpts.450 <- subsetByOverlaps(hm450, REPEATS) > summary(rpts.450) > ## Length Class Mode > ## 78164 GRanges S4 > > The example gets to the point where the Ranges/FeatureDb-based version is; > I'll post another in a while, using the distanceToNearest functionality to > see how far a given probe is from a TSS, either UCSC or FANTOM- annotated, > and assorted other things that use a TxDb object to compute quickly. I am > never going to put up a PDF with a "protocol" (the #$%@?!) that I'm not > 1000% sure of ever again... > > I strongly suggest that you take whatever container your data is in, and > turn it into a SummarizedExperiment with the hm450 object as the rowData. > I believe you will thank me later on, to the point that I've written a > generic to do this for MethyLumiSet objects, and could do the same for > MethyLumiM and MethylSet objects if there is demand. I would like to patch > Gviz to handle SummarizedExperiments soon, too. > > One thing I have not done here is to mask SNPs, because without MAF > information (or, better, SNP information on a subject), I feel like it may > be swatting a fly with a Buick to mask all CpG/CpH overlaps with dbSNP > regardless of minor allele frequency. You can use the Probe_SNPs > and Probe_SNPs_10 columns in the manifest if you want to mask in that > fashion against dbSNP build 131, but there are other things that I find > wanting in that approach (i.e. you should widen the interrogated sequences > appropriately). Anyways... > > Hope this helps. Basically, if you want to look at the probes on the 450k > platform which fall into a given feature or class of features, you assemble > a GRanges object matching those data (perhaps ChIPseq peaks, perhaps UCSC > tracks, perhaps FeatureDb output) and use subsetByOverlaps with the > appropriate arguments to isolate the desired probes in those regions. The > trick is finding feature data. > > --t > > > > > On Fri, May 4, 2012 at 7:15 AM, Ed Siefker <ebs15242@gmail.com> wrote: > >> I've run some arrays through minfi and have a table of differentially >> methylated positions. I can get the probe sequence from >> IlluminaHumanMethylation450kmanifest, but it's not clear to me how to >> translate that into genes, promoters, chromosomes or anything >> biologically informative. >> >> What do people usually do at this point? >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 12.0 years ago Tim Triche ★ 4.2k

Login before adding your answer.