Question

How to link a methylation probe ID from UCSC (hg19, cg07790169) to an Ensembl gene ID (goseq procedure)

dusan.petrovic • 0

@dusanpetrovic-15081

Last seen 7.9 years ago

Switzerland

Hello everyone,

I am new to this forum, so I apologize in advance if this post does not follow the usual conventions.

I am using the goseq package from Bioconductor to perform enrichment analysis on markers that are differentially methylated (limma output). These markers were measured as CpG probes on the Illumina 450K array (e.g. cg07790169), and I am looking them up on the UCSC website. I am new to goseq, so I do not know it in detail, but my understanding is that it uses Ensembl gene IDs (e.g. ENSG00....) rather than UCSC gene identifiers.

Therefore, I would like to know whether there is a package that links a methylation probe (e.g. cg07790169) to an Ensembl gene ID (e.g. ENSG00...) in a straightforward way, so that I can run goseq properly.

I could do it by copy and paste, but that would be extremely time-consuming (I have several hundred methylation markers).

Thank you for your kind help and understanding

goseq methylation ensembl ucsc limma • 2.3k views

ADD COMMENT • link 7.9 years ago dusan.petrovic • 0

score 0 · Answer 1 · 2018-02-22

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 2 days ago

United States

The main use case for goseq is to perform GO hypergeometric testing with bias adjustments for gene length, when using RNA-Seq data. You aren't doing that, so why are you using goseq? There is no length bias inherent in the measurements from the Illumina 450K platform, nor does that platform give any measurements that are readily converted to gene expression.

I suppose you could naively attribute differentially methylated CpG islands to the nearest gene, and infer that the given gene is thus differentially expressed, but at that point you should be using something like GOstats or topGO, because your measurements don't have any length bias.
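For readers unfamiliar with the test being discussed, here is a minimal sketch of a plain (unadjusted) hypergeometric over-representation test in base R. All counts are toy numbers invented for illustration, and this is not the exact procedure GOstats or topGO run internally (topGO in particular offers several algorithms); it only shows the basic calculation that goseq then adjusts for bias.

```r
# Toy numbers, invented purely for illustration.
universe_size <- 20000  # genes in the tested universe
term_size     <- 200    # genes annotated to one GO term
selected      <- 500    # genes flagged as "significant"
overlap       <- 12     # flagged genes that carry the GO term

# P(X >= overlap) when drawing `selected` genes without replacement
# from a universe containing `term_size` annotated genes.
p_hyper <- phyper(overlap - 1, term_size, universe_size - term_size,
                  selected, lower.tail = FALSE)

# The same one-sided test written as a 2x2 Fisher's exact test.
tab <- matrix(c(overlap,
                term_size - overlap,
                selected - overlap,
                universe_size - term_size - selected + overlap),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(hypergeometric = p_hyper, fisher = p_fisher))
```

Both calls assume every gene is equally likely to be selected, which is the assumption goseq relaxes.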

You don't say how you analyzed your Illumina data, but do note that the FDb.InfiniumMethylation.hg19 package is intended to provide genomic annotation (e.g., chromosomal positions) for all of the probes on that array, and you could use that in concert with the TxDb.Hsapiens.UCSC.hg19.knownGene package to find the nearest gene for each CpG.

Anyway, you are trying to do some fairly non-standard stuff, which by definition means you will be pretty much on your own. This support site is really intended to help people with questions that are readily answered, whereas you appear to have pitched off into the deep end of the pool. If you are willing to do quite a bit of reading, you should be able to figure out what you need to do. Otherwise I would highly recommend finding somebody local with relevant experience.
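A rough sketch of that nearest-gene approach, extended to Ensembl IDs as the question asks, might look like the following. It assumes the three annotation packages are installed; the object names (sig_cpgs, cpg2gene, and so on) are placeholders, and using org.Hs.eg.db for the Entrez-to-Ensembl step is my own assumption rather than something stated in this thread. Nearest-gene assignment is a simplification, so treat the result as a starting point.

```r
suppressPackageStartupMessages({
  library(FDb.InfiniumMethylation.hg19)       # 450K probe coordinates (hg19)
  library(TxDb.Hsapiens.UCSC.hg19.knownGene)  # UCSC knownGene models (hg19)
  library(org.Hs.eg.db)                       # Entrez <-> Ensembl mapping
})

# Placeholder input: replace with the significant probes from limma.
sig_cpgs <- c("cg07790169")

# GRanges of all 450K probes, named by probe ID; keep only known IDs.
hm450  <- get450k()
probes <- hm450[intersect(sig_cpgs, names(hm450))]

# One range per gene; the gene_id column holds Entrez Gene IDs.
gene_ranges <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

# Index of the nearest gene for each probe (NA if none is found).
idx    <- nearest(probes, gene_ranges, ignore.strand = TRUE)
entrez <- as.character(gene_ranges$gene_id)[idx]

# Entrez -> Ensembl; multiVals = "first" keeps a single Ensembl ID per gene.
id_map <- mapIds(org.Hs.eg.db,
                 keys      = unique(entrez[!is.na(entrez)]),
                 column    = "ENSEMBL",
                 keytype   = "ENTREZID",
                 multiVals = "first")

cpg2gene <- data.frame(
  probe   = names(probes),
  entrez  = entrez,
  ensembl = unname(id_map[entrez]),
  stringsAsFactors = FALSE
)
print(cpg2gene)
```

Some Entrez genes map to several Ensembl IDs or to none, so expect a few NA values or arbitrary choices in the ensembl column.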

ADD COMMENT • link 7.9 years ago James W. MacDonald 68k

And re: required reading, I would recommend the help pages for the FDb.InfiniumMethylation.hg19 package, particularly the getNearest family of functions (getNearestGene, getNearestTSS, getNearestTranscript), which should be relevant.

ADD REPLY • link 7.9 years ago James W. MacDonald 68k

James, there is some literature on using goseq for DNA methylation data, and there is in fact a Bioconductor package called 'missMethyl' with a function called 'gometh', which was specifically developed to apply goseq-style methods to the Illumina 450K platform. The idea is that genes with different numbers of CpG probes are a priori more or less likely to appear in the DMR gene set, so gometh helps correct for that bias.

References:

Gene-set analysis is severely biased when applied to genome-wide methylation data

https://academic.oup.com/bioinformatics/article/29/15/1851/265573

missMethyl

https://bioconductor.org/packages/release/bioc/html/missMethyl.html
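A minimal sketch of calling gometh on limma output, assuming missMethyl and its 450K annotation dependencies are installed. The helper name run_gometh and the input tt are hypothetical: tt is assumed to be the full limma::topTable(fit, number = Inf) result with probe IDs as row names. As far as I know, gometh maps probes to genes internally, so no separate Ensembl lookup is needed for this route.

```r
library(missMethyl)

# Hypothetical helper, not part of missMethyl.
# 'tt' is assumed to be the full limma::topTable(fit, number = Inf) output,
# with 450K probe IDs (e.g. "cg07790169") as row names.
run_gometh <- function(tt, fdr_cutoff = 0.05) {
  sig_cpgs <- rownames(tt)[tt$adj.P.Val < fdr_cutoff]  # significant probes
  all_cpgs <- rownames(tt)                             # background: all tested probes

  res <- gometh(sig.cpg    = sig_cpgs,
                all.cpg    = all_cpgs,
                collection = "GO",
                array.type = "450K",
                prior.prob = TRUE)  # adjust for the number of probes per gene

  # Sort GO terms by their enrichment p-value.
  res[order(res$P.DE), ]
}
```

Usage would be something like head(run_gometh(tt)), where tt comes from your own limma fit.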

ADD REPLY • link 7.7 years ago bmreilly • 0

Sure. But that corrects for the fact that there may be more or fewer CpGs on an Illumina array that are sufficiently close to a given gene, which may or may not have anything to do with the length of the gene, which is what goseq is concerned with. Or am I missing something?

ADD REPLY • link 7.7 years ago James W. MacDonald 68k

score 0 · Answer 2 · 2018-03-05

dusan.petrovic • 0

Hi James,

Thank you for your reply. I switched to topGO following your advice.

Best

Dusan

ADD COMMENT • link 7.9 years ago dusan.petrovic • 0

Dusan, see my comment above on James's post. There are R packages that you will likely find useful for analyzing 450K data; see 'missMethyl'.

ADD REPLY • link 7.7 years ago bmreilly • 0