Extracting overlapping gene names from a list of peaks

0

Entering edit mode

Patrick Schorderet ▴ 20

@patrick-schorderet-6081

Last seen 9.0 years ago

United States

Dear all, I have a list of peaks from ChIPseq experiments. Now I am trying to find to over which genes these peaks overlap (and extract the gene name). I'm sure this should be pretty easy, but I am just starting with bioconductor, so some concepts are still vague. Here's what I did so far: # Not run because it is installed (as well as other packages) # source("http://bioconductor.org/biocLite.R") # biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene") # Load the Dmelanogaster genome library(TxDb.Dmelanogaster.UCSC.dm3.ensGene, quietly = TRUE) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene ee <- exonsBy(txdb, "gene") # Load an subset of my peaks for the sake of the example finalPeaks <- rbind(c("chr3R", "2788500", "2842850", "2815675", "54350"), (c("chr3R", "12484350", "12661350", "12572850", "177000"))) rownames(finalPeaks) <- c("Peak 1", "Peak 2") colnames(finalPeaks) <- c("chr", "start", "end", "center", "length") # Create a GRanges object with the file I have # finalPeaks is a matrix with rows being individual peaks and columns <- c() GRfinalPeaks <- GRanges(finalPeaks[,1], IRanges(start = as.numeric(finalPeaks[,2]), end = as.numeric(finalPeaks[,3]))) I'm stuck from here. What I'd like is to get, for each peak, the overlapping genes. For example, an output that would be: Peak1: GeneA, GeneB, GeneC, etc Peak2: GeneD Also, I don't know how simple it is because (as specified in the output example) one peak can overlap several genes or none at all.. Thanks for any help, Patrick [[alternative HTML version deleted]]

ChIPSeq chipseq ChIPSeq chipseq • 1.4k views

ADD COMMENT • link updated 10.7 years ago by Michael Lawrence ★ 11k • written 10.7 years ago by Patrick Schorderet ▴ 20

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 day ago

United States

Hi Patrick, On 8/6/2013 11:48 AM, Patrick Schorderet wrote: > Dear all, > > I have a list of peaks from ChIPseq experiments. Now I am trying to find to over which genes these peaks overlap (and extract the gene name). > I'm sure this should be pretty easy, but I am just starting with bioconductor, so some concepts are still vague. > Here's what I did so far: > > # Not run because it is installed (as well as other packages) > # source("http://bioconductor.org/biocLite.R") > # biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene") > # Load the Dmelanogaster genome > library(TxDb.Dmelanogaster.UCSC.dm3.ensGene, quietly = TRUE) > txdb<- TxDb.Dmelanogaster.UCSC.dm3.ensGene > ee<- exonsBy(txdb, "gene") > > # Load an subset of my peaks for the sake of the example > finalPeaks<- rbind(c("chr3R", "2788500", "2842850", "2815675", "54350"), (c("chr3R", "12484350", "12661350", "12572850", "177000"))) > rownames(finalPeaks)<- c("Peak 1", "Peak 2") > colnames(finalPeaks)<- c("chr", "start", "end", "center", "length") > > # Create a GRanges object with the file I have > # finalPeaks is a matrix with rows being individual peaks and columns<- c() > GRfinalPeaks<- GRanges(finalPeaks[,1], IRanges(start = as.numeric(finalPeaks[,2]), end = as.numeric(finalPeaks[,3]))) > > I'm stuck from here. What I'd like is to get, for each peak, the overlapping genes. > For example, an output that would be: > > Peak1: GeneA, GeneB, GeneC, etc > Peak2: GeneD > > Also, I don't know how simple it is because (as specified in the output example) one peak can overlap several genes or none at all.. > Thanks for any help, > sapply(1:2, function(x) names(ee[ee %over% GRfinalPeaks[x,],])) [[1]] [1] "FBgn0260642" [[2]] [1] "FBgn0000014" "FBgn0003944" "FBgn0015230" "FBgn0020556" "FBgn0051498" [6] "FBgn0063261" "FBgn0084245" "FBgn0084688" "FBgn0085056" You can then coerce that to whatever form you like. Best, Jim > > Patrick > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.7 years ago James W. MacDonald 65k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

You can efficiently get a list like this: hits <- findOverlaps(GRfinalPeaks, ee) listOfGenesByPeak <- split(names(ee)[subjectHits(hits)], queryHits(hits)) But maybe what you want is a long-form table: DataFrame(peak = queryHits(hits), gene = names(ee)[subjectHits(hits)]) On Tue, Aug 6, 2013 at 8:48 AM, Patrick Schorderet < patrick.schorderet@molbio.mgh.harvard.edu> wrote: > Dear all, > > I have a list of peaks from ChIPseq experiments. Now I am trying to find > to over which genes these peaks overlap (and extract the gene name). > I'm sure this should be pretty easy, but I am just starting with > bioconductor, so some concepts are still vague. > Here's what I did so far: > > # Not run because it is installed (as well as other packages) > # source("http://bioconductor.org/biocLite.R") > # biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene") > # Load the Dmelanogaster genome > library(TxDb.Dmelanogaster.UCSC.dm3.ensGene, quietly = TRUE) > txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene > ee <- exonsBy(txdb, "gene") > > # Load an subset of my peaks for the sake of the example > finalPeaks <- rbind(c("chr3R", "2788500", "2842850", "2815675", "54350"), > (c("chr3R", "12484350", "12661350", "12572850", "177000"))) > rownames(finalPeaks) <- c("Peak 1", "Peak 2") > colnames(finalPeaks) <- c("chr", "start", "end", "center", "length") > > # Create a GRanges object with the file I have > # finalPeaks is a matrix with rows being individual peaks and columns <- > c() > GRfinalPeaks <- GRanges(finalPeaks[,1], IRanges(start = > as.numeric(finalPeaks[,2]), end = as.numeric(finalPeaks[,3]))) > > I'm stuck from here. What I'd like is to get, for each peak, the > overlapping genes. > For example, an output that would be: > > Peak1: GeneA, GeneB, GeneC, etc > Peak2: GeneD > > Also, I don't know how simple it is because (as specified in the output > example) one peak can overlap several genes or none at all.. > Thanks for any help, > > Patrick > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 10.7 years ago Michael Lawrence ★ 11k

Login before adding your answer.