Question: annotatePeak() in ChIPseeker - how long does it take to run?
0
gravatar for Elsa
4.0 years ago by
Elsa0
United Kingdom
Elsa0 wrote:

Hi,

I have downloaded a chip-seq .bed file from an available geo dataset (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM590111) and would like to use ChIPseeker to plot the Average Profile of ChIP peaks binding to the TSS region of a specific group of genes. However, I cannot seem to pass the annotatePeak() function, since it is running for the past almost 4 days (see below)... and with no signs of being almost finishing the annotation.

What am I doing wrong?

The code I have used is this:

# Load required libraries
library(ChIPseeker)
library(org.Mm.eg.db)
require(TxDb.Mmusculus.UCSC.mm9.knownGene) 

# Use readPeakFile to load the peak and store in GRanges object
sample = readPeakFile("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed.gz")

# Annotate data
txdb = TxDb.Mmusculus.UCSC.mm9.knownGene
sample_ann = annotatePeak(sample, tssRegion=c(-3000, 3000),TxDb=txdb)

These are the messages I am getting (I am running this from linux):

>> preparing features information...         2015-07-23 19:03:40 
>> identifying nearest features...         2015-07-23 19:03:41 
>> calculating distance from peak to TSS...     2015-07-23 19:11:49 
>> assigning genomic annotation...         2015-07-23 19:11:49 

As you can see, this is running since 7pm on the 23rd of July...

Help please!

Thanks!

E.

P.S. Here is the session info:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Mmusculus.UCSC.mm9.knownGene_3.1.2 GenomicFeatures_1.20.1                 
 [3] GenomicRanges_1.20.5                    org.Mm.eg.db_3.1.2                     
 [5] RSQLite_1.0.0                           DBI_0.3.1                              
 [7] AnnotationDbi_1.30.1                    GenomeInfoDb_1.4.1                     
 [9] IRanges_2.2.5                           S4Vectors_0.6.1                        
[11] Biobase_2.28.0                          BiocGenerics_0.14.0                    
[13] ChIPseeker_1.4.3                       
ADD COMMENTlink modified 3.8 years ago by Martin Morgan ♦♦ 23k • written 4.0 years ago by Elsa0
Answer: annotatePeak() in ChIPseeker - how long does it take to run?
6
gravatar for Hervé Pagès
4.0 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Guangchuang,

Mmmm... I looked at this and here is what I found.

The timings I get on my machine using Elsa's data:

  • annotatePeak(sample[1:64000], TxDb=txdb) takes 28 sec
  • annotatePeak(sample[1:128000], TxDb=txdb) takes 112 sec
  • annotatePeak(sample[1:256000], TxDb=txdb) takes 429 sec

So it looks like annotatePeak() runs in quadratic time! This means that it will take about 3.5 days to run on Elsa's full sample GRanges object.

After taking a closer look I found that annotatePeak() is spending most of its time in internal utility ChIPseeker:::getFirstHitIndex(), which is defined as follow:

> ChIPseeker:::getFirstHitIndex
function (x) 
{
    sapply(unique(x), function(i) which(x == i)[1])
}

Unfortunately this implementation is very inefficient, not only because it uses a loop, but first of all because it's quadratic in time. And this explains why annotatePeak() itself also runs in quadratic time.

A much more efficient implementation for ChIPseeker:::getFirstHitIndex() is:

> ChIPseeker:::getFirstHitIndex
function (x) 
{
    which(!duplicated(x))
}

This implementation is almost linear in time and so, with this change, annotatePeak() also runs in linear time. After this change, calling annotatePeak() on Elsa's full sample GRanges object takes less than 6 minutes on my machine :-)

H.

 

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Hervé Pagès ♦♦ 14k
1

Thanks Herve. This is really remarkable. FYI, I add you as a contributor in the author list, see https://github.com/GuangchuangYu/ChIPseeker/commit/afac661613f9a99173c296a76163980fbc1360a0

 

After using the efficient implementation of getFirstHitIndex(), it runs also less than 5min on my computer.

I have commit this new implementation to both release (1.4.6) and devel (1.5.8).

 

Bests,

Guangchuang

 

ADD REPLYlink written 4.0 years ago by Guangchuang Yu1.1k

Awesome. Thanks!  H.

ADD REPLYlink written 4.0 years ago by Hervé Pagès ♦♦ 14k

Thanks Herve and Guangchuang for the tests, explanations and new implementations!! You are the best! :)

How long will it take for the new updated version to become available in the "update" packages section of R?

E.

ADD REPLYlink written 4.0 years ago by Elsa0

It may take 2 or 3 days.
 

ADD REPLYlink written 4.0 years ago by Guangchuang Yu1.1k

Cool! Thanks!!

ADD REPLYlink written 4.0 years ago by Elsa0

already available. you can use biocLite to install the latest version.

ADD REPLYlink written 4.0 years ago by Guangchuang Yu1.1k
Answer: annotatePeak() in ChIPseeker - how long does it take to run?
0
gravatar for Ou, Jianhong
4.0 years ago by
Ou, Jianhong1.1k
United States
Ou, Jianhong1.1k wrote:

Hi Elsa,

Just in case you are in a hurry, I post some code here for you to get the annotation in less than 1 hours. Meanwhile, you can wait for Guangchuang's reply. I am sorry that if this will make you fell annoyed.

library(BiocInstaller)
biocLite(c("ChIPpeakAnno", "GEOquery"))
library(org.Mm.eg.db)
library(ChIPpeakAnno)
library(GEOquery)
getGEOSuppFiles("GSM590111")
setwd("GSM590111")
gunzip("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed.gz")
readLines("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed", n=5)
bed <- toGRanges("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed", format="BED", header=FALSE) ## here, if you want to keep the reads itself, you could try library(rtracklayer); bed <- import("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed");
data(TSS.mouse.NCBIM37) ##mm9
sample_ann <- annotatePeakInBatch(bed, AnnotationData=TSS.mouse.NCBIM37, output="upstream&inside", maxgap=3000, ignore.strand=FALSE)
sample_ann <- sample_ann[!is.na(sample_ann$shortestDistance)]
sample_ann <- sample_ann[sample_ann$shortestDistance <= 3000]
sample_ann <- addGeneIDs(sample_ann, IDs2Add="symbol", orgAnn="org.Mm.eg.db")
head(sample_ann)

 

 

ADD COMMENTlink written 4.0 years ago by Ou, Jianhong1.1k
Answer: annotatePeak() in ChIPseeker - how long does it take to run?
0
gravatar for Guangchuang Yu
4.0 years ago by
Guangchuang Yu1.1k
China/Guangzhou/Southern Medical University
Guangchuang Yu1.1k wrote:

1. If you want to plot average profile, you don't need to run annotatePeak.

2. It takes time to run annotatePeak is expected since the file is large (7084947 genomic coord).

> object.size(sample)
821865648 bytes

If you just want to annotate peaks with nearest gene as solution posted by Jianhong, you can use

sample_ann = annotatePeak(sample, tssRegion=c(-3000, 3000),TxDb=txdb, assignGenomicAnnotation= FALSE).

It runs 1 hour in my computer, similar to what Jianhong did with ChIPpeakAnno.

By default, annotatePeak will also annotate peaks with host information (exon/intron, UTR, promoter, gene downstream or distal intergenic), this will take more time than annotate by nearest gene and it consumed more memory as we stored multiple annotation info (see http://ygc.name/2014/10/01/multiple-annotation-in-chipseeker/ and http://ygc.name/2015/07/28/upsetplot-in-chipseeker/ ).

In this example, as your dataset contains 7084947 records, the matrix that store genomic annotation will have 7084947 rows. It is huge and inefficient to process in R.

 

If you play with such a large dataset, I recommend you randomly sample a subset to test your script before running the whole dataset.

 

BTW: The version you run 1.4.3 has a small bug introduced by a newly added function, seq2gene. The bug had been fixed in 1.4.4. Now the current version is 1.4.5. Please update your ChIPseeker.

 

Bests,

Guangchuang

 

 

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Guangchuang Yu1.1k

The results of ChIPpeakAnno is not only the nearest, but all the annotations in that region you defined. Note that the output is not "nearest" but "upstream&inside". ChIPpeakAnno is defined for huge data and high efficiency. Just as Guangchuang said, to store multiple annotation info will consume more memories, and this will stuck your computer. I think it is better to call peaks before you feed all the raw reads into annotation tools.

Gook luck in your research.

ADD REPLYlink written 4.0 years ago by Ou, Jianhong1.1k

you are right, peak calling is a critical step.

"upstream&inside" reported by ChIPpeakAnno is not enough and in ChIPseeker we report very detail information, for example Exon (uc002sbe.3/9736, exon 69 of 80) which means that the peak is overlaps with the 69th exon of the 80 exons that transcript uc002sbe.3 possess and the corresponding Entrez gene ID is 9736.


 

ADD REPLYlink written 4.0 years ago by Guangchuang Yu1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 268 users visited in the last hour