Question

annotatePeak() in ChIPseeker - how long does it take to run?

0

Entering edit mode

Elsa • 0

@elsa-7629

Last seen 9.7 years ago

United Kingdom

Hi,

I have downloaded a chip-seq .bed file from an available geo dataset (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM590111) and would like to use ChIPseeker to plot the Average Profile of ChIP peaks binding to the TSS region of a specific group of genes. However, I cannot seem to pass the annotatePeak() function, since it is running for the past almost 4 days (see below)... and with no signs of being almost finishing the annotation.

What am I doing wrong?

The code I have used is this:

# Load required libraries
library(ChIPseeker)
library(org.Mm.eg.db)
require(TxDb.Mmusculus.UCSC.mm9.knownGene) 

# Use readPeakFile to load the peak and store in GRanges object
sample = readPeakFile("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed.gz")

# Annotate data
txdb = TxDb.Mmusculus.UCSC.mm9.knownGene
sample_ann = annotatePeak(sample, tssRegion=c(-3000, 3000),TxDb=txdb)

These are the messages I am getting (I am running this from linux):

>> preparing features information...         2015-07-23 19:03:40 
>> identifying nearest features...         2015-07-23 19:03:41 
>> calculating distance from peak to TSS...     2015-07-23 19:11:49 
>> assigning genomic annotation...         2015-07-23 19:11:49

As you can see, this is running since 7pm on the 23rd of July...

Help please!

Thanks!

E.

P.S. Here is the session info:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Mmusculus.UCSC.mm9.knownGene_3.1.2 GenomicFeatures_1.20.1                 
 [3] GenomicRanges_1.20.5                    org.Mm.eg.db_3.1.2                     
 [5] RSQLite_1.0.0                           DBI_0.3.1                              
 [7] AnnotationDbi_1.30.1                    GenomeInfoDb_1.4.1                     
 [9] IRanges_2.2.5                           S4Vectors_0.6.1                        
[11] Biobase_2.28.0                          BiocGenerics_0.14.0                    
[13] ChIPseeker_1.4.3

annotatePeak chipseeker chipseq org.Mm.eg.db TxDb.Mmusculus.UCSC.mm9.knownGene • 5.8k views

ADD COMMENT • link updated 9.5 years ago by Martin Morgan 25k • written 9.7 years ago by Elsa • 0

score 6 · Answer 1 · 2015-07-29

6

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi Guangchuang,

Mmmm... I looked at this and here is what I found.

The timings I get on my machine using Elsa's data:

annotatePeak(sample[1:64000], TxDb=txdb) takes 28 sec
annotatePeak(sample[1:128000], TxDb=txdb) takes 112 sec
annotatePeak(sample[1:256000], TxDb=txdb) takes 429 sec

So it looks like annotatePeak() runs in quadratic time! This means that it will take about 3.5 days to run on Elsa's full sample GRanges object.

After taking a closer look I found that annotatePeak() is spending most of its time in internal utility ChIPseeker:::getFirstHitIndex(), which is defined as follow:

> ChIPseeker:::getFirstHitIndex
function (x) 
{
    sapply(unique(x), function(i) which(x == i)[1])
}

Unfortunately this implementation is very inefficient, not only because it uses a loop, but first of all because it's quadratic in time. And this explains why annotatePeak() itself also runs in quadratic time.

A much more efficient implementation for ChIPseeker:::getFirstHitIndex() is:

> ChIPseeker:::getFirstHitIndex
function (x) 
{
    which(!duplicated(x))
}

This implementation is almost linear in time and so, with this change, annotatePeak() also runs in linear time. After this change, calling annotatePeak() on Elsa's full sample GRanges object takes less than 6 minutes on my machine :-)

H.

ADD COMMENT • link 9.7 years ago Hervé Pagès 16k

1

Entering edit mode

Thanks Herve. This is really remarkable. FYI, I add you as a contributor in the author list, see https://github.com/GuangchuangYu/ChIPseeker/commit/afac661613f9a99173c296a76163980fbc1360a0

After using the efficient implementation of getFirstHitIndex(), it runs also less than 5min on my computer.

I have commit this new implementation to both release (1.4.6) and devel (1.5.8).

Bests,

Guangchuang

ADD REPLY • link 9.7 years ago Guangchuang Yu ★ 1.2k

0

Entering edit mode

Awesome. Thanks! H.

ADD REPLY • link 9.7 years ago Hervé Pagès 16k

0

Entering edit mode

Thanks Herve and Guangchuang for the tests, explanations and new implementations!! You are the best! :)

How long will it take for the new updated version to become available in the "update" packages section of R?

E.

ADD REPLY • link 9.7 years ago Elsa • 0

0

Entering edit mode

It may take 2 or 3 days.

ADD REPLY • link 9.7 years ago Guangchuang Yu ★ 1.2k

0

Entering edit mode

Cool! Thanks!!

ADD REPLY • link 9.7 years ago Elsa • 0

0

Entering edit mode

already available. you can use biocLite to install the latest version.

ADD REPLY • link 9.7 years ago Guangchuang Yu ★ 1.2k

score 0 · Answer 2 · 2015-07-27

Hi Elsa,

Just in case you are in a hurry, I post some code here for you to get the annotation in less than 1 hours. Meanwhile, you can wait for Guangchuang's reply. I am sorry that if this will make you fell annoyed.

library(BiocInstaller)
biocLite(c("ChIPpeakAnno", "GEOquery"))
library(org.Mm.eg.db)
library(ChIPpeakAnno)
library(GEOquery)
getGEOSuppFiles("GSM590111")
setwd("GSM590111")
gunzip("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed.gz")
readLines("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed", n=5)
bed <- toGRanges("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed", format="BED", header=FALSE) ## here, if you want to keep the reads itself, you could try library(rtracklayer); bed <- import("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed");
data(TSS.mouse.NCBIM37) ##mm9
sample_ann <- annotatePeakInBatch(bed, AnnotationData=TSS.mouse.NCBIM37, output="upstream&inside", maxgap=3000, ignore.strand=FALSE)
sample_ann <- sample_ann[!is.na(sample_ann$shortestDistance)]
sample_ann <- sample_ann[sample_ann$shortestDistance <= 3000]
sample_ann <- addGeneIDs(sample_ann, IDs2Add="symbol", orgAnn="org.Mm.eg.db")
head(sample_ann)

score 0 · Answer 3 · 2015-07-28

1. If you want to plot average profile, you don't need to run annotatePeak.

2. It takes time to run annotatePeak is expected since the file is large (7084947 genomic coord).

> object.size(sample)
821865648 bytes

If you just want to annotate peaks with nearest gene as solution posted by Jianhong, you can use

sample_ann = annotatePeak(sample, tssRegion=c(-3000, 3000),TxDb=txdb, assignGenomicAnnotation= FALSE).

It runs 1 hour in my computer, similar to what Jianhong did with ChIPpeakAnno.

By default, annotatePeak will also annotate peaks with host information (exon/intron, UTR, promoter, gene downstream or distal intergenic), this will take more time than annotate by nearest gene and it consumed more memory as we stored multiple annotation info (see http://ygc.name/2014/10/01/multiple-annotation-in-chipseeker/ and http://ygc.name/2015/07/28/upsetplot-in-chipseeker/ ).

In this example, as your dataset contains 7084947 records, the matrix that store genomic annotation will have 7084947 rows. It is huge and inefficient to process in R.

If you play with such a large dataset, I recommend you randomly sample a subset to test your script before running the whole dataset.

BTW: The version you run 1.4.3 has a small bug introduced by a newly added function, seq2gene. The bug had been fixed in 1.4.4. Now the current version is 1.4.5. Please update your ChIPseeker.

Bests,

Guangchuang