annotatePeak() in ChIPseeker - how long does it take to run?
3
0
Entering edit mode
Elsa • 0
@elsa-7629
Last seen 8.8 years ago
United Kingdom

Hi,

I have downloaded a chip-seq .bed file from an available geo dataset (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM590111) and would like to use ChIPseeker to plot the Average Profile of ChIP peaks binding to the TSS region of a specific group of genes. However, I cannot seem to pass the annotatePeak() function, since it is running for the past almost 4 days (see below)... and with no signs of being almost finishing the annotation.

What am I doing wrong?

The code I have used is this:

# Load required libraries
library(ChIPseeker)
library(org.Mm.eg.db)
require(TxDb.Mmusculus.UCSC.mm9.knownGene) 

# Use readPeakFile to load the peak and store in GRanges object
sample = readPeakFile("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed.gz")

# Annotate data
txdb = TxDb.Mmusculus.UCSC.mm9.knownGene
sample_ann = annotatePeak(sample, tssRegion=c(-3000, 3000),TxDb=txdb)

These are the messages I am getting (I am running this from linux):

>> preparing features information...         2015-07-23 19:03:40 
>> identifying nearest features...         2015-07-23 19:03:41 
>> calculating distance from peak to TSS...     2015-07-23 19:11:49 
>> assigning genomic annotation...         2015-07-23 19:11:49 

As you can see, this is running since 7pm on the 23rd of July...

Help please!

Thanks!

E.

P.S. Here is the session info:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Mmusculus.UCSC.mm9.knownGene_3.1.2 GenomicFeatures_1.20.1                 
 [3] GenomicRanges_1.20.5                    org.Mm.eg.db_3.1.2                     
 [5] RSQLite_1.0.0                           DBI_0.3.1                              
 [7] AnnotationDbi_1.30.1                    GenomeInfoDb_1.4.1                     
 [9] IRanges_2.2.5                           S4Vectors_0.6.1                        
[11] Biobase_2.28.0                          BiocGenerics_0.14.0                    
[13] ChIPseeker_1.4.3                       
annotatePeak chipseeker chipseq org.Mm.eg.db TxDb.Mmusculus.UCSC.mm9.knownGene • 4.9k views
ADD COMMENT
6
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Hi Guangchuang,

Mmmm... I looked at this and here is what I found.

The timings I get on my machine using Elsa's data:

  • annotatePeak(sample[1:64000], TxDb=txdb) takes 28 sec
  • annotatePeak(sample[1:128000], TxDb=txdb) takes 112 sec
  • annotatePeak(sample[1:256000], TxDb=txdb) takes 429 sec

So it looks like annotatePeak() runs in quadratic time! This means that it will take about 3.5 days to run on Elsa's full sample GRanges object.

After taking a closer look I found that annotatePeak() is spending most of its time in internal utility ChIPseeker:::getFirstHitIndex(), which is defined as follow:

> ChIPseeker:::getFirstHitIndex
function (x) 
{
    sapply(unique(x), function(i) which(x == i)[1])
}

Unfortunately this implementation is very inefficient, not only because it uses a loop, but first of all because it's quadratic in time. And this explains why annotatePeak() itself also runs in quadratic time.

A much more efficient implementation for ChIPseeker:::getFirstHitIndex() is:

> ChIPseeker:::getFirstHitIndex
function (x) 
{
    which(!duplicated(x))
}

This implementation is almost linear in time and so, with this change, annotatePeak() also runs in linear time. After this change, calling annotatePeak() on Elsa's full sample GRanges object takes less than 6 minutes on my machine :-)

H.

 

ADD COMMENT
1
Entering edit mode

Thanks Herve. This is really remarkable. FYI, I add you as a contributor in the author list, see https://github.com/GuangchuangYu/ChIPseeker/commit/afac661613f9a99173c296a76163980fbc1360a0

 

After using the efficient implementation of getFirstHitIndex(), it runs also less than 5min on my computer.

I have commit this new implementation to both release (1.4.6) and devel (1.5.8).

 

Bests,

Guangchuang

 

ADD REPLY
0
Entering edit mode

Awesome. Thanks!  H.

ADD REPLY
0
Entering edit mode

Thanks Herve and Guangchuang for the tests, explanations and new implementations!! You are the best! :)

How long will it take for the new updated version to become available in the "update" packages section of R?

E.

ADD REPLY
0
Entering edit mode

It may take 2 or 3 days.
 

ADD REPLY
0
Entering edit mode

Cool! Thanks!!

ADD REPLY
0
Entering edit mode

already available. you can use biocLite to install the latest version.

ADD REPLY
0
Entering edit mode
Ou, Jianhong ★ 1.3k
@ou-jianhong-4539
Last seen 1 day ago
United States

Hi Elsa,

Just in case you are in a hurry, I post some code here for you to get the annotation in less than 1 hours. Meanwhile, you can wait for Guangchuang's reply. I am sorry that if this will make you fell annoyed.

library(BiocInstaller)
biocLite(c("ChIPpeakAnno", "GEOquery"))
library(org.Mm.eg.db)
library(ChIPpeakAnno)
library(GEOquery)
getGEOSuppFiles("GSM590111")
setwd("GSM590111")
gunzip("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed.gz")
readLines("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed", n=5)
bed <- toGRanges("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed", format="BED", header=FALSE) ## here, if you want to keep the reads itself, you could try library(rtracklayer); bed <- import("GSM590111_E14-serum_H3K4me3-ChIP_Seq.bed");
data(TSS.mouse.NCBIM37) ##mm9
sample_ann <- annotatePeakInBatch(bed, AnnotationData=TSS.mouse.NCBIM37, output="upstream&inside", maxgap=3000, ignore.strand=FALSE)
sample_ann <- sample_ann[!is.na(sample_ann$shortestDistance)]
sample_ann <- sample_ann[sample_ann$shortestDistance <= 3000]
sample_ann <- addGeneIDs(sample_ann, IDs2Add="symbol", orgAnn="org.Mm.eg.db")
head(sample_ann)

 

 

ADD COMMENT
0
Entering edit mode
Guangchuang Yu ★ 1.2k
@guangchuang-yu-5419
Last seen 5 days ago
China/Guangzhou/Southern Medical Univer…

1. If you want to plot average profile, you don't need to run annotatePeak.

2. It takes time to run annotatePeak is expected since the file is large (7084947 genomic coord).

> object.size(sample)
821865648 bytes

If you just want to annotate peaks with nearest gene as solution posted by Jianhong, you can use

sample_ann = annotatePeak(sample, tssRegion=c(-3000, 3000),TxDb=txdb, assignGenomicAnnotation= FALSE).

It runs 1 hour in my computer, similar to what Jianhong did with ChIPpeakAnno.

By default, annotatePeak will also annotate peaks with host information (exon/intron, UTR, promoter, gene downstream or distal intergenic), this will take more time than annotate by nearest gene and it consumed more memory as we stored multiple annotation info (see http://ygc.name/2014/10/01/multiple-annotation-in-chipseeker/ and http://ygc.name/2015/07/28/upsetplot-in-chipseeker/ ).

In this example, as your dataset contains 7084947 records, the matrix that store genomic annotation will have 7084947 rows. It is huge and inefficient to process in R.

 

If you play with such a large dataset, I recommend you randomly sample a subset to test your script before running the whole dataset.

 

BTW: The version you run 1.4.3 has a small bug introduced by a newly added function, seq2gene. The bug had been fixed in 1.4.4. Now the current version is 1.4.5. Please update your ChIPseeker.

 

Bests,

Guangchuang

 

 

ADD COMMENT
0
Entering edit mode

The results of ChIPpeakAnno is not only the nearest, but all the annotations in that region you defined. Note that the output is not "nearest" but "upstream&inside". ChIPpeakAnno is defined for huge data and high efficiency. Just as Guangchuang said, to store multiple annotation info will consume more memories, and this will stuck your computer. I think it is better to call peaks before you feed all the raw reads into annotation tools.

Gook luck in your research.

ADD REPLY
0
Entering edit mode

you are right, peak calling is a critical step.

"upstream&inside" reported by ChIPpeakAnno is not enough and in ChIPseeker we report very detail information, for example Exon (uc002sbe.3/9736, exon 69 of 80) which means that the peak is overlaps with the 69th exon of the 80 exons that transcript uc002sbe.3 possess and the corresponding Entrez gene ID is 9736.


 

ADD REPLY

Login before adding your answer.

Traffic: 488 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6