Question: CRISPRseek package: be aware that as of Oct 2015, CRISPRseek works best for regions < 200kb
0
2.8 years ago by
alexgraehl10
United States
alexgraehl10 wrote:

Hi,

I have been using CRISPRseek to attempt to identify sgRNA targets in a large (2 megabase) region of hg19.

I was surprised to find that CRISPRseek ran in 4 hours if I split the region up into 20 x 100,000 bp regions, but failed to run even after 4 days on the full 2 megabase region.

It seems to work well for small regions, but if you are looking for sgRNA sites in a larger region, CRISPRseek probably uses some internal R data structure whose running time does not scale linearly with the size of the region.

If you are trying to find sgRNAs in a very long window, you will most likely need to find another solution, or a "hack" solution like chopping the regions up into smaller ones.

See below for sample code. No files are necessary to run it; the sample code generates its own 3 megabase test file.

modified 2.7 years ago • written 2.8 years ago by alexgraehl10
1

A much more helpful report would be to provide a short reproducible example that illustrates the problem. Then the author of the package can address the issue, and both you and the rest of the community will benefit. An example of this is discussed in the recent Bioconductor newsletter, where a similar report led to a single-line edit and a change in execution time from an estimated 3.5 days to 6 minutes. How far down this path can you take this question? Can you identify the bottleneck and suggest a solution?

1
2.8 years ago by
alexgraehl10
United States
alexgraehl10 wrote:

Good idea! So here is the code I ran for CRISPRseek. This code generates a fake 3 megabase fasta file for testing, so you don't need any other files.

If you break up the 3 megabase sequence into much smaller sequences, then the program actually will finish in a few hours. So it isn't just an issue of there being too much sequence; it's something about having a large sequence in a SINGLE fasta record that makes it slow.

Test code below. This should run if CRISPRseek is installed (but will take multiple days to finish):

if (!require("CRISPRseek")) {
    print("You need to install CRISPRseek via Bioconductor... see the commented-out code in the source")
    # source("http://bioconductor.org/biocLite.R"); biocLite("CRISPRseek")
}

if (!require(BSgenome.Hsapiens.UCSC.hg19)) {
    print("You should install the hg19 BSgenome data via Bioconductor. See the comments in the source!")
    # source("http://bioconductor.org/biocLite.R"); biocLite("BSgenome.Hsapiens.UCSC.hg19")
}

if (!require(TxDb.Hsapiens.UCSC.hg19.knownGene)) {
    print("You should install TxDb for your species also!")
    # source("http://bioconductor.org/biocLite.R"); biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")
}

if (!require(org.Hs.eg.db)) {
    print("You should install org.Hs.eg.db !")
    # source("http://bioconductor.org/biocLite.R"); biocLite("org.Hs.eg.db")
}

# Make a FAKE fasta file for testing, then use it
FAKE_FASTA_FILE_NAME <- "test.fasta"
MAKE_FAKE_TEST <- TRUE
if (MAKE_FAKE_TEST) {
    FA <- FAKE_FASTA_FILE_NAME
    set.seed(12345)
    fc <- file(FAKE_FASTA_FILE_NAME)
    lineLen  <- 100
    numLines <- 4  # <-- change THIS number to change the number of lines generated.
                   #     Set it to 3 to run super quickly, or to 30000 to take many many days.
    print(paste("Generating a total of this many bases:", numLines * lineLen))
    fakeDNA <- character(numLines)
    for (i in 1:numLines) {
        fakeDNA[i] <- paste(sample(c("A", "C", "G", "T"), lineLen, replace = TRUE), collapse = "")
    }
    writeLines(c(">test_fake_3_megabase_region", fakeDNA), fc)
    close(fc)
    print("Done creating a fake fasta file for testing.")
}

OUTDIR <- paste(FA, ".out.dir", sep = "")
dir.create(OUTDIR)

print("The function below will take DAYS to run and will still not generate any output if the input file is large (2+ megabases).")
# Calling offTargetAnalysis with chromToSearch="" results in a quick gRNA search
# without performing on-target and off-target analysis.
results <- CRISPRseek::offTargetAnalysis(inputFilePath = FA,
                                         findgRNAsWithREcutOnly = FALSE,
                                         findPairedgRNAOnly = FALSE,
                                         findgRNAs = TRUE,
                                         chromToSearch = "chr1",  # just to make it faster; set to "all" to check all chromosomes
                                         BSgenomeName = Hsapiens,
                                         txdb = TxDb.Hsapiens.UCSC.hg19.knownGene,
                                         orgAnn = org.Hs.egSYMBOL,
                                         PAM.size = 3, PAM = "NGG",
                                         gRNA.size = 20,
                                         upstream = 0, downstream = 0,  # default is 200
                                         max.mismatch = 1,
                                         outputDir = OUTDIR, overwrite = TRUE)

0
2.8 years ago by
Julie Zhu3.8k
United States
Julie Zhu3.8k wrote:

Alex,

Thanks for sharing your testing! It turns out that the sequence length is not the culprit; it depends on the actual input sequence. It will be much faster for real repeat-masked genomic sequence. The fake sequence you generated has lots of GG, so there are lots of gRNAs one after another, and annotating them with paired configuration and searching for off-targets for each of the gRNAs takes a long time. In general, I suggest using repeat-masked genomic sequence. I also agree with your approach of breaking a super long sequence into short sequences, and perhaps submitting multiple jobs in a cluster computing environment.

Thanks again for taking the time to test the performance and sharing your findings with us! I will try to add a multicore implementation to speed it up.

Best,

Julie

0
2.8 years ago by
alexgraehl10
United States
alexgraehl10 wrote:

Hi Julie,

Probably using fake data was not a great example, then—I have actually done it with a 2 megabase (real) sequence, which *does* work if I split it like this:

>part1 (split part 1 of 30 or something)

ACGT... etc  (1–100,000 bases)

>part2 (split part 2 of 30 or something)

AGGCT... etc (100,001–200,000 bases)

etc....

And it does work like this, but it does NOT work if I make one gigantic chunk (i.e., I don't split it up into >part1, >part2, etc...).
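The splitting "hack" can be scripted rather than done by hand. Here is a minimal sketch, assuming the Biostrings package is installed and that the input lives in a hypothetical file named region.fasta containing one giant record; the chunk size and output name are arbitrary choices:

# Sketch only: split one long FASTA record into 100 kb chunks.
library(Biostrings)
seqs <- readDNAStringSet("region.fasta")   # "region.fasta" is a placeholder name
big  <- seqs[[1]]                          # the single giant record
chunkSize <- 100000
starts <- seq(1, length(big), by = chunkSize)
parts  <- DNAStringSet(lapply(starts, function(s)
              subseq(big, start = s, end = min(s + chunkSize - 1, length(big)))))
names(parts) <- paste0("part", seq_along(parts))
writeXStringSet(parts, "region.split.fasta")

The resulting multi-record file can then be fed to offTargetAnalysis in place of the single-record original.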

This may be kind of a niche use case and not worth handling, but it would be nice if the documentation mentioned either a way to search for sgRNA sites in a huge region (maybe by disabling the search for pairs?), or that this is not possible.

Is there any way to disable the paired searching? I definitely don't care about that particular aspect; I'd just like to get the target specificity / off-target effects / "does-the-RNA-fold-into-something-and-become-unsuitable-for-targeting" information.

Currently I have been using "Benchling.com" to do this, but it requires a ton of clicking and is obviously inferior for large workflows as compared to some handy R code!

Alex,

Thanks for the great idea to document the special use case! I am testing the multicore implementation, which should speed up the search for large sequences. Once it passes the tests, I will let you know the version to download. In addition, I will try to add a parameter called annotatePaired, defaulting to TRUE; for your use case, you can set it to FALSE. I hope to get this feature into the coming release.

Best,

Julie
modified 2.8 years ago by Dan Tenenbaum ♦♦ 8.2k • written 2.8 years ago by Julie Zhu3.8k

The description of the problem makes it sound like the evaluation time increases quadratically with the size of the data (e.g., taking 4 times as long for data that is only twice as big); if splitting the data in two allows it to run in a 'reasonable' time, then that implies that the algorithm does not have to be quadratic. There are many ways in which small changes to R code can change the algorithm from quadratic to linear, so a realistic example and code analysis would seem to be beneficial.

0
2.8 years ago by
Julie Zhu3.8k
United States
Julie Zhu3.8k wrote:

Alex, I meant version 1.9.13. Thanks!

0
2.8 years ago by
Julie Zhu3.8k
United States
Julie Zhu3.8k wrote:

Alex,

I have added the following to the user guide of CRISPRseek version 1.9.14.

2.9 Scenario 9. gRNA search and offTarget analysis of super long input sequence (longer than 200kb)

Calling the function offTargetAnalysis with annotatePaired = FALSE and enable.multicore = TRUE will improve the performance. We also suggest splitting the super long sequence into smaller chunks and performing offTarget analysis for each subsequence separately (thanks to Alex Williams for sharing this use case at https://support.bioconductor.org/p/72994/). In addition, please remember to use repeat-masked sequence as input.
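For concreteness, a hedged sketch of such a call: annotatePaired and enable.multicore are the parameters named above (they require CRISPRseek 1.9.14 or later), the input file name is a placeholder, and the remaining arguments follow the test script earlier in this thread:

# Sketch only: assumes CRISPRseek >= 1.9.14 and the hg19 annotation
# packages loaded as in the earlier test script. "myregion.fasta" is
# a hypothetical input file name.
results <- CRISPRseek::offTargetAnalysis(inputFilePath = "myregion.fasta",
                                         annotatePaired = FALSE,    # skip paired-gRNA annotation
                                         enable.multicore = TRUE,   # parallelize the search
                                         chromToSearch = "chr1",
                                         BSgenomeName = Hsapiens,
                                         txdb = TxDb.Hsapiens.UCSC.hg19.knownGene,
                                         orgAnn = org.Hs.egSYMBOL,
                                         max.mismatch = 1,
                                         outputDir = "myregion.out", overwrite = TRUE)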

Many thanks for your feedback and great suggestion!

Best regards,

Julie

0
2.7 years ago by
alexgraehl10
United States
alexgraehl10 wrote:

Hi,

I have run the new and updated CRISPRseek 1.10 on a small sample successfully!

Thanks for updating the CRISPRseek package!

(The 2 megabase region I was interested in is still running after 5 days, but maybe that will finish too... the sequences have been found, but the Summary file with their actual locations in the input hasn't been generated yet.)

I wanted to report an issue that I always see: since the "GeneRfold" library is seemingly no longer available, CRISPRseek reports an error message before exiting on every single run. It seems like GeneRfold just doesn't work anymore?

Here is the text I see upon completion of any run of CRISPRseek:

[1] "MY_ERROR:   Error in library(GeneRfold): there is no package called 'GeneRfold'\n"

Done. Please check output files in directory (output dir)

I looked in the source, and it looks like CRISPRseek just skips over GeneRfold in this case. It looks like GeneRfold might also interact somehow with the "ViennaRNA" package, but I didn't dig too deeply.

I assume GeneRfold is unlikely to be updated again, so it may be worth finding a workaround and/or removing the folding entirely, although that would be sort of unfortunate since I am told that this is an important part of CRISPR guide RNA design.

GeneRfold was removed from Bioconductor with the 2.10 release. It looks like the devel version of CRISPRseek does not refer to GeneRfold at all. The CRISPRseek maintainers should remove the reference to it from the release version as well.

Removed packages are listed here: http://bioconductor.org/about/removed-packages/

0
2.7 years ago by
alexgraehl10
United States
alexgraehl10 wrote:

Also, I have always had to set multicore = FALSE, or I get this error:

Error in do.call(rbind, lapply(1:length(subjects), function(p) { :

error in evaluating the argument 'args' in selecting a method for function 'do.call': Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  :

all connections are in use

Not sure if that's a problem with my setup (R 3.2.2 on RedHat Enterprise Linux 6.5, I think), or what, but the easy workaround is to just set multicore=FALSE.

0
2.7 years ago by
alexgraehl10
United States
alexgraehl10 wrote:

It appears that I have 128 sockets on this machine. I was trying to run a 2 megabase region, so perhaps the region is too large or it gets split too many ways? Possibly the "multicore" could pick a lower and/or user-defined limit?

> showConnections(all=T)

125 "<-localhost:11096" "sockconn" "a+b" "binary" "opened" "yes"    "yes"

126 "<-localhost:11096" "sockconn" "a+b" "binary" "opened" "yes"    "yes"

127 "<-localhost:11096" "sockconn" "a+b" "binary" "opened" "yes"    "yes"

0
2.7 years ago by
alexgraehl10
United States
alexgraehl10 wrote:

Bug report and proposed solution: multicore=TRUE will *ALWAYS* fail on a server with a very large number of cores. (Specifically, I am running this on a machine with 160 cores, which is greater than the maximum allowed number of file connections (128).)

Here is the issue:

A maximum of 128 connections can be allocated (not necessarily open) at any one time. Three of these are pre-allocated (see stdout). The OS will impose limits on the numbers of connections of various types, but these are usually larger than 125. https://stat.ethz.ch/R-manual/R-devel/library/base/html/connections.html

Proposed solution: hard-code a maximum number of allowed connections to something like 32 or 64.

So the problem for multicore=TRUE is that the function:

> detectCores()

reports the number of CPU cores, which is normally less than the R maximum number of sockets (most machines have 4 or 8 cores).

However, I am running this on a ridiculous server with 160 cores.

So detectCores returns 160:

[1] 160

Which is larger than the maximum number of sockets!

My recommendation would be to change all instances of:

n.cores <- detectCores() - 1
n.cores <- min(n.cores, x)

to something like:

CONST_MAX_ALLOWED_CORES <- 64
n.cores <- min(detectCores() - 1, CONST_MAX_ALLOWED_CORES)
n.cores <- min(n.cores, x)

Otherwise this parameter will always fail on machines with a large number of cores!

Great that you found the problem. Will do. Thanks!
modified 2.7 years ago by Martin Morgan ♦♦ 22k • written 2.7 years ago by Julie Zhu3.8k

A better solution (IMO) for CRISPRseek is to use BiocParallel::bplapply() rather than parLapply(). For Alex' machine this would use multicore (forked processes), whereas parLapply is using SNOW (independent processes communicating via sockets). The way to regulate the number of workers (forked or otherwise) in BiocParallel is to set the option options(mc.cores=...) or to register(MulticoreParam(<cores you want to use>)) before calling the function(s) that use parallel evaluation; use register(SerialParam()) to evaluate on a single core.
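A minimal sketch of the pattern Martin describes, assuming BiocParallel is installed and running on Linux or macOS (MulticoreParam uses forked processes, which are not available on Windows); the worker count and the toy workload are arbitrary:

# Sketch only: cap the number of forked workers, then use bplapply
# in place of parLapply (no socket connections are needed).
library(BiocParallel)
register(MulticoreParam(workers = 4))    # at most 4 forked workers
res <- bplapply(1:8, function(i) i^2)    # parallel drop-in for lapply
# register(SerialParam())                # fall back to single-core evaluation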

Thanks, Martin!

Best,

Julie
0
2.7 years ago by
Julie Zhu3.8k
United States
Julie Zhu3.8k wrote:

Alex,

Thanks for the feedback!  FYI, Herve has kindly agreed to help with the speed issue.

You can set foldgRNAs = FALSE to disable the code from GeneRfold. However, if you are interested in the folding output, please download GeneRfold at http://bioconductor.case.edu/bioconductor/2.8/bioc/html/GeneRfold.html and GeneR at http://bioconductor.case.edu/bioconductor/2.8/bioc/html/GeneR.html, and install these two packages using the following commands.

R CMD INSTALL GeneRfold_1.10.0.tar.gz
R CMD INSTALL GeneR_2.22.0.tar.gz
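For example, under the same assumptions as the earlier test script (only foldgRNAs comes from Julie's note; the file name here is a placeholder), disabling the folding step looks like:

# Sketch only: foldgRNAs = FALSE skips the GeneRfold-based folding
# of candidate gRNAs. "myregion.fasta" is a hypothetical input file.
results <- CRISPRseek::offTargetAnalysis(inputFilePath = "myregion.fasta",
                                         foldgRNAs = FALSE,
                                         chromToSearch = "chr1",
                                         BSgenomeName = Hsapiens,
                                         txdb = TxDb.Hsapiens.UCSC.hg19.knownGene,
                                         orgAnn = org.Hs.egSYMBOL,
                                         outputDir = "myregion.out", overwrite = TRUE)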

Best regards,

Julie

I will do that, thanks!

It appears that GeneRFold also has a dependency on Vienna:

configure: error: viennaRNA library >= 1.5 requiered http://www.tbi.univie.ac.at/~ivo/RNA/

ERROR: configuration failed for package 'GeneRfold'

* removing '/usr/lib64/R/library/GeneRfold'

Installing packages that are no longer supported is not a good idea -- they were removed because they did not work, not for an arbitrary reason. From the page of removed packages the most recent version of GeneRfold is in Bioconductor 2.9, but that is from 2011! There is a lot of water under the bridge since then.

Thanks for the comment! I totally agree that it is worth not being overly optimistic about the chances of successfully running unmaintained packages.

In this particular case, I was also wary of this old package, but manual installation of Vienna + GeneRfold + data.table + GeneR actually did seem (at least at first glance) to work properly. I have no idea if the output will actually work, though!

Bioconductor packages seem to be obsoleted in very aggressive fashion, so I wouldn't be surprised to find GeneRfold actually still working, but I equally wouldn't be surprised if some under-the-hood change in the last 5 years has broken it.

Actually if you look at the list of removed packages compared to the number of packages we add with each release, it's a very small number (in fact, no packages were removed with the most recent release). We only remove packages when they are completely broken and the maintainer does not fix them after repeated requests (or does not respond to repeated requests/cannot be reached), or the maintainer has asked for it to be removed. You can be assured there was a good reason this package was removed, and that it was not arbitrary or capricious.

Looking through my old emails, it turns out that GeneRfold was removed because it has a dependency on GeneR which is another package that went through the deprecated/defunct cycle, and the GeneRfold maintainer was asked to work around this dependency and never replied.

Even though you got the package to install, it may not successfully build or check; in any case, its use is not officially supported.

0
2.7 years ago by
alexgraehl10
United States
alexgraehl10 wrote:

Current status: even with four 2 GHz Xeon cores running over the weekend, not much headway has been made after 3 days on my 2 megabase region of interest. Maybe this region is just way too big to be even remotely reasonable?

I ran CRISPRSeek with multicore enabled, and each of the four multi-core processes consumed 32 GB of RAM, which ended up using up basically all the RAM and swap space on the machine.

Here is the R output (it's still running):

Validating input ...
Searching for gRNAs ...
>>> Finding all hits in sequence chr3 ...
>>> DONE searching
Building feature vectors for scoring ...

The files that have been generated so far (from 2 days ago), are:

REcutDetails.xls
myregion_allgRNAs.fa
myregion.gbk.gbk

It looks like all the cut regions have been found (in REcutDetails.xls), but unfortunately no scores are provided in this file. (Scores are usually generated and provided in a later file, but I think this never happens with a 2 megabase input region).

I tried splitting the 2 megabase region up into 20 100,000 base regions—this actually ran in a reasonable amount of time, but it is a bit annoying since the coordinates have to be manually adjusted at the end.
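The manual coordinate adjustment can be scripted. A minimal sketch, assuming the chunks are named "part1", "part2", ... and that chunk k covers bases (k-1)*100000+1 onward of the original region (chunkToGlobal is a hypothetical helper, not part of CRISPRseek):

# Hypothetical helper: convert a within-chunk coordinate back to a
# coordinate in the original 2 Mb region, assuming fixed-size chunks.
chunkToGlobal <- function(partName, posInChunk, chunkSize = 100000) {
    k <- as.integer(sub("part", "", partName))
    (k - 1) * chunkSize + posInChunk
}
chunkToGlobal("part3", 500)   # base 500 of part3 is base 200500 overall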

0
2.7 years ago by
alexgraehl10
United States
alexgraehl10 wrote:

I end up with this error result (see the error output below). I assume this has something to do with filenames / connections, but I'm not sure what exactly causes it. Possibly next time I will run the script with "options(error=recover)" to try to diagnose it. This is a very long-running job, though, so it could take a while.

>>> Finding all hits in sequence chr3 ...

>>> DONE searching

Building feature vectors for scoring ...

Warning message:

In .MTB_PDict(x, max.mismatch, algo) :

  given the characteristics of dictionary 'x', this value of 'max.mismatch' will

  give poor performance when you call matchPDict() on this MTB_PDict object

  (it will of course depend ultimately on the length of the subject)

Error in unlist(parLapply(cl, 1:dim(mismatch.pos)[1], function(i) { :

  error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in unserialize(node$con) : error reading from connection

0
2.7 years ago by
alexgraehl10
United States
alexgraehl10 wrote:

Q: I notice that there are two spellings used in the documentation:

gRNAefficacy / calculategRNAEfficacy

calculategRNAEfficiency

Both "efficacy" and "efficiency" are of course valid words! I am assuming that one of these is a typo, and that "efficiency" is the preferred terminology?