# Good idea! Here is the code I ran for CRISPRseek. It generates a fake FASTA file (up to 3 megabases, configurable) for testing, so you don't need any other input files.
If you break the 3-megabase sequence up into much smaller sequences, the program actually finishes in a few hours. So it isn't just a matter of there being too much sequence; something about having a large sequence in a SINGLE FASTA record makes it slow.
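As a concrete illustration of that workaround, here is a minimal base-R sketch (my own helper, not part of CRISPRseek; the function name and chunk size are my choices) that splits a single large FASTA record into many small records:

```r
# Hypothetical helper (assumes a plain, single-record FASTA input):
# split one large record into fixed-size chunk records so that each
# FASTA entry stays small.
split_fasta_record <- function(infile, outfile, chunk_size = 10000) {
    lines <- readLines(infile)
    dna <- paste(lines[!startsWith(lines, ">")], collapse = "")
    starts <- seq.int(1, nchar(dna), by = chunk_size)
    out <- character(0)
    for (i in seq_along(starts)) {
        s <- starts[i]
        e <- min(s + chunk_size - 1, nchar(dna))
        out <- c(out, paste0(">chunk_", i), substr(dna, s, e))
    }
    writeLines(out, outfile)
}
```

Running offTargetAnalysis on the chunked file should then finish in hours rather than days, per the observation above (with the caveat that gRNAs spanning chunk boundaries will be missed).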
Test code below. This should run if CRISPRseek is installed (with numLines set to 30000, it will take multiple days to finish):
if (!require("CRISPRseek")) {
print("You need to install CRISPRseek via bioconductor... see the commented-out code in the source")
# source("http://bioconductor.org/biocLite.R") ; biocLite("CRISPRseek")
}
if (!require(BSgenome.Hsapiens.UCSC.hg19)) {
print("You should install the hg19 BSgenome data via bioconductor. See the comments in the source!")
# source("http://bioconductor.org/biocLite.R") ; biocLite("BSgenome.Hsapiens.UCSC.hg19")
}
if (!require(TxDb.Hsapiens.UCSC.hg19.knownGene)) {
print("You should install TxDb for your species also!")
# source("http://bioconductor.org/biocLite.R") ; biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")
}
if (!require(org.Hs.eg.db)) {
print("You should install org.Hs.eg.db !")
# source("http://bioconductor.org/biocLite.R") ; biocLite("org.Hs.eg.db")
}
# Make a FAKE fasta file for testing, then use it
FAKE_FASTA_FILE_NAME <- "test.fasta"
FA <- FAKE_FASTA_FILE_NAME # defined outside the if-block: FA is used below even when MAKE_FAKE_TEST is FALSE
MAKE_FAKE_TEST <- TRUE
if (MAKE_FAKE_TEST) {
    set.seed(12345)
    fc <- file(FAKE_FASTA_FILE_NAME)
    lineLen <- 100
    numLines <- 4 # <-- change THIS number to change the number of lines generated. Set it to 3 to run super quickly, or to 30000 to take many days
    print(paste("Generating a total of this many bases:", numLines*lineLen))
    fakeDNA <- character(numLines) # pre-allocate a character vector
    for (i in 1:numLines) {
        fakeDNA[i] <- paste(sample(c("A","C","G","T"), lineLen, replace=TRUE), collapse="")
    }
    writeLines(c(">test_fake_3_megabase_region", fakeDNA), fc)
    close(fc)
    print("Done creating a fake fasta file for testing.")
}
OUTDIR <- paste(FA, ".out.dir", sep='')
dir.create(OUTDIR, showWarnings=FALSE) # no warning if the directory already exists
print("The function below will take DAYS to run and will still not generate any output if the input file is large (2+ megabases).")
# "Calling the function offTargetAnalysis with chromToSearch='"' results in quick gRNA search without performing on-target and off-target analysis"
results <- CRISPRseek::offTargetAnalysis(inputFilePath=FA
, findgRNAsWithREcutOnly=FALSE
, findPairedgRNAOnly=FALSE, findgRNAs=TRUE
, chromToSearch = 'chr1' # just to make it faster --- set to "all" to check all chromosomes
, BSgenomeName=Hsapiens, txdb=TxDb.Hsapiens.UCSC.hg19.knownGene, orgAnn=org.Hs.egSYMBOL
, PAM.size=3, PAM="NGG"
, gRNA.size=20, upstream=0, downstream=0 # default is 200
, max.mismatch=1, outputDir=OUTDIR, overwrite = TRUE)
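Before committing to a multi-day run, a quick sanity check (my addition, base R only; the function name is my own) can confirm the generated file really contains numLines * lineLen bases:

```r
# Added sketch: count the non-header bases in a FASTA file, to verify
# the fake input before launching the long analysis.
count_fasta_bases <- function(path) {
    lines <- readLines(path)
    sum(nchar(lines[!startsWith(lines, ">")]))
}
# count_fasta_bases(FA) should equal numLines * lineLen from the script above
```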
A much more helpful report would be a short reproducible example that illustrates the problem. Then the author of the package can address the issue, and both you and the rest of the community will benefit. A similar case is discussed in the recent Bioconductor newsletter, where such a report led to a single-line edit and a change in execution time from an estimated 3.5 days to 6 minutes. How far down this path can you take this question? Can you identify the bottleneck and suggest a solution?
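One generic way to start hunting for the bottleneck (a sketch using base R's sampling profiler, not something run against CRISPRseek itself) is to wrap the slow call in Rprof() and look at where self-time accumulates:

```r
# Generic profiling sketch: the dummy workload below merely stands in
# for the slow offTargetAnalysis() call; substitute the real call to
# profile it.
prof_file <- tempfile()
Rprof(prof_file)
x <- replicate(40, sort(runif(3e5)))   # placeholder for the slow call
Rprof(NULL)
head(summaryRprof(prof_file)$by.self)  # functions ranked by self-time
```

If the top of by.self is dominated by routines whose cost grows with the total length of a record, that would point directly at the single-large-record slowdown described above.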