Good idea! Here is the code I ran for CRISPRseek. It generates a fake 3 megabase FASTA file for testing, so you don't need any other files.
If you break up the 3 megabase sequence into much smaller sequences, the program actually finishes in a few hours. So the issue isn't just the total amount of sequence; something about having a large sequence in a SINGLE FASTA record makes it slow.
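For anyone who wants to try the "many small records" workaround, here is a minimal base-R sketch of how the splitting could be done. The helper name `split_into_records`, the chunk size, and the header format are my own assumptions, not anything from CRISPRseek. Note that it does not overlap the chunks, so any gRNA + PAM site that spans a chunk boundary would be missed.

```r
# Hypothetical helper (not part of CRISPRseek): split one long sequence
# into fixed-size chunks and return FASTA lines, one record per chunk.
split_into_records <- function(dna, chunk_size = 10000, line_len = 100,
                               prefix = "chunk") {
  stopifnot(nchar(dna) > 0)
  starts <- seq(1, nchar(dna), by = chunk_size)
  ends <- pmin(starts + chunk_size - 1, nchar(dna))
  chunks <- substring(dna, starts, ends)
  records <- character(0)
  for (i in seq_along(chunks)) {
    # Wrap each chunk to at most line_len characters per line
    line_starts <- seq(1, nchar(chunks[i]), by = line_len)
    line_ends <- pmin(line_starts + line_len - 1, nchar(chunks[i]))
    lines <- substring(chunks[i], line_starts, line_ends)
    # Header records the chunk index and its coordinates in the original
    header <- sprintf(">%s_%d_%d_%d", prefix, i, starts[i], ends[i])
    records <- c(records, header, lines)
  }
  records
}

# Small demonstration: 2500 random bases split into 1000-base records
set.seed(12345)
dna <- paste(sample(c("A", "C", "G", "T"), 2500, replace = TRUE), collapse = "")
records <- split_into_records(dna, chunk_size = 1000)
writeLines(records, "test_split.fasta")
```

The resulting file can then be passed as `inputFilePath` in place of the single-record file. Keeping the original coordinates in each header makes it possible to map results back to the full sequence afterwards.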
Test code below. This should run if CRISPRseek is installed (but will take multiple days to finish):
if (!require("CRISPRseek")) {
print("You need to install CRISPRseek via bioconductor... see the commented-out code in the source")
}
if (!require(BSgenome.Hsapiens.UCSC.hg19)) {
print("You should install the hg19 BSgenome data via bioconductor. See the comments in the source!")
}
if (!require(TxDb.Hsapiens.UCSC.hg19.knownGene)) {
print("You should install TxDb for your species also!")
}
if (!require(org.Hs.eg.db)) {
print("You should install org.Hs.eg.db !")
}
FAKE_FASTA_FILE_NAME <- "test.fasta"
# FA is set outside the if-block so the code below still works when MAKE_FAKE_TEST is FALSE
FA <- FAKE_FASTA_FILE_NAME
MAKE_FAKE_TEST <- TRUE
if (MAKE_FAKE_TEST) {
set.seed(12345)
fc <- file(FAKE_FASTA_FILE_NAME)
lineLen <- 100
numLines <- 30000  # 30000 lines x 100 bases per line = 3 megabases
print(paste("Generating a total of this many bases:", numLines*lineLen))
fakeDNA <- character(numLines)  # pre-allocate one string per FASTA line
for (i in 1:numLines) {
fakeDNA[i] <- paste(sample(c("A","C","G","T"), lineLen, replace=TRUE), collapse="")
}
writeLines(c(">test_fake_3_megabase_region",fakeDNA), fc)
close(fc)
print("Done creating a fake fasta file for testing.")
}
OUTDIR <- paste0(FA, ".out.dir")
dir.create(OUTDIR, showWarnings = FALSE)  # no warning if the directory already exists
print("The function below will take DAYS to run and will still not generate any output if the input file is large (2+ megabases).")
results <- CRISPRseek::offTargetAnalysis(inputFilePath=FA
, findgRNAsWithREcutOnly=FALSE
, findPairedgRNAOnly=FALSE, findgRNAs=TRUE
, chromToSearch = 'chr1'
, BSgenomeName=Hsapiens, txdb=TxDb.Hsapiens.UCSC.hg19.knownGene, orgAnn=org.Hs.egSYMBOL
, PAM.size=3, PAM="NGG"
, gRNA.size=20, upstream=0, downstream=0
, max.mismatch=1, outputDir=OUTDIR, overwrite = TRUE)
A much more helpful report would provide a short reproducible example that illustrates the problem. Then the author of the package can address the issue, and both you and the rest of the community will benefit. An example of this is discussed in the recent Bioconductor newsletter, where a similar report led to a single-line edit that cut the execution time from an estimated 3.5 days to 6 minutes. How far down this path can you take this question? Can you identify the bottleneck and suggest a solution?
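One way to start narrowing this down is to time the call on a few small, increasing input sizes and see how the run time grows. The sketch below is only an illustration: `run_analysis` is a hypothetical stand-in that I made deliberately inefficient, not CRISPRseek code, and the sizes are arbitrary. To test the real thing, its body would be replaced with the actual `offTargetAnalysis` call on a generated file of `n` bases.

```r
# Hypothetical stand-in for the slow call. It scans every 23-base window
# (20-base gRNA + 3-base PAM) and keeps those ending in "GG", growing the
# result vector one element at a time (a common cause of slow scaling).
run_analysis <- function(n) {
  dna <- paste(sample(c("A", "C", "G", "T"), n, replace = TRUE), collapse = "")
  hits <- character(0)
  for (i in seq_len(n - 22)) {
    kmer <- substr(dna, i, i + 22)
    if (endsWith(kmer, "GG")) hits <- c(hits, kmer)
  }
  hits
}

# Elapsed wall-clock seconds for one call of f on an input of size n
time_one <- function(f, n) {
  unname(system.time(f(n))["elapsed"])
}

set.seed(12345)
sizes <- c(1000, 2000, 4000)
elapsed <- vapply(sizes, function(n) time_one(run_analysis, n), numeric(1))
timings <- data.frame(bases = sizes, seconds = elapsed)
print(timings)
```

If doubling the input much more than doubles the elapsed time, the scaling is worse than linear, and that would be consistent with one large record being far slower than many small ones. From there, wrapping a small run in base R's `Rprof()` and reading the result with `summaryRprof()` should show which functions account for most of the time.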