Question: derfinder: analyzeChr() breaks on chromosomes > 22
0
4.6 years ago by
United States
jessica.hekman40 wrote:

Leo -- thanks for your recent help with fullCoverage() in derfinder. After your updates, that function began working very well for me. However, I have discovered that analyzeChr() is still having issues with chromosomes > 22. When I give it a chromosome in the human range, it does fine, but once we get up into higher numbers, it breaks.

R 3.1.1
Bioconductor version 3.0 (BiocInstaller 1.16.0)
derfinder 1.0.4
GenomeInfoDb 1.2.2 (version including the Canis familiaris genome update)

My script:

library(derfinder)
library(GenomeInfoDb)
library(GenomicRanges)

args <- commandArgs(trailingOnly = TRUE)

perms = 200  # for alpha = 0.05
cores = 15   # too many -> memory problems
#perms = 1
#cores = 25
chroms = c(args[1])

# Set up our groups
tameFileNames <- lapply(tameIds, function(id) { return(paste("tophat-juncs-", id, sep="")) } )
groups <- as.factor(sapply(names(files), function(filename) { ifelse(grepl(filename, tameFileNames), "tame", "aggr") }))

# Initial data analysis over all chromosomes:
print("Starting fullCoverage...")
fullCov <- fullCoverage(files, chrs = chroms, species = 'canis_familiaris')
sampleDepths <- sampleDepth(collapseFullCoverage(fullCov), 1)
models <- makeModels(sampleDepths, testvars = groups)

# Keep only bases with coverage > 2
print("Filtering...")
filteredCov <- lapply(fullCov, filterData, cutoff = 2)

originalWd <- getwd()
setwd(file.path(originalWd, 'derResults'))

# Analyze each chromosome individually (analysis will be parallelized
# and significance will be evaluated by permutations).
for (chrom in chroms) {
print(chrom)
currentChrom <- paste("chr", chrom, sep="")
analyzeChr(chr = currentChrom, filteredCov[[currentChrom]], models, mc.cores=cores, groupInfo = groups, writeOutput = TRUE, cutoffFstat = 0.05, nPermute = perms, seeds = 19731107 + seq_len(perms))
}

print("Chromosome analysis completed.")

The error on chromosomes > 22:

2014-11-02 12:44:36 analyzeChr: Annotating regions
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
subscript contains NAs or out-of-bounds indices
Calls: analyzeChr ... extractROWS -> normalizeSingleBracketSubscript -> NSBS -> NSBS
In .Seqinfo.mergexy(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)
Execution halted

I also noticed this output message during execution of analyzeChr():

nearestgene: loading bumphunter hg19 transcript database

I'm guessing we should not be using hg19. In fact, this makes  me wonder if the other low-numbered chromosomes which completed correctly have valid results or if they need to be re-run against a different transcript database.

Any help you can give is much appreciated. I suspect that the only solution will involve you editing the code again, for which I'm sorry if it's true!

Jessica


derfinder • 2.0k views
modified 4.6 years ago • written 4.6 years ago by jessica.hekman40

You might also want to check how to create a TxDB.* package (or at least the object) for Canis familiaris as I don't see one at http://www.bioconductor.org/packages/release/BiocViews.html#___AnnotationData Check more details at http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf

Sure. How would I use such an object (I'm not sure where it fits in to my workflow)?

I think that bumphunter::annotateNearest() could in principle use TxDb objects. If that's true, then you'll need one for Canis familiaris.

But well, I don't know if my 1st sentence is true

Answer: derfinder: analyzeChr() breaks on chromosomes > 22
1
4.6 years ago by
United States

Hi Jessica,

There are two different issues in this. First, I had not updated the BioC release version of derfinder to match the changes in 1.2.1 from GenomeInfoDb. That has been fixed already. Basically, I wasn't sure if they had merged the changes from GenomeInfoDb Bioc-devel version 1.3.2 to Bioc-release.

The second issue is that derfinder relies on bumphunter http://www.bioconductor.org/packages/release/bioc/html/bumphunter.html to annotate the candidate DERs. derfinder uses bumphunter::annotateNearest() and this function currently only supports Homo sapiens build hg19. You will need to generate your own annotation object for Canis familiaris and it should look similar to what you get running:

> library(bumphunter)
> data(TT, package = "bumphunter", envir = environment())
> TT

I highly recommend creating a new question with the bumphunter tag on how to proceed with it. Meanwhile, you can still use derfinder to find the DERs by using analyzeChr(runAnnotation = FALSE). Note that your previous results for chrs 1 to 19 were annotated versus Homo sapiens because the defaults for analyzeChr() are subject = 'hg19', runAnnotation = TRUE, so the annotation results will not make sense. You don't have to re-run those chromosomes, simply remove the annotation information.

The following code  works and generates  Note that I used some artificial models and cutoffs for the toy example. The BAM and BAI file used are those you previously posted.

Used derfinder 1.0.6, derfinderHelper 1.0.3, and bumphunter 1.6.0

Best,
Leonardo

Hi,

I created https://github.com/ririzarr/bumphunter/issues/1 regarding using bumphunter::annotateNearest() with genomes other than hg19.

Cheers,

Leo

Answer: derfinder: analyzeChr() breaks on chromosomes > 22
0
4.6 years ago by
United States
jessica.hekman40 wrote:

Great, thank you. I have posted a request for bumphunter help -- see Using bumphunter with non-human genomes.

I too was surprised to not get an email update, because I had set my options to request one -- but I'm getting email updates now so it's all good :)

I'm checking to make sure that the code you suggested actually does what it should on my end with the real BAMs. When it does, I'll accept this answer.

Thanks,

Jessica

OK, this is perplexing. Your code works fine for me. Then I modified it to run on my actual BAM files, not the little sample:

library('derfinder')

## Set global species option

options(species = 'canis_familiaris')

# Set up our groups
tameFileNames <- lapply(tameIds, function(id) { return(paste("tophat-juncs-", id, sep="")) } )
groups <- as.factor(sapply(names(files), function(filename) { ifelse(grepl(filename, tameFileNames), "tame", "aggr") }))

fullCov <- fullCoverage(files, chrs = c(1))

2014-11-10 12:16:18 loadCoverage: loading BAM file tophat/tophat-juncs-9_129/accepted_hits.bam
2014-11-10 12:16:20 loadCoverage: applying the cutoff to the merged data
Error: 1 errors; first error:
Error in mapSeqlevels(chr, "UCSC"): 'seqnames' must be a character vector

the function and set the argument 'BPRESUME' to TRUE or wrap the
previous call in bpresume().

First traceback:
18: fullCoverage(files, chrs = c(1))
17: bplapply(seq_len(length(chrs)), loadChr, files = files, chrs = chrs,
bai = bai, chrlens = chrlens, outputs = outputs, cutoff = cutoff,
16: bplapply(seq_len(length(chrs)), loadChr, files = files, chrs = chrs,
bai = bai, chrlens = chrlens, outputs = outputs, cutoff = cutoff,
15: lapply(X, FUN, ...)
14: lapply(X, FUN, ...)
13: FUN(1L[[1L]], ...)
12: .try(FUN(...))
11: withRestarts(withCallingHandlers(expr, warning = handler_warning,
error = handler_error), abort = handler_abort)

However, when I run it with the last line like so:

fullCov <- fullCoverage(bam, chrs = c(1:38, 'X'))

it completes without error, even though the only difference I can see is that I'm running it on more than one chromosome. (Derfinder is 1.0.6.)

I'm baffled!

Jessica

1

Hi Jessica,

The answer is simple in this case. When you are using:

fullCov <- fullCoverage(files, chrs = c(1))

The chrs argument is an integer vector, not a character vector. See:

> x <- 1
> class(x)
[1] "numeric"
> y <- c(1:38, 'X')
> class(y)
[1] "character"
> z <- '1'
> class(z)
[1] "character"


So use:

fullCov <- fullCoverage(files, chrs = '1')

I made a small change in version 1.0.7 (devel 1.1.10) so the error message will be more informative.

Cheers,

Leo

Worked! Got another step further, got a new confusing error.

> fullCov <- fullCoverage(files, chrs = '38')

> sampleDepths <- sampleDepth(collapseFullCoverage(fullCov), 1)

> models <- makeModels(sampleDepths, testvars = groups)

> filteredCov <- lapply(fullCov, filterData, cutoff = 2)

> analyzeChr(chr = "chr38", filteredCov[["chr38"]], models, mc.cores=15, groupInfo = groups, writeOutput = TRUE, cutoffFstat = 0.5, nPermute = perms, seeds = 19731107 + seq_len(perms))

2014-11-10 16:50:34 analyzeChr: Pre-processing the coverage data
2014-11-10 16:50:56 analyzeChr: Calculating statistics

2014-11-10 16:45:38 calculateStats: calculating the F-statistics
starting worker pid=3131 on localhost:11928 at 16:45:38.358
Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
cannot open the connection

Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster -> socketConnection
In socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
localhost:11928 cannot be opened
Execution halted

Help?

Jessica

1

See if you can reproduce the problem on that same node/computer running:

library(BiocParallel)
xx <- bplapply(seq_len(1000), function(x) Sys.time(), BPPARAM = SnowParam(15))

My guess is that it'll work fine unless there really is an issue with your node/computer. You can try a larger number than 1000 if you want as this test runs fast (~7 secs).

The error still happens this morning. I tried your suggested code; it's hung for several minutes now with no feedback, which I presume is an error state given that it completed in ~7 seconds for you and I'm running it on a fast machine. I wonder if there were firewall changes made on the machine recently -- do you think that could cause these symptoms? If you don't know more, I'll ask the BiocParallel folks.

> traceback()
13: socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,
open = "a+b", timeout = timeout)
12: newPSOCKnode(names[[i]], options = options, rank = i)
11: makePSOCKcluster(spec, ...)
10: (function (spec, type = getClusterOption("type"), ...)
{
switch(type, PSOCK = makePSOCKcluster(spec, ...), FORK = makeForkCluster(spec,
...), SOCK = snow::makeSOCKcluster(spec, ...), MPI = snow::makeMPIcluster(spec,
...), NWS = snow::makeNWScluster(spec, ...), stop("unknown cluster type"))
})(spec = 5, type = "PSOCK", outfile = "")
9: do.call(makeCluster, x$.clusterargs) 8: bpstart(BPPARAM) 7: bpstart(BPPARAM) 6: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, BPRESUME = BPRESUME, BPPARAM = BPPARAM) 5: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, BPRESUME = BPRESUME, BPPARAM = BPPARAM) 4: bplapply(mclapplyIndex, fstats.apply, data = coverageProcessed, mod = models$mod, mod0 = models$mod0, lowMemDir = lowMemDir, scalefac = scalefac, method = method, adjustF = adjustF, BPPARAM = BPPARAM) 3: bplapply(mclapplyIndex, fstats.apply, data = coverageProcessed, mod = models$mod, mod0 = models$mod0, lowMemDir = lowMemDir, scalefac = scalefac, method = method, adjustF = adjustF, BPPARAM = BPPARAM) 2: calculateStats(coveragePrep = prep, models = models, lowMemDir = lowMemDir, ...) 1: analyzeChr(chr = "chr1", filteredCov[["chr1"]], models, mc.cores = 5, groupInfo = groups, writeOutput = TRUE, cutoffFstat = 0.5, nPermute = 2, seeds = 19731107 + seq_len(2))  On that other note, I didn't understand what sampleDepth() was doing. Because running derfinder on even one chromosome takes 4-5 hours with 200 permutations, and setting mc.cores=15 doesn't speed it up much, I had settled on running separate R processes for each chromosome and then merging them together at the end with a single process. So one process for fullCoverage() + analyzeChr() for each chromosome, then when that's all done, start a new R process and merge the results all together. I take it this is actually suboptimal due to the inability to process different sample depths prior to running analyzeChr() on each chromosome? Jessica ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by jessica.hekman40 1 Hi Jessica, Ask the BiocParallel folks about the error using the short code I gave you that gets the system time 1000 times. As for sampleDepth(), what part of the docs describing it is confusing? That way I can try to improve the docs. As I said before, you should use the same model for each chromosome and feed sampleDepth() the fullCoverage() output from all the chrs. For processing one chromosome at a time with analyzeChr(), you could simply save each element of your filteredData list object to a Rdata file, then load it before using analyzeChr(). ADD REPLYlink written 4.6 years ago by Leonardo Collado Torres640 1 Thanks, I will proceed with the BiocParallel folks. As for sampleDepth(), I returned to the docs and they're pretty clear: "We can use sampleDepth() and it's helper function collapseFullCoverage() to get a proxy of the library size. Note that you would normally use the unfiltered data from all the chromosomes in this step and not just one." I think the problem is that I set it up expecting to run the whole script as a single process, and was surprised when I realized it was going to take about 10 days to complete, and that I could get a lot more bang for my buck by splitting up among multiple processors myself. When I split it up, I didn't think to deal with changing the part of my code that handled sampleDepth(). So if you're looking for how to improve the docs (I am not complaining, by the way!) then I'd say you could do so by explicitly addressing the fact that it's going to take a while to run, you might want to run it in multiple processes simultaneously, and what is the best workflow to do that (what can be run separately and what must be run all together). Hope that helps and that I'm not just missing something. Jessica ADD REPLYlink written 4.6 years ago by jessica.hekman40 Digging into the BiocParallel error, I find that the code you gave me will run fine if I remove "BPPARAM = SnowParam(15)". This makes me wonder why the error suddenly appeared on my machine very recently -- did you update something in the most recent derfinder? Maybe start specifying SnowParam() when you had not previously? ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by jessica.hekman40 If you remove the "BPPARAM = SnowParam(15)" you will be using a different type of cluster. Check the output of: bpparam() This is further described in the BiocParallel vignette. In any case, derfinder has been using SnowParam() for parallel calls for a while now (even before it was released in Bioconductor). So it's not a derfinder issue, but either something is wrong with your infraestructure, something changed in BiocParallel, or something changed in snow. Regardless of the case, it's only something the developers of BiocParallel can help you with. ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Leonardo Collado Torres640 Thanks for the input! ADD REPLYlink written 4.6 years ago by Leonardo Collado Torres640 BiocParallel people answered and said the problem is with my system's ability to open ports. I've mailed my admin to see if he changed something recently. He is liable to do things like this without telling us :) However, the BiocParallel people said that "If you're running this is as a parallel job on a single Linux / mac computer, then it's almost always better to use the (default) MulticoreParam." Now, this is clearly my problem and not yours and I'm working on figuring it out on my end, buuuut -- I'm curious why derfinder uses SnowParam and whether it would make sense to use MulticoreParam instead. If you have good reason to use SnowParam, no worries, I'll figure it out on my end, but otherwise, perhaps the easiest path would be switching to MulticoreParam (especially since I imagine other people might have similarly [mis]configured servers and have trouble using derfinder as a result)? Jessica ADD REPLYlink written 4.6 years ago by jessica.hekman40 Hi Jessica, The answer is simple. When using MulticoreParam() the memory usage is around 5 to 8 times higher (it depends on the # of cores) than when using SnowParam() as measured in the cluster I have access to. If you still want to use MulticoreParam(), you can do so by using library('derfinder') library('BiocParallel') analyzeChr(other parts, BPPARAM.custom = MulticoreParam(workers = x)) where x is the number of cores you want to use. Similarly specify the BPPARAM.custom argument for other functions like fullCoverage(). If the memory usage is too high on the analyzeChr() step, then consider using smaller chunks by specifying the chunksize argument described in the details of ?preprocessCoverage. If you want to see more about BPPARAM.custom check https://github.com/lcolladotor/derfinder/blob/master/R/utils.R and http://www.bioconductor.org/packages/release/bioc/vignettes/derfinder/inst/doc/derfinderAdvanced.html#Functions_that_use_multiple_cores (I have to fix a typo there: the advanced argument is BPPARAM.custom, not BPPARAM -- the name change helps avoid name clashes). If using a smaller chunksize doesn't help enough memory wise, consider using less cores with MulticoreParam(). In any case, it's a bit hard to find the perfect balance between memory and computation time, and the type of cluster (SnowParam(), MulticoreParam(), etc), the number of cores, and the chunksize will all come into play. Anyhow, you were able to use SnowParam() successfully until recently, so maybe you can get that to work again with the help of your admin. Cheers, Leo ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Leonardo Collado Torres640 1 Snow launches independent R processes each with it's own memory. Multicore forks processes that share memory until modified. Probably the tools used to assess memory usage are misleading you. In addition, because the snow processes are independent, extra care needs to be taken to ensure that each has all the data required (e.g., loading packages and moving data from the master to the workers), whereas this 'just works' with multicore. The BiocParallel vignette section 6 suggests developers should just invoke bplapply & friends without specifying BPPARAM, allowing the (experienced, as jessica is) user to register() alternative parallel engines. The developer does need to write the FUN of bplapply as though it were running under a snow-like cluster -- loading required packages and ensuring that additional user data is passed to the process using .... ADD REPLYlink written 4.6 years ago by Martin Morgan ♦♦ 23k Regarding the tools used to assess memory usage, that's very likely. But since those are the tools used in my cluster to terminate jobs if memory is too high, I have to work under them. As for derfinder, usage of your preferred type of cluster is allowed through the optional BPPARAM.custom argument, so there's no problem there. Also, the code is written with a snow-like cluster in mind. ADD REPLYlink written 4.6 years ago by Leonardo Collado Torres640 Phew, who knew this would get so complicated. My admin insists he has not changed ipranges on the machine recently and doesn't know why I'd suddenly have trouble accessing ports. So it is perplexing. However, since I can work around the issue by specifying BPPARAM.custom in derfinder (thank you Leo!), I can move forward -- and sounds like since I don't have a cluster the right answer is for me to specify Multicore anyways. Thanks to all. Jessica ADD REPLYlink written 4.6 years ago by jessica.hekman40 I just used your feedback (will be reflected in version 1.0.8 and 1.1.11 for release and devel respectively). See https://github.com/lcolladotor/derfinder/commit/b6c9dac6725843ea1039a7846a530d73a883ac71 ADD REPLYlink written 4.6 years ago by Leonardo Collado Torres640 1 I'm glad it was helpful! I just successfully ran analyzeChr() with your suggested parameter settings. Wow, it completed successfully! Onward to revisit the final issue I have open on derfinder on this site, my merge problem -- now that I can get that far in the code again. ADD REPLYlink written 4.6 years ago by jessica.hekman40 OK, so analyzeChr() ran fine with a single test chromosome in the fullCov object. Good. I then re-ran fullCoverage() on all the chromosomes, saved the results to .RData, and started a new process to run analyzeChr() on a single chromosome using the fullCoverage() object that had all the chromosomes in it, with 200 permutations (in other words, my complete run, not a test). The analyzeChr() script ran nicely for several hours, declared that it had finished the final permutation and was calculating the p-values, and then hung. That was about 15 hours ago. Is it expected for the final p-value calculations to take this long? When I was running analyzeChr() previously it completed in a few hours, nothing close to this. My initial script to generate fullCoverage object and other useful data objects: library(derfinder) ## Set global species option options(species = 'canis_familiaris') files <- rawFiles(datadir="tophat", samplepatt="tophat-juncs-*") # Set up our groups tameIds <- read.table("/home/hekman2/Dropbox/UIUC/transcriptome/hypothalamus/notes/samples-hyp-tame.txt") tameFileNames <- lapply(tameIds, function(id) { return(paste("tophat-juncs-", id, sep="")) } ) groups <- as.factor(sapply(names(files), function(filename) { ifelse(grepl(filename, tameFileNames), "tame", "aggr") })) # Initial data analysis over all chromosomes: print("Starting fullCoverage...") fullCov <- fullCoverage(files, chrs = c(1:38, 'X')) sampleDepths <- sampleDepth(collapseFullCoverage(fullCov), 1) models <- makeModels(sampleDepths, testvars = groups) filteredCov <- lapply(fullCov, filterData, cutoff = 2) My script to run analyzeCoverage: library(derfinder) library(BiocParallel) load(".RData") # Some configuration stuff perms = 200 # for alpha = 0.05 cores = 15 # too many -> memory problems ## Set global species option options(species = 'canis_familiaris') args <- commandArgs(trailingOnly = TRUE) currentChrom <- paste("chr", args[1], sep="") # cutoffFstat = 0.5? print(currentChrom) analyzeChr(chr = currentChrom, filteredCov[[currentChrom]], models, mc.cores=cores, groupInfo = groups, writeOutput = TRUE, cutoffFstat = 0.5, cutoffType = 'empirical', nPermute = perms, seeds = 19731107 + seq_len(perms), BPPARAM.custom = MulticoreParam(workers = cores), verbose = TRUE, runAnnotation = FALSE) print("Chromosome analysis completed.") Note that I set my cutoff at 0.5 (I had been using 0.05 previously) -- could that have made this difference? But the test with a single chromosome also used cutoff of 0.5 and it didn't run this long. Wish I could stop asking for help every two days but I keep getting stuck :( Jessica ADD REPLYlink written 4.6 years ago by jessica.hekman40 1 Hi Jessica, I believe that you were previously using: analyzeChr(cutoffFstat = 0.05) which implied the default for cutoffType = 'theoretical'. That is, analyzeChr(cutoffFstat = 0.05, cutoffType = 'theoretical') In my previous answer in this thread A: derfinder: analyzeChr() breaks on chromosomes > 22 I am using the cutoffType = 'empirical' simply for illustrative purposes because I was using a toy model. By using analyzeChr(cutoffFstat = 0.5, cutoffType = 'empirical') you are saying, take the median of the F-stats for the chromosome and use that as the F-stats cutoff. So contiguous bases with F-stats higher than the median cutoff will be considered candidate DERs. Thus by definition, half of the bases with coverage > 2 for at least one sample (your data cutoff used in filterData(cutoff = 2) earlier) will be contained in a candidate DER. That can be a lot of DERs! You can compare the actual cutoff values used by inspecting the log files. Look for the lines: ## Earlier run DATE analyzeChr: Using the following theoretical cutoff for the F-statistics XX ## Most recent run you describe above DATE analyzeChr: Using the following empirical cutoff for the F-statistics YY where XX and YY are the actual numbers of the cutoffs used. I suspect that YY is smaller than XX. If you decide to keep using cutoffType = 'empirical' you might want to consider using cutoffFstat = 0.95 or higher (but still smaller than 1). Or, you can switch back to using cutoffType = 'theoretical' and use a smaller cutoff than 0.05 (if you wanted less candidate DERs than before). Leo ADD REPLYlink written 4.6 years ago by Leonardo Collado Torres640 Ah, that's super helpful, thanks. I'm rerunning with cutoffType='empirical' and cutoffFstat=0.95, though given previous discussions with you I am wondering if I should set the Fstat higher (0.99?). Seems like there is no way when running analyzeChr() to determine how many DERs have been identified, and my best approach is to guess at the appropriate Fstat, run it, and if it's taking too long, kill it and try again? Unless I'm missing it, there isn't a good section in the documentation about how to use the cutoffFstat and cutoffType parameters. My statistics background isn't strong enough for me to feel comfortable with these concepts without a little handholding. If you are still looking for documentation advice, I'd add some specific examples about cutoffFstat choices one might make in different situations (what you said in this post was great). Thanks again for your help and if you have ideas about how I can determine the most appropriate cutoffFstat parameter to use, I'd love to hear them. Otherwise, I'll see how it goes -- I'm rerunning now. Jessica ADD REPLYlink written 4.6 years ago by jessica.hekman40 Hi Jessica, I'll revise the docs about cutoffFstat and cutoffType for analyzeChr(). Thanks for the input again. Now, for quickly getting the number of DERs, just run analyzeChr(nPermute = 0). None of the DERs will have p-values, but that's ok as at this point you only want to get the number of DERs and maybe explore how long they are: like range of widths, mean, median, 1st and 3rd quartiles, etc. If you are using cutoffType = 'empirical', then you can know exactly the length (in bp) of all the DERs. It will be (1 - cutoffFstat) times the number of bases that passed your data filtering step. So, if cutoffFstat = 0.95, cutoffType = 'empirical' and 1 million bases passed the data cutoff of filterData(cutoff = 2), the total number of bases in DERs will be (1 - 0.95) 1000000 = 0.05 * 1000000 = 50000. Whether those 50 thousand bases are just one DER (a very long one) or up to 50000 1bp DERs is a different thing. Finally, note that cutoffType = 'empirical' is chromosome specific. If you wanted to use the same cutoff, you could run it with one chromosome (generally the largest, and that's generally chromosome 1). Then get the actual value (say ZZ) from the logs and use it on the next chromosomes by using analyzeChr(cutoffType = 'manual', cutoffFstat = ZZ) Leo ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Leonardo Collado Torres640 I just edited my previous comment to clarify things and added an example. There was a mistake also on the original version. ADD REPLYlink written 4.6 years ago by Leonardo Collado Torres640 I used your suggestions and derfinder ran perfectly and completed in just a few hours. No splitting among processors needed. Perfect. The actual code I used to get the Fstatcutoff I chose (might be useful to add to the docs): library(BiocParallel) currentChrom = "chr1" results <- analyzeChr(chr = currentChrom, filteredCov[[currentChrom]], models, mc.cores=cores, groupInfo = groups, cutoffFstat = 0.95, cutoffType = 'empirical', nPermute = 0, BPPARAM.custom = MulticoreParam(workers = cores), verbose = TRUE, runAnnotation = FALSE, returnOutput = TRUE) # "Using the following empirical cutoff for the F-statistics 4.57554352333578" library(GenomicRanges) reg <- results$regions$regions length(reg) # 29461 summary(width(reg)) # Lengths of the ranges we found: # Min. 1st Qu. Median Mean 3rd Qu. Max. # 1.00 2.00 5.00 13.17 14.00 412.00 length(reg[reg$value > 10])
# 595

About 600 regions for the longest chromosome sounded good to me so I set the Fstatcutoff at 10 and the cutoffType='manual' and things went well. I got about 5000 significant regions in the end, which is quite enough to get me started with future work.

Next steps are working on annotation (which will probably be a bit of a bear), and graphical visualization of the data with derfinderPlot, so you may yet hear more from me, sorry :)

Jessica

How are things going with your analysis?

Note that I just corrected a bug (fixed in versions 1.0.10 and 1.1.14 for release and devel respectively) as described in https://github.com/lcolladotor/derfinder/commit/30ede08171cc97c98b4a8dd6db0e27d50dd0f587

Have you tried running this code again? Doesn't seem like any error related to derfinder. It's an error related to the use of BiocParallel::bplapply(someDerfinderFunction, BPPARAM = BiocParallel::SnowParam(workers = 15) ) The error is saying that the SNOW cluster failed. So maybe there was a problem in the node/computer you were running this. If you can reproduce the problem, it'll be worth asking help from the BiocParallel maintainers. In any case, if you get the error again please include the output from traceback() as it might have some other info on what went wrong.

----

On another note, why are you using sampleDepth() with a fullCoverage() result from just chr38? Ideally you should use it with the information from all chrs to estimate the library sizes. Then, you can use the same output from makeModels() for all your chrs and don't have to re-run this for each.