derfinder: analyzeChr() breaks on chromosomes > 22
@jessicahekman-6877

Leo -- thanks for your recent help with fullCoverage() in derfinder. After your updates, that function began working very well for me. However, I have discovered that analyzeChr() is still having issues with chromosomes > 22. When I give it a chromosome number in the human range (1-22), it does fine, but once we get up into the higher numbers, it breaks.

 

R 3.1.1
Bioconductor version 3.0 (BiocInstaller 1.16.0)
derfinder 1.0.4
GenomeInfoDb 1.2.2 (version including the Canis familiaris genome update)

 

My script:

library(derfinder)
library(GenomeInfoDb)
library(GenomicRanges)

args <- commandArgs(trailingOnly = TRUE)

perms = 200  # for alpha = 0.05
cores = 15   # too many -> memory problems
#perms = 1
#cores = 25
chroms = c(args[1])

files <- rawFiles(datadir="tophat", samplepatt="tophat-juncs-*")

# Set up our groups
tameIds <- read.table("/home/hekman2/Dropbox/UIUC/transcriptome/hypothalamus/notes/samples-hyp-tame.txt")
tameFileNames <- lapply(tameIds, function(id) { return(paste("tophat-juncs-", id, sep="")) } )
groups <- as.factor(sapply(names(files), function(filename) { ifelse(grepl(filename, tameFileNames), "tame", "aggr") }))

# Initial data analysis over all chromosomes:
print("Starting fullCoverage...")
fullCov <- fullCoverage(files, chrs = chroms, species = 'canis_familiaris')
sampleDepths <- sampleDepth(collapseFullCoverage(fullCov), 1)
models <- makeModels(sampleDepths, testvars = groups)

# Keep only bases with coverage > 2
print("Filtering...")
filteredCov <- lapply(fullCov, filterData, cutoff = 2)

originalWd <- getwd()
setwd(file.path(originalWd, 'derResults'))

# Analyze each chromosome individually (analysis will be parallelized
# and significance will be evaluated by permutations).
for (chrom in chroms) {
  print(chrom)
  currentChrom <- paste("chr", chrom, sep="")
  analyzeChr(chr = currentChrom, filteredCov[[currentChrom]], models, mc.cores=cores, groupInfo = groups, writeOutput = TRUE, cutoffFstat = 0.05, nPermute = perms, seeds = 19731107 + seq_len(perms))
}

print("Chromosome analysis completed.")

The error on chromosomes > 22:

2014-11-02 12:44:36 analyzeChr: Annotating regions
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript contains NAs or out-of-bounds indices
Calls: analyzeChr ... extractROWS -> normalizeSingleBracketSubscript -> NSBS -> NSBS
In addition: Warning message:
In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)
Execution halted

I also noticed this output message during execution of analyzeChr():

nearestgene: loading bumphunter hg19 transcript database

I'm guessing we should not be using hg19. In fact, this makes me wonder whether the lower-numbered chromosomes that completed correctly have valid results or whether they need to be re-run against a different transcript database.

Any help you can give is much appreciated. I suspect the only solution will involve you editing the code again; if that's true, I'm sorry!

 

Jessica

 

 

You might also want to check how to create a `TxDb.*` package (or at least the object) for Canis familiaris, as I don't see one at http://www.bioconductor.org/packages/release/BiocViews.html#___AnnotationData. See http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf for more details.
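If it helps, here is a minimal sketch of building one yourself with GenomicFeatures (my assumptions: the UCSC canFam3 refGene table is the annotation you want, and your GenomicFeatures version provides makeTxDbFromUCSC(); older versions call it makeTranscriptDbFromUCSC()):

library(GenomicFeatures)

## Build a TxDb object from the UCSC canFam3 refGene track
canFamTxDb <- makeTxDbFromUCSC(genome = "canFam3", tablename = "refGene")

## Save it so you don't have to re-download it every session
AnnotationDbi::saveDb(canFamTxDb, file = "TxDb.canFam3.refGene.sqlite")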


Sure. How would I use such an object (I'm not sure where it fits into my workflow)?


I think that bumphunter::annotateNearest() could in principle use `TxDb` objects. If that's true, then you'll need one for Canis familiaris.

But I'm not sure whether that first sentence is actually true.

@lcolladotor

Hi Jessica,

There are two different issues here. First, I had not updated the BioC release version of `derfinder` to match the changes in GenomeInfoDb 1.2.1. That has been fixed already. Basically, I wasn't sure if they had merged the changes from GenomeInfoDb BioC-devel version 1.3.2 into BioC-release.

The second issue is that `derfinder` relies on `bumphunter` http://www.bioconductor.org/packages/release/bioc/html/bumphunter.html to annotate the candidate DERs. `derfinder` uses `bumphunter::annotateNearest()` and this function currently only supports Homo sapiens build hg19. You will need to generate your own annotation object for Canis familiaris and it should look similar to what you get running:

> library(bumphunter)
> data(TT, package = "bumphunter", envir = environment())
> TT

I highly recommend creating a new question with the `bumphunter` tag on how to proceed with it. Meanwhile, you can still use `derfinder` to find the DERs by using `analyzeChr(runAnnotation = FALSE)`. Note that your previous results for chrs 1 to 19 were annotated against Homo sapiens because the defaults for `analyzeChr()` are `subject = 'hg19', runAnnotation = TRUE`, so the annotation results will not make sense. You don't have to re-run those chromosomes; simply remove the annotation information.

 

The following code works. Note that I used some artificial models and cutoffs for the toy example. The BAM and BAI files used are the ones you previously posted.

I used derfinder 1.0.6, derfinderHelper 1.0.3, and bumphunter 1.6.0.

Best,
Leonardo


Hi,

I created https://github.com/ririzarr/bumphunter/issues/1 regarding using bumphunter::annotateNearest() with genomes other than hg19.

Cheers,

Leo

@jessicahekman-6877

Great, thank you. I have posted a request for bumphunter help -- see Using bumphunter with non-human genomes.

I too was surprised to not get an email update, because I had set my options to request one -- but I'm getting email updates now so it's all good :)

I'm checking to make sure that the code you suggested actually does what it should on my end with the real BAMs. When it does, I'll accept this answer.

Thanks,

Jessica


OK, this is perplexing. Your code works fine for me. Then I modified it to run on my actual BAM files, not the little sample:

library('derfinder')

## Set global species option
options(species = 'canis_familiaris')

files <- rawFiles(datadir="tophat", samplepatt="tophat-juncs-*")

# Set up our groups
tameIds <- read.table("/home/hekman2/Dropbox/UIUC/transcriptome/hypothalamus/notes/samples-hyp-tame.txt")
tameFileNames <- lapply(tameIds, function(id) { return(paste("tophat-juncs-", id, sep="")) } )
groups <- as.factor(sapply(names(files), function(filename) { ifelse(grepl(filename, tameFileNames), "tame", "aggr") }))

fullCov <- fullCoverage(files, chrs = c(1))


And I get this error (after loading all the BAM files...)

2014-11-10 12:16:18 loadCoverage: loading BAM file tophat/tophat-juncs-9_129/accepted_hits.bam
2014-11-10 12:16:20 loadCoverage: applying the cutoff to the merged data
Error: 1 errors; first error:
  Error in mapSeqlevels(chr, "UCSC"): 'seqnames' must be a character vector

For more information, use bplasterror(). To resume calculation, re-call
  the function and set the argument 'BPRESUME' to TRUE or wrap the
  previous call in bpresume().

First traceback:
  18: fullCoverage(files, chrs = c(1))
  17: bplapply(seq_len(length(chrs)), loadChr, files = files, chrs = chrs,
          bai = bai, chrlens = chrlens, outputs = outputs, cutoff = cutoff,
          mc.cores.load = mc.cores.load, ..., BPPARAM = BPPARAM)
  16: bplapply(seq_len(length(chrs)), loadChr, files = files, chrs = chrs,
          bai = bai, chrlens = chrlens, outputs = outputs, cutoff = cutoff,
          mc.cores.load = mc.cores.load, ..., BPPARAM = BPPARAM)
  15: lapply(X, FUN, ...)
  14: lapply(X, FUN, ...)
  13: FUN(1L[[1L]], ...)
  12: .try(FUN(...))
  11: withRestarts(withCallingHandlers(expr, warning = handler_warning,
          error = handler_error), abort = handler_abort)

 

However, when I run it with the last line like so:

fullCov <- fullCoverage(bam, chrs = c(1:38, 'X'))


it completes without error, even though the only difference I can see is that I'm running it on more than one chromosome. (Derfinder is 1.0.6.)

I'm baffled!

Jessica


Hi Jessica,

The answer is simple in this case. When you are using:

fullCov <- fullCoverage(files, chrs = c(1))

the `chrs` argument is a numeric vector, not a character vector. See:

> x <- 1
> class(x)
[1] "numeric"
> y <- c(1:38, 'X')
> class(y)
[1] "character"
> z <- '1'
> class(z)
[1] "character"

So use:

fullCov <- fullCoverage(files, chrs = '1')

I made a small change in version 1.0.7 (devel 1.1.10) so the error message will be more informative.
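For reference, here is a sketch reproducing the check that fails (assuming GenomeInfoDb's mapSeqlevels() behaves here as in the 1.2.x series):

library(GenomeInfoDb)

## A character seqname is fine; it maps to the UCSC name "chr1"
mapSeqlevels('1', 'UCSC')

## This is essentially the failing call inside fullCoverage():
## mapSeqlevels(1, 'UCSC')  # error: 'seqnames' must be a character vector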

Cheers,

Leo


Worked! Got another step further, then hit a new confusing error.

> fullCov <- fullCoverage(files, chrs = '38')
> sampleDepths <- sampleDepth(collapseFullCoverage(fullCov), 1)
> models <- makeModels(sampleDepths, testvars = groups)
> filteredCov <- lapply(fullCov, filterData, cutoff = 2)
> analyzeChr(chr = "chr38", filteredCov[["chr38"]], models, mc.cores=15, groupInfo = groups, writeOutput = TRUE, cutoffFstat = 0.5, nPermute = perms, seeds = 19731107 + seq_len(perms))

2014-11-10 16:50:34 analyzeChr: Pre-processing the coverage data
2014-11-10 16:50:56 analyzeChr: Calculating statistics

2014-11-10 16:45:38 calculateStats: calculating the F-statistics
starting worker pid=3131 on localhost:11928 at 16:45:38.358
Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
  cannot open the connection

Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster -> socketConnection
In addition: Warning message:
In socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
  localhost:11928 cannot be opened
Execution halted

Help?

Jessica


See if you can reproduce the problem on that same node/computer running:

library(BiocParallel)
xx <- bplapply(seq_len(1000), function(x) Sys.time(), BPPARAM = SnowParam(15))

My guess is that it'll work fine unless there really is an issue with your node/computer. You can try a larger number than 1000 if you want, as this test runs fast (~7 secs).
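If it hangs, it might also be worth running the same test with the fork-based backend for comparison (a sketch; MulticoreParam() is the other standard BiocParallel backend on Linux/Mac):

xx <- bplapply(seq_len(1000), function(x) Sys.time(), BPPARAM = MulticoreParam(15))

If the MulticoreParam() version works while the SnowParam() one hangs, the problem is likely with opening local sockets rather than with BiocParallel itself.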


The error still happens this morning. I tried your suggested code; it's hung for several minutes now with no feedback, which I presume is an error state given that it completed in ~7 seconds for you and I'm running it on a fast machine. I wonder if there were firewall changes made on the machine recently -- do you think that could cause these symptoms? If you don't know more, I'll ask the BiocParallel folks.

> traceback()
13: socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,
        open = "a+b", timeout = timeout)
12: newPSOCKnode(names[[i]], options = options, rank = i)
11: makePSOCKcluster(spec, ...)
10: (function (spec, type = getClusterOption("type"), ...)
    {
        switch(type, PSOCK = makePSOCKcluster(spec, ...), FORK = makeForkCluster(spec,
            ...), SOCK = snow::makeSOCKcluster(spec, ...), MPI = snow::makeMPIcluster(spec,
            ...), NWS = snow::makeNWScluster(spec, ...), stop("unknown cluster type"))
    })(spec = 5, type = "PSOCK", outfile = "")
9: do.call(makeCluster, x$.clusterargs)
8: bpstart(BPPARAM)
7: bpstart(BPPARAM)
6: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, BPRESUME = BPRESUME,
       BPPARAM = BPPARAM)
5: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, BPRESUME = BPRESUME,
       BPPARAM = BPPARAM)
4: bplapply(mclapplyIndex, fstats.apply, data = coverageProcessed,
       mod = models$mod, mod0 = models$mod0, lowMemDir = lowMemDir,
       scalefac = scalefac, method = method, adjustF = adjustF,
       BPPARAM = BPPARAM)
3: bplapply(mclapplyIndex, fstats.apply, data = coverageProcessed,
       mod = models$mod, mod0 = models$mod0, lowMemDir = lowMemDir,
       scalefac = scalefac, method = method, adjustF = adjustF,
       BPPARAM = BPPARAM)
2: calculateStats(coveragePrep = prep, models = models, lowMemDir = lowMemDir,
       ...)
1: analyzeChr(chr = "chr1", filteredCov[["chr1"]], models, mc.cores = 5,
       groupInfo = groups, writeOutput = TRUE, cutoffFstat = 0.5,
       nPermute = 2, seeds = 19731107 + seq_len(2))

On that other note, I didn't understand what sampleDepth() was doing. Because running derfinder on even one chromosome takes 4-5 hours with 200 permutations, and setting mc.cores=15 doesn't speed it up much, I had settled on running a separate R process for each chromosome and then merging the results at the end with a single process. So: one process running fullCoverage() + analyzeChr() per chromosome, and once those are all done, a new R process to merge the results together. I take it this is actually suboptimal because the sample depths can't be estimated across chromosomes before running analyzeChr() on each one?

 

Jessica


Hi Jessica,

Ask the BiocParallel folks about the error using the short code I gave you that gets the system time 1000 times.

 

As for sampleDepth(), what part of the docs describing it is confusing? That way I can try to improve the docs.

As I said before, you should use the same model for each chromosome and feed sampleDepth() the fullCoverage() output from all the chrs. For processing one chromosome at a time with analyzeChr(), you could simply save each element of your `filteredCov` list object to an Rdata file, then load it before using analyzeChr().
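For example, a sketch of that save/load approach (the file naming is my own choice):

## In the process that ran fullCoverage() and filterData():
for (chr in names(filteredCov)) {
    filteredData <- filteredCov[[chr]]
    save(filteredData, file = paste0('filteredData-', chr, '.Rdata'))
}

## Later, in a separate R process for one chromosome:
load('filteredData-chr1.Rdata')
## analyzeChr(chr = 'chr1', filteredData, models, ...)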


Thanks, I will proceed with the BiocParallel folks.

As for sampleDepth(), I returned to the docs and they're pretty clear: "We can use sampleDepth() and it's helper function collapseFullCoverage() to get a proxy of the library size. Note that you would normally use the unfiltered data from all the chromosomes in this step and not just one." I think the problem is that I set the script up expecting to run it as a single process. When I realized it was going to take about 10 days to complete, and that I could get a lot more bang for my buck by splitting it across multiple processes myself, I split it up -- but I didn't think to change the part of my code that handled sampleDepth().

So if you're looking for ways to improve the docs (I am not complaining, by the way!), I'd say you could explicitly mention that the analysis is going to take a while to run, that you might therefore want to run it in multiple processes simultaneously, and what the best workflow for doing so is (what can be run separately and what must be run all together). Hope that helps and that I'm not just missing something.

Jessica


Digging into the BiocParallel error, I find that the code you gave me runs fine if I remove "BPPARAM = SnowParam(15)". This makes me wonder why the error suddenly appeared on my machine so recently -- did you update something in the most recent derfinder? Maybe you started specifying SnowParam() where you had not previously?

Anyways, see: bplapply() fails when BPPARAM = SnowParam() is specified


If you remove "BPPARAM = SnowParam(15)", you will be using a different type of cluster. Check the output of:

bpparam()

This is further described in the BiocParallel vignette. 

In any case, derfinder has been using SnowParam() for parallel calls for a while now (even before it was released in Bioconductor). So it's not a derfinder issue: either something is wrong with your infrastructure, something changed in BiocParallel, or something changed in snow. Whichever it is, it's something only the developers of BiocParallel can help you with.


Thanks for the input!


BiocParallel people answered and said the problem is with my system's ability to open ports. I've mailed my admin to see if he changed something recently. He is liable to do things like this without telling us :)

However, the BiocParallel people said that "If you're running this is as a parallel job on a single Linux / mac computer, then it's almost always better to use the (default) MulticoreParam." Now, this is clearly my problem and not yours and I'm working on figuring it out on my end, buuuut -- I'm curious why derfinder uses SnowParam and whether it would make sense to use MulticoreParam instead. If you have good reason to use SnowParam, no worries, I'll figure it out on my end, but otherwise, perhaps the easiest path would be switching to MulticoreParam (especially since I imagine other people might have similarly [mis]configured servers and have trouble using derfinder as a result)?

Jessica


Hi Jessica,

The answer is simple. When using MulticoreParam() the memory usage is around 5 to 8 times higher (depending on the number of cores) than when using SnowParam(), as measured on the cluster I have access to. If you still want to use MulticoreParam(), you can do so by using

library('derfinder')
library('BiocParallel')
## '...' stands for your other analyzeChr() arguments
analyzeChr(..., BPPARAM.custom = MulticoreParam(workers = x))

where `x` is the number of cores you want to use. Similarly, specify the `BPPARAM.custom` argument for other functions like fullCoverage(). If the memory usage is too high at the analyzeChr() step, consider using smaller chunks by specifying the `chunksize` argument described in the details of ?preprocessCoverage. If you want to see more about `BPPARAM.custom`, check https://github.com/lcolladotor/derfinder/blob/master/R/utils.R and http://www.bioconductor.org/packages/release/bioc/vignettes/derfinder/inst/doc/derfinderAdvanced.html#Functions_that_use_multiple_cores (I have to fix a typo there: the advanced argument is BPPARAM.custom, not BPPARAM -- the name change helps avoid name clashes).

If using a smaller chunksize doesn't help enough memory-wise, consider using fewer cores with MulticoreParam(). In any case, it's a bit hard to find the perfect balance between memory and computation time; the type of cluster (SnowParam(), MulticoreParam(), etc.), the number of cores, and the chunksize all come into play.
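For example (a sketch; the 100kb chunk size is an arbitrary value I picked, and smaller chunks lower the peak memory per core at the cost of more iterations):

## Hypothetical call: process the filtered coverage in 100kb chunks
analyzeChr(..., chunksize = 1e5, BPPARAM.custom = MulticoreParam(workers = x))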

 

Anyhow, you were able to use SnowParam() successfully until recently, so maybe you can get that to work again with the help of your admin.

Cheers,

Leo


Snow launches independent R processes, each with its own memory. Multicore forks processes that share memory until it is modified. Probably the tools used to assess memory usage are misleading you. In addition, because the snow processes are independent, extra care needs to be taken to ensure that each has all the data required (e.g., loading packages and moving data from the master to the workers), whereas this 'just works' with multicore.

The BiocParallel vignette, section 6, suggests that developers should just invoke bplapply & friends without specifying BPPARAM, allowing the (experienced, as Jessica is) user to register() alternative parallel engines. The developer does need to write the FUN of bplapply as though it were running under a snow-like cluster -- loading required packages and ensuring that additional user data is passed to the process using `...`.
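A quick sketch of that pattern (register() and the backends are standard BiocParallel; the toy FUN is mine):

library(BiocParallel)

## The user registers their preferred backend once...
register(MulticoreParam(workers = 4))

## ...and the developer writes FUN as though it ran on a fresh snow worker:
## load required packages inside FUN and pass extra data through '...'
res <- bplapply(1:4, function(i, offset) {
    library(stats)  # explicit load; needed on a snow worker, harmless here
    i + offset
}, offset = 100)
unlist(res)  # 101 102 103 104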


Regarding the tools used to assess memory usage, that's very likely. But since those are the tools my cluster uses to terminate jobs when memory usage is too high, I have to work under them.

As for `derfinder`, usage of your preferred type of cluster is allowed through the optional `BPPARAM.custom` argument, so there's no problem there. Also, the code is written with a snow-like cluster in mind.


Phew, who knew this would get so complicated.

My admin insists he has not changed the IP ranges on the machine recently and doesn't know why I'd suddenly have trouble accessing ports. So it is perplexing.

However, since I can work around the issue by specifying BPPARAM.custom in derfinder (thank you Leo!), I can move forward -- and it sounds like, since I don't have a cluster, the right answer is for me to specify Multicore anyway.

Thanks to all.

Jessica

 


I just incorporated your feedback (it will be reflected in versions 1.0.8 and 1.1.11 for release and devel, respectively). See https://github.com/lcolladotor/derfinder/commit/b6c9dac6725843ea1039a7846a530d73a883ac71


I'm glad it was helpful!

I just ran analyzeChr() with your suggested parameter settings and, wow, it completed successfully! Onward to revisit the final issue I have open on derfinder on this site, my merge problem -- now that I can get that far in the code again.


OK, so analyzeChr() ran fine with a single test chromosome in the fullCov object. Good.

I then re-ran fullCoverage() on all the chromosomes, saved the results to .RData, and started a new process to run analyzeChr() on a single chromosome using the fullCoverage() object that had all the chromosomes in it, with 200 permutations (in other words, my complete run, not a test). The analyzeChr() script ran nicely for several hours, declared that it had finished the final permutation and was calculating the p-values, and then hung. That was about 15 hours ago. Is it expected for the final p-value calculations to take this long? When I was running analyzeChr() previously it completed in a few hours, nothing close to this.

My initial script to generate the fullCoverage object and other useful data objects:

library(derfinder)

## Set global species option
options(species = 'canis_familiaris')

files <- rawFiles(datadir="tophat", samplepatt="tophat-juncs-*")

# Set up our groups
tameIds <- read.table("/home/hekman2/Dropbox/UIUC/transcriptome/hypothalamus/notes/samples-hyp-tame.txt")
tameFileNames <- lapply(tameIds, function(id) { return(paste("tophat-juncs-", id, sep="")) } )
groups <- as.factor(sapply(names(files), function(filename) { ifelse(grepl(filename, tameFileNames), "tame", "aggr") }))

# Initial data analysis over all chromosomes:
print("Starting fullCoverage...")
fullCov <- fullCoverage(files, chrs = c(1:38, 'X'))
sampleDepths <- sampleDepth(collapseFullCoverage(fullCov), 1)
models <- makeModels(sampleDepths, testvars = groups)

filteredCov <- lapply(fullCov, filterData, cutoff = 2)

 

 

My script to run analyzeChr():

library(derfinder)
library(BiocParallel)

load(".RData")

# Some configuration stuff
perms = 200  # for alpha = 0.05
cores = 15   # too many -> memory problems
## Set global species option
options(species = 'canis_familiaris')

args <- commandArgs(trailingOnly = TRUE)
currentChrom <- paste("chr", args[1], sep="")
# cutoffFstat = 0.5?
print(currentChrom)
analyzeChr(chr = currentChrom, filteredCov[[currentChrom]], models, mc.cores=cores, groupInfo = groups, writeOutput = TRUE, cutoffFstat = 0.5, cutoffType = 'empirical', nPermute = perms, seeds = 19731107 + seq_len(perms), BPPARAM.custom = MulticoreParam(workers = cores), verbose = TRUE, runAnnotation = FALSE)

print("Chromosome analysis completed.")

 

Note that I set my cutoff at 0.5 (I had been using 0.05 previously) -- could that have made the difference? But the test with a single chromosome also used a cutoff of 0.5 and it didn't run this long.

Wish I could stop asking for help every two days but I keep getting stuck :(

Jessica


Hi Jessica,

I believe that you were previously using:

analyzeChr(cutoffFstat = 0.05)

which implied the default `cutoffType = 'theoretical'`. That is,

analyzeChr(cutoffFstat = 0.05, cutoffType = 'theoretical')

In my previous answer in this thread (A: derfinder: analyzeChr() breaks on chromosomes > 22) I used `cutoffType = 'empirical'` simply for illustrative purposes, because I was working with a toy model.

By using

analyzeChr(cutoffFstat = 0.5, cutoffType = 'empirical')

you are saying: take the median of the F-statistics for the chromosome and use that as the F-statistic cutoff. Contiguous bases with F-statistics higher than the median will be considered candidate DERs. Thus, by definition, half of the bases with coverage > 2 in at least one sample (your data cutoff from filterData(cutoff = 2) earlier) will be contained in a candidate DER. That can be a lot of DERs!
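A toy illustration of what 'empirical' means here (my own sketch, not derfinder internals):

## Pretend F-statistics for the filtered bases of one chromosome
set.seed(20141110)
fstats <- rf(10000, df1 = 1, df2 = 20)

quantile(fstats, 0.5)   # cutoffFstat = 0.5: the median; half the bases pass
quantile(fstats, 0.95)  # cutoffFstat = 0.95: only the top 5% of bases pass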

You can compare the actual cutoff values used by inspecting the log files. Look for the lines:

## Earlier run
DATE analyzeChr: Using the following theoretical cutoff for the F-statistics XX
## Most recent run you describe above
DATE analyzeChr: Using the following empirical cutoff for the F-statistics YY

where XX and YY are the actual numbers of the cutoffs used. I suspect that YY is smaller than XX.

If you decide to keep using `cutoffType = 'empirical'` you might want to consider using `cutoffFstat = 0.95` or higher (but still smaller than 1). Or, you can switch back to using `cutoffType = 'theoretical'` with a smaller cutoff than 0.05 (if you want fewer candidate DERs than before).

Leo

 


Ah, that's super helpful, thanks. I'm rerunning with cutoffType='empirical' and cutoffFstat=0.95, though given previous discussions with you I'm wondering if I should set the F-stat cutoff higher (0.99?). It seems there is no way, when running analyzeChr(), to determine how many DERs have been identified, so my best approach is to guess at an appropriate cutoff, run it, and if it's taking too long, kill it and try again?

Unless I'm missing it, there isn't a good section in the documentation about how to use the cutoffFstat and cutoffType parameters. My statistics background isn't strong enough for me to feel comfortable with these concepts without a little handholding. If you are still looking for documentation advice, I'd add some specific examples about cutoffFstat choices one might make in different situations (what you said in this post was great).

Thanks again for your help and if you have ideas about how I can determine the most appropriate cutoffFstat parameter to use, I'd love to hear them. Otherwise, I'll see how it goes -- I'm rerunning now.

Jessica


Hi Jessica,

I'll revise the docs about `cutoffFstat` and `cutoffType` for `analyzeChr()`. Thanks for the input again.

Now, for quickly getting the number of DERs, just run `analyzeChr(nPermute = 0)`. None of the DERs will have p-values, but that's OK: at this point you only want the number of DERs and maybe to explore how long they are (range of widths, mean, median, 1st and 3rd quartiles, etc.).

 

If you are using `cutoffType = 'empirical'`, then you know exactly the total length (in bp) of all the DERs combined. It will be (1 - `cutoffFstat`) times the number of bases that passed your data filtering step. So, if `cutoffFstat = 0.95, cutoffType = 'empirical'` and 1 million bases passed the data cutoff of `filterData(cutoff = 2)`, the total number of bases in DERs will be (1 - 0.95) * 1000000 = 0.05 * 1000000 = 50000. Whether those 50 thousand bases form just one DER (a very long one) or up to 50000 1bp DERs is a different matter.
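In R, the same arithmetic:

filteredBases <- 1e6
cutoffFstat <- 0.95
(1 - cutoffFstat) * filteredBases
## [1] 50000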

Finally, note that the `cutoffType = 'empirical'` cutoff is chromosome-specific. If you want to use the same cutoff for all chromosomes, you could run one chromosome first (generally the largest, which is usually chromosome 1), get the actual value (say ZZ) from the logs, and then use it for the remaining chromosomes with

analyzeChr(cutoffType = 'manual', cutoffFstat = ZZ)

Leo

 


I just edited my previous comment to clarify things and added an example. There was also a mistake in the original version.


I used your suggestions and derfinder ran perfectly and completed in just a few hours. No splitting among processors needed. Perfect.

The actual code I used to pick the F-statistic cutoff (it might be useful to add something like this to the docs):

library(BiocParallel)
currentChrom = "chr1"
results <- analyzeChr(chr = currentChrom, filteredCov[[currentChrom]], models, mc.cores=cores, groupInfo = groups, cutoffFstat = 0.95, cutoffType = 'empirical', nPermute = 0, BPPARAM.custom = MulticoreParam(workers = cores), verbose = TRUE, runAnnotation = FALSE, returnOutput = TRUE)

# "Using the following empirical cutoff for the F-statistics 4.57554352333578"

library(GenomicRanges)
reg <- results$regions$regions
length(reg)
# 29461
summary(width(reg))
# Lengths of the ranges we found:
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   1.00    2.00    5.00   13.17   14.00  412.00

length(reg[reg$value > 10])
# 595


About 600 regions for the longest chromosome sounded good to me, so I set cutoffFstat = 10 with cutoffType = 'manual' and things went well. I got about 5000 significant regions in the end, which is quite enough to get me started on future work.

Next steps are working on annotation (which will probably be a bit of a bear), and graphical visualization of the data with derfinderPlot, so you may yet hear more from me, sorry :)

Jessica


How are things going with your analysis?

Note that I just corrected a bug (fixed in versions 1.0.10 and 1.1.14 for release and devel respectively) as described in https://github.com/lcolladotor/derfinder/commit/30ede08171cc97c98b4a8dd6db0e27d50dd0f587


Have you tried running this code again? It doesn't seem like an error related to derfinder; it's an error related to the use of BiocParallel::bplapply(someDerfinderFunction, BPPARAM = BiocParallel::SnowParam(workers = 15)). The error is saying that the SNOW cluster failed, so maybe there was a problem on the node/computer where you were running this. If you can reproduce the problem, it'll be worth asking the BiocParallel maintainers for help. In any case, if you get the error again, please include the output from traceback() as it might have more info on what went wrong.

 

----

On another note, why are you using sampleDepth() with a fullCoverage() result from just chr38? Ideally you should use it with the information from all the chrs to estimate the library sizes. Then you can use the same output from makeModels() for all your chrs and don't have to re-run it for each one.
