derfinder gets stuck in calculatePvalues

Entering edit mode

drramki.chop • 0

@drramkichop-8164

Last seen 7.2 years ago

United States

I am using 'derfinder' to find differentially expressed regions using RNA-seq data. The tool works until some point and gets stuck after calculating p-values. Some of the intermediate files are created but it doesn't finish the analysis. Here is the log:

R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Error in tools:::httpdPort <= 0L :
comparison (4) is possible only for atomic and list types
> library(derfinder)
Find out what's changed in derfinder with
news(Version == "1.2.0", package = "derfinder")
> library(GenomicRanges)

> library(BiocParallel)

> setwd("~/Research/Ring/rnaSeq/bams/");
>
> files <- Sys.glob(file.path(".","*chr20*.bam"))
> fullCov <- fullCoverage(files=files,chrs='20')
2015-06-12 14:07:26 fullCoverage: processing chromosome 20
2015-06-12 14:07:26 loadCoverage: finding chromosome lengths
2015-06-12 14:07:26 loadCoverage: loading BAM file ./E11-1.chr20.sorted.bam
2015-06-12 14:07:31 loadCoverage: loading BAM file ./E11-2.chr20.sorted.bam
2015-06-12 14:07:33 loadCoverage: loading BAM file ./E8-1.chr20.sorted.bam
2015-06-12 14:07:36 loadCoverage: loading BAM file ./E8-2.chr20.sorted.bam
2015-06-12 14:07:39 loadCoverage: applying the cutoff to the merged data
2015-06-12 14:07:39 filterData: originally there were 63025520 rows, now there are 63025520 rows. Meaning that 0 percent was filtered.
extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
> filteredCov <- lapply(fullCov, filterData, cutoff = 2)
2015-06-12 14:07:40 filterData: originally there were 63025520 rows, now there are 3580364 rows. Meaning that 94.32 percent was filtered.
> sampleDepths <- sampleDepth(collapseFullCoverage(fullCov), 1)
2015-06-12 14:07:40 sampleDepth: Calculating sample quantiles
2015-06-12 14:07:40 sampleDepth: Calculating sample adjustments
> pheno <- data.frame(group = c("Ring","Ring","WT","WT"))
>
> models <- makeModels(sampleDepths, testvars = pheno$group)
> originalWd <- getwd()
> setwd(file.path(originalWd, 'analysisResults'))
>
> results <- analyzeChr(chr = '20', filteredCov$chr20, models, cutoffType='manual', mc.cores=6, method = 'regular', runAnnotation = 'false', BPPARAM.custom = MulticoreParam(workers = 6),
+ groupInfo = pheno$group, writeOutput = FALSE, cutoffFstat = 5e-02, nPermute = 5, seeds = 20140923 + seq_len(5), returnOutput = TRUE)
extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
2015-06-12 14:09:07 analyzeChr: Pre-processing the coverage data
2015-06-12 14:09:18 analyzeChr: Calculating statistics
2015-06-12 14:09:18 calculateStats: calculating the F-statistics
2015-06-12 14:09:21 analyzeChr: Calculating pvalues
2015-06-12 14:09:21 analyzeChr: Using the following manual cutoff for the F-statistics 0.05
2015-06-12 14:09:21 calculatePvalues: identifying data segments
2015-06-12 14:09:21 findRegions: segmenting F-stats information
2015-06-12 14:09:21 findRegions: identifying candidate regions
2015-06-12 14:09:21 findRegions: identifying region clusters
2015-06-12 14:09:48 calculatePvalues: calculating F-statistics for permutation 1 and seed 20140924
2015-06-12 14:09:51 findRegions: segmenting F-stats information
2015-06-12 14:09:51 findRegions: identifying candidate regions
2015-06-12 14:09:51 calculatePvalues: calculating F-statistics for permutation 2 and seed 20140925
2015-06-12 14:09:53 findRegions: segmenting F-stats information
2015-06-12 14:09:53 findRegions: identifying candidate regions
2015-06-12 14:09:54 calculatePvalues: calculating F-statistics for permutation 3 and seed 20140926
2015-06-12 14:09:56 findRegions: segmenting F-stats information
2015-06-12 14:09:56 findRegions: identifying candidate regions
2015-06-12 14:09:56 calculatePvalues: calculating F-statistics for permutation 4 and seed 20140927
2015-06-12 14:09:58 findRegions: segmenting F-stats information
2015-06-12 14:09:59 findRegions: identifying candidate regions
2015-06-12 14:09:59 calculatePvalues: calculating F-statistics for permutation 5 and seed 20140928
2015-06-12 14:10:01 findRegions: segmenting F-stats information
2015-06-12 14:10:01 findRegions: identifying candidate regions
2015-06-12 14:10:01 calculatePvalues: calculating the p-values

It just gets stuck here. Any suggestions to fix?

derfinder software error rnaseq • 2.5k views

ADD COMMENT • link 9.7 years ago • updated 9.6 years ago drramki.chop • 0

Entering edit mode

Hi,

Can you please provide the session information? Thanks!

options(width = 120)
devtools::session_info()

Best,

Leonardo

ADD REPLY • link 9.7 years ago Leonardo Collado Torres ★ 1.1k

Entering edit mode

Leonardo Collado Torres ★ 1.1k

@lcolladotor

Last seen 21 days ago

United States

Hi,

This issue has been addressed in derfinder Bioconductor-release version 1.2.1 (and Bioconductor-devel version 1.3.1) as explained in further detail at https://github.com/lcolladotor/derfinder/issues/29.

The user has not gotten back to me with the session information they had, but it's fair to assume that they were not using this version given that it has only been available from biocLite() since today.

I originally asked for the session information for completeness and also because the above github issue got resolved 5 days ago, 1 day before the user posted this question. The delay in derfinder 1.2.1's availability via biocLite() happened because the derfinder Bioconductor-release git-svn bridge was not properly setup, my bad.

Best,

Leonardo

ADD COMMENT • link 9.7 years ago Leonardo Collado Torres ★ 1.1k

Entering edit mode

drramki.chop • 0

@drramkichop-8164

Last seen 7.2 years ago

United States

Sorry for not providing the session info. However, I upgraded the package to the latest version and it is working fine.

Thanks for the support.

ADD COMMENT • link 9.6 years ago drramki.chop • 0

Entering edit mode

No problem and I'm glad that it's working for you now.

ADD REPLY • link 9.6 years ago Leonardo Collado Torres ★ 1.1k

Entering edit mode

I am looking to find contiguous regions that are differentially expressed b/w cases and controls. derfinder finds 33k+ in one chromosome (chr 20 in his case) regions that are really short and of course, clusters (most of them are same as what we find in our DESeq/edgeR experiments as expected). When I use plotClusters, the plot's title has p-values for clusters and is there a way to extract co-ordinates for clusters and the respective p-values?

ADD REPLY • link 9.6 years ago drramki.chop • 0

Entering edit mode

Hi,

This is really a new question instead of a comment. In the future, I encourage you to post new questions as such so other users will be able to find it and hopefully take advantage of the answer(s).

derfinderPlot::plotCluster() uses the p-value from the selected region inside the cluster for the plot title as shown in https://github.com/leekgroup/derfinderPlot/blob/master/R/plotCluster.R#L135

What you are asking for is not something we have developed code for. That is, separate the DERs into clusters and calculate cluster specific p-values. Before going down this route, consider using a lower cutoff, so your small regions will get merged into longer ones. You should also carefully choose the distance until which you consider neighboring DERs to be part of the same cluster given your biological problem at hand.

If you still want to go down the route of merging DERs, I wrote you some code. In the first section, I separate DERs into clusters and store them in a GRangesList object. To get a cluster level p-value, you can summarize the region level p-values. I haven't given much thought into this, so please think it over before using the code as is.

On the second section, I use the information from the null regions to get null areas for clusters with the same number of regions as those observed. For doing so, I ignore the distance between null regions since `derfinder` currently does not save that piece of information. Maybe it should.

Let me know what you think.

Here's the code and output

	> ## See https://support.bioconductor.org/p/68709 for context
	>
	> ## Required packages
	> library('derfinder')
	Warning message:
	replacing previous import by ‘scales::alpha’ when loading ‘Hmisc’
	> library('GenomicRanges')
	Loading required package: BiocGenerics
	Loading required package: parallel

	Attaching package: ‘BiocGenerics’

	The following objects are masked from ‘package:parallel’:

	clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap,
	parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB

	The following object is masked from ‘package:stats’:

	xtabs

	The following objects are masked from ‘package:base’:

	anyDuplicated, append, as.data.frame, as.vector, cbind, colnames, do.call,
	duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted,
	lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
	Position, rank, rbind, Reduce, rep.int, rownames, sapply, setdiff, sort, table,
	tapply, union, unique, unlist, unsplit

	Loading required package: S4Vectors
	Loading required package: stats4
	Loading required package: IRanges
	Loading required package: GenomeInfoDb
	> library('devtools')
	>
	> #### First section
	>
	> ## Get regions
	> example('analyzeChr', 'derfinder')

	anlyzC> ## Collapse the coverage information
	anlyzC> collapsedFull <- collapseFullCoverage(list(genomeData$coverage),
	anlyzC+ verbose = TRUE)
	2015-07-10 12:32:54 collapseFullCoverage: Sorting fullCov
	2015-07-10 12:32:54 collapseFullCoverage: Collapsing chromosomes information by sample

	anlyzC> ## Calculate library size adjustments
	anlyzC> sampleDepths <- sampleDepth(collapsedFull, probs = c(0.5), nonzero=TRUE,
	anlyzC+ verbose=TRUE)
	2015-07-10 12:32:54 sampleDepth: Calculating sample quantiles
	2015-07-10 12:32:54 sampleDepth: Calculating sample adjustments

	anlyzC> ## Build the models
	anlyzC> groupInfo <- genomeInfo$pop

	anlyzC> adjustvars <- data.frame(genomeInfo$gender)

	anlyzC> models <- makeModels(sampleDepths, testvars=groupInfo, adjustvars=adjustvars)

	anlyzC> ## Analyze the chromosome
	anlyzC> results <- analyzeChr(chr='21', coverageInfo=genomeData, models=models,
	anlyzC+ cutoffFstat=1, cutoffType='manual', groupInfo=groupInfo, mc.cores=1,
	anlyzC+ writeOutput=FALSE, returnOutput=TRUE, method='regular',
	anlyzC+ runAnnotation = FALSE)
	extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
	2015-07-10 12:32:54 analyzeChr: Pre-processing the coverage data
	2015-07-10 12:32:54 analyzeChr: Calculating statistics
	2015-07-10 12:32:54 calculateStats: calculating the F-statistics
	2015-07-10 12:32:54 analyzeChr: Calculating pvalues
	2015-07-10 12:32:54 analyzeChr: Using the following manual cutoff for the F-statistics 1
	2015-07-10 12:32:54 calculatePvalues: identifying data segments
	2015-07-10 12:32:54 findRegions: segmenting F-stats information
	2015-07-10 12:32:54 findRegions: identifying candidate regions
	2015-07-10 12:32:54 findRegions: identifying region clusters
	2015-07-10 12:32:54 calculatePvalues: calculating F-statistics for permutation 1 and seed 20150711
	2015-07-10 12:32:54 findRegions: segmenting F-stats information
	2015-07-10 12:32:54 findRegions: identifying candidate regions
	2015-07-10 12:32:55 calculatePvalues: calculating the p-values
	2015-07-10 12:32:55 analyzeChr: Annotating regions

	anlyzC> names(results)
	[1] "timeinfo" "optionsStats" "coveragePrep" "fstats" "regions" "annotation"
	> regs <- results$regions$regions
	>
	> ## Note that all of them are from the same chromosome.
	> ## If you have data for more than one chromsome, then you can split the
	> ## GRanges object into a list, with one element per chromosome.
	> table(seqnames(regs))

	chr21
	33
	>
	> ## Split regions by cluster
	> clus <- endoapply(split(regs, regs$cluster), identity)
	>
	> ## Calculate the total area for each cluster
	> mcols(clus)$area <- sapply(clus, function(x) { sum(x$area)})
	>
	>
	> ## Order clusters by area
	> clus <- clus[order(mcols(clus)$area, decreasing = TRUE)]
	>
	> ## Check the first two clusters by area
	> clus[1:2]
	GRangesList object of length 2:
	$6
	GRanges object with 4 ranges and 14 metadata columns:
	seqnames ranges strand \| value area indexStart indexEnd cluster
	<Rle> <IRanges> <Rle> \| <numeric> <numeric> <integer> <integer> <Rle>
	up chr21 [47411924, 47411986] * \| 7.096704 447.09232 618 680 6
	up chr21 [47412088, 47412131] * \| 3.344261 147.14747 691 734 6
	up chr21 [47412277, 47412312] * \| 3.575484 128.71743 735 770 6
	up chr21 [47412078, 47412083] * \| 2.293818 13.76291 681 686 6
	clusterL meanCoverage meanCEU meanYRI log2FoldChangeYRIvsCEU pvalues significant
	<Rle> <numeric> <numeric> <numeric> <numeric> <numeric> <factor>
	up 389 0.5847414 0.8374906 0.05396825 -3.955890 0.01098901 TRUE
	up 389 0.4464809 0.6590909 0.00000000 -Inf 0.09890110 FALSE
	up 389 0.5053763 0.7460317 0.00000000 -Inf 0.14285714 FALSE
	up 389 0.4139785 0.6031746 0.01666667 -5.177538 0.64835165 FALSE
	qvalues significantQval
	<numeric> <factor>
	up 0.01193762 TRUE
	up 0.05371929 TRUE
	up 0.05968810 TRUE
	up 0.12555084 FALSE

	$3
	GRanges object with 2 ranges and 14 metadata columns:
	seqnames ranges strand \| value area indexStart indexEnd cluster
	up chr21 [47409522, 47409560] * \| 15.053644 587.09211 128 166 3
	up chr21 [47409685, 47409688] * \| 1.215855 4.86342 192 195 3
	clusterL meanCoverage meanCEU meanYRI log2FoldChangeYRIvsCEU pvalues significant
	up 167 0.4152192 0.6129426 0.0 -Inf 0.01098901 TRUE
	up 167 0.4758065 0.6547619 0.1 -2.71097 0.80219780 FALSE
	qvalues significantQval
	up 0.01193762 TRUE
	up 0.14055586 FALSE

	-------
	seqinfo: 1 sequence from an unspecified genome; no seqlengths
	>
	> ## One option to get p-values for clusters is to take the minimum, or mean p-value
	> mcols(clus)$minPval <- sapply(clus, function(x) min(x$pvalues))
	> mcols(clus)$meanPval <- sapply(clus, function(x) mean(x$pvalues))
	> mcols(clus)$maxPval <- sapply(clus, function(x) max(x$pvalues))
	>
	> ## Note that the function you choose to summarize the region level p-values into
	> ## cluster level p-values will matter. Using min() is less conservative than
	> ## using max().
	>
	>
	> #### Second section
	>
	> ## Get the null areas
	> null <- results$regions$nullStats * results$regions$nullWidths
	>
	> ## Get some null clusters with number of elements equal to those observed.
	> ## Note that this code doesn't consider the distance between the null regions.
	> ## So take these numbers with caution.
	> n <- 100
	> set.seed(20150710) ## Choose your own seed
	> index <- sample(seq_len(length(null)), n * sum(elementLengths(clus)), replace = TRUE)
	> groups <- factor(rep(rep(names(clus), elementLengths(clus)), n), levels = names(clus))
	> ## Check that each corresponds to elementLenths(clus) * n
	> table(groups)
	groups
	6 3 10 2 5 9 7 8 4 1 11
	400 200 500 300 600 200 100 100 600 200 100
	>
	> ## Split null areas into resampled groups
	> area_by_clus <- split(null[index], groups)
	>
	> ## Split by number of regions in each cluster, calculate null areas
	> clus_null <- unlist(mapply(function(areas, nRegs, n) {
	+ sapply(split(areas, rep(seq_len(n), each = nRegs)), sum)
	+ }, area_by_clus, elementLengths(clus), MoreArgs = list(n = n), SIMPLIFY = FALSE), use.names = FALSE)
	>
	> ## Calculate (approximated) p-values for clusters
	> mcols(clus)$pvalue <- derfinder:::.calcPval(mcols(clus)$area, clus_null)
	>
	> ## View them all
	> mcols(clus)
	DataFrame with 11 rows and 5 columns
	area minPval meanPval maxPval pvalue
	<numeric> <numeric> <numeric> <numeric> <numeric>
	1 736.7201 0.01098901 0.2252747 0.6483516 0.0009082652
	2 591.9555 0.01098901 0.4065934 0.8021978 0.0081743869
	3 535.0815 0.01098901 0.4549451 0.8021978 0.0145322434
	4 479.5088 0.01098901 0.1868132 0.4065934 0.0281562216
	5 384.6838 0.03296703 0.3992674 0.5824176 0.0708446866
	... ... ... ... ... ...
	7 357.286601 0.01098901 0.01098901 0.01098901 0.09173479
	8 208.429718 0.03296703 0.03296703 0.03296703 0.30336058
	9 108.674290 0.47252747 0.60805861 0.93406593 0.53405995
	10 94.934229 0.31868132 0.39560440 0.47252747 0.60581290
	11 3.997655 0.84615385 0.84615385 0.84615385 0.96457766
	>
	> ## Use p.adjust() or qvalue::qvalue() to control the FDR
	> p.adjust(mcols(clus)$pvalue, method = 'BH')
	[1] 0.009990917 0.044959128 0.053284893 0.077429609 0.141537996 0.141537996 0.144154665
	[8] 0.417120799 0.652739933 0.666394187 0.964577657
	>
	>
	> ## Reproducibility info
	> options(width = 120)
	> session_info()
	Session info -----------------------------------------------------------------------------------------------------------
	setting value
	version R version 3.2.0 Patched (2015-05-18 r68382)
	system x86_64, darwin10.8.0
	ui AQUA
	language (EN)
	collate en_US.UTF-8
	tz America/New_York

	Packages ---------------------------------------------------------------------------------------------------------------
	package * version date source
	acepack 1.3-3.3 2014-11-24 CRAN (R 3.2.0)
	AnnotationDbi 1.31.16 2015-06-10 Bioconductor
	Biobase 2.29.1 2015-05-03 Bioconductor
	BiocGenerics * 0.15.2 2015-06-05 Bioconductor
	BiocParallel 1.3.25 2015-06-12 Bioconductor
	biomaRt 2.25.1 2015-04-21 Bioconductor
	Biostrings 2.37.2 2015-05-07 Bioconductor
	bitops 1.0-6 2013-08-17 CRAN (R 3.2.0)
	bumphunter 1.9.1 2015-05-05 Bioconductor
	cluster 2.0.1 2015-01-31 CRAN (R 3.2.0)
	codetools 0.2-11 2015-03-10 CRAN (R 3.2.0)
	colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0)
	curl 0.9 2015-06-19 CRAN (R 3.2.1)
	DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
	derfinder * 1.3.3 2015-06-15 Bioconductor
	derfinderHelper 1.3.0 2015-04-17 Bioconductor
	devtools * 1.8.0 2015-05-09 CRAN (R 3.2.0)
	digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
	doRNG 1.6 2014-03-07 CRAN (R 3.2.0)
	foreach 1.4.2 2014-04-11 CRAN (R 3.2.0)
	foreign 0.8-63 2015-02-20 CRAN (R 3.2.0)
	Formula 1.2-1 2015-04-07 CRAN (R 3.2.0)
	futile.logger 1.4.1 2015-04-20 CRAN (R 3.2.0)
	futile.options 1.0.0 2010-04-06 CRAN (R 3.2.0)
	GenomeInfoDb * 1.5.7 2015-06-02 Bioconductor
	GenomicAlignments 1.5.9 2015-05-13 Bioconductor
	GenomicFeatures 1.21.13 2015-06-09 Bioconductor
	GenomicFiles 1.5.3 2015-05-13 Bioconductor
	GenomicRanges * 1.21.15 2015-06-09 Bioconductor
	ggplot2 1.0.1.9000 2015-06-12 Github (hadley/ggplot2@3b6a126)
	git2r 0.10.1 2015-05-07 CRAN (R 3.2.0)
	gridExtra 0.9.1 2012-08-09 CRAN (R 3.2.0)
	gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
	Hmisc 3.16-0 2015-04-30 CRAN (R 3.2.0)
	IRanges * 2.3.12 2015-06-16 Bioconductor
	iterators 1.0.7 2014-04-11 CRAN (R 3.2.0)
	lambda.r 1.1.7 2015-03-20 CRAN (R 3.2.0)
	lattice 0.20-31 2015-03-30 CRAN (R 3.2.0)
	latticeExtra 0.6-26 2013-08-15 CRAN (R 3.2.0)
	locfit 1.5-9.1 2013-04-20 CRAN (R 3.2.0)
	magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
	MASS 7.3-40 2015-03-21 CRAN (R 3.2.0)
	Matrix 1.2-1 2015-06-01 CRAN (R 3.2.0)
	matrixStats 0.14.0 2015-02-14 CRAN (R 3.2.0)
	memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
	munsell 0.4.2 2013-07-11 CRAN (R 3.2.0)
	nnet 7.3-9 2015-02-11 CRAN (R 3.2.0)
	pkgmaker 0.22 2014-05-14 CRAN (R 3.2.0)
	plyr 1.8.3 2015-06-12 CRAN (R 3.2.0)
	proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
	qvalue 2.1.0 2015-04-17 Bioconductor
	RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.2.0)
	Rcpp 0.11.6 2015-05-01 CRAN (R 3.2.0)
	RCurl 1.95-4.6 2015-04-24 CRAN (R 3.2.0)
	registry 0.2 2012-01-24 CRAN (R 3.2.0)
	reshape2 1.4.1 2014-12-06 CRAN (R 3.2.0)
	rngtools 1.2.4 2014-03-06 CRAN (R 3.2.0)
	rpart 4.1-9 2015-02-24 CRAN (R 3.2.0)
	Rsamtools 1.21.8 2015-06-03 Bioconductor
	RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
	rtracklayer 1.29.10 2015-06-13 Bioconductor
	rversions 1.0.1 2015-06-06 CRAN (R 3.2.0)
	S4Vectors * 0.7.6 2015-06-16 Bioconductor
	scales 0.2.5 2015-06-12 CRAN (R 3.2.0)
	stringi 0.5-5 2015-06-29 CRAN (R 3.2.1)
	stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
	SummarizedExperiment 0.1.5 2015-05-17 Bioconductor
	survival 2.38-2 2015-06-12 CRAN (R 3.2.0)
	XML 3.98-1.2 2015-05-31 CRAN (R 3.2.0)
	xml2 0.1.1 2015-06-02 CRAN (R 3.2.0)
	xtable 1.7-4 2014-09-12 CRAN (R 3.2.0)
	XVector 0.9.1 2015-04-30 Bioconductor
	zlibbioc 1.15.0 2015-04-17 Bioconductor
	> proc.time()
	user system elapsed
	13.672 0.369 14.119
	> Sys.time()
	[1] "2015-07-10 12:32:59 EDT"
	>

view raw clusterPvalue-output.txt hosted with ❤ by GitHub

	## See https://support.bioconductor.org/p/68709 for context

	## Required packages
	library('derfinder')
	library('GenomicRanges')
	library('devtools')

	#### First section

	## Get regions
	example('analyzeChr', 'derfinder')
	regs <- results$regions$regions

	## Note that all of them are from the same chromosome.
	## If you have data for more than one chromsome, then you can split the
	## GRanges object into a list, with one element per chromosome.
	table(seqnames(regs))

	## Split regions by cluster
	clus <- endoapply(split(regs, regs$cluster), identity)

	## Calculate the total area for each cluster
	mcols(clus)$area <- sapply(clus, function(x) { sum(x$area)})


	## Order clusters by area
	clus <- clus[order(mcols(clus)$area, decreasing = TRUE)]

	## Check the first two clusters by area
	clus[1:2]

	## One option to get p-values for clusters is to take the minimum, or mean p-value
	mcols(clus)$minPval <- sapply(clus, function(x) min(x$pvalues))
	mcols(clus)$meanPval <- sapply(clus, function(x) mean(x$pvalues))
	mcols(clus)$maxPval <- sapply(clus, function(x) max(x$pvalues))

	## Note that the function you choose to summarize the region level p-values into
	## cluster level p-values will matter. Using min() is less conservative than
	## using max().


	#### Second section

	## Get the null areas
	null <- results$regions$nullStats * results$regions$nullWidths

	## Get some null clusters with number of elements equal to those observed.
	## Note that this code doesn't consider the distance between the null regions.
	## So take these numbers with caution.
	n <- 100
	set.seed(20150710) ## Choose your own seed
	index <- sample(seq_len(length(null)), n * sum(elementLengths(clus)), replace = TRUE)
	groups <- factor(rep(rep(names(clus), elementLengths(clus)), n), levels = names(clus))
	## Check that each corresponds to elementLenths(clus) * n
	table(groups)

	## Split null areas into resampled groups
	area_by_clus <- split(null[index], groups)

	## Split by number of regions in each cluster, calculate null areas
	clus_null <- unlist(mapply(function(areas, nRegs, n) {
	sapply(split(areas, rep(seq_len(n), each = nRegs)), sum)
	}, area_by_clus, elementLengths(clus), MoreArgs = list(n = n), SIMPLIFY = FALSE), use.names = FALSE)

	## Calculate (approximated) p-values for clusters
	mcols(clus)$pvalue <- derfinder:::.calcPval(mcols(clus)$area, clus_null)

	## View them all
	mcols(clus)

	## Use p.adjust() or qvalue::qvalue() to control the FDR
	p.adjust(mcols(clus)$pvalue, method = 'BH')


	## Reproducibility info
	options(width = 120)
	session_info()
	proc.time()
	Sys.time()

view raw clusterPvalue.R hosted with ❤ by GitHub

Best,

Leonardo

ADD REPLY • link 9.6 years ago Leonardo Collado Torres ★ 1.1k