Search
Question: derfinder mean Expressed Region cutoff
0
8 months ago by
Germany

I am running derfinder to ultimately find DER in a time course but despite the very good documentation I have some doubts about how set the cutoff to find expressed regions. In the 2017 NAR paper, it is stated that an average across all regions and samples is used as a cut-off (Finding ERs):

Here, we focus on a new approach based on the bump-hunting methodology for region level genomic analysis (33) (Figure 2). This approach first calculates ERs across the set of observed samples. For each base, the average, potentially library size-adjusted, coverage is calculated across all samples in the data set. This generates a vector of (normalized) mean level expression measurements across the genome. Then an average-coverage cutoff is applied to this mean coverage vector to identify bases that show minimum levels of expression. An expressed region is any contiguous set of bases that has expression above the mean expression cutoff.

However, in the manual I could not find any indication of this being used, and all examples use a hard cutoff for base coverage. I also looked into help page of filterData but I am still in the dark.

Two questions:

1. Is this filtering option using calculated mean coverage implemented? I could of course calculate that myself and feed it to fullCoverage or regionMatrix but that could be re-inventing the wheel.
2. If it turns out I am misunderstanding the above paper paragraph, any tips on selecting a reasonable cut-off? I would usually create a distribution of random bases (sampling n times) and then use the median as a threshold.

Unrelated to the main points, the data are normalized bigwigs, generated from paired-end sequencing. Which read-length should bed used in regionMatrix? Actual read length or something else? derfinder on bacteria PE strand specific RNA-seq is somewhat related but I am not interested in finding expressed regions for each individual strand (at least not at the moment).

Cheers.

> sessionInfo()

R version 3.4.1 (2017-06-30)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default

BLAS: /usr/lib/atlas-base/libf77blas.so.3.0

LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:

[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C

[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8

[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8

[7] LC_PAPER=en_US.UTF-8       LC_NAME=C

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:

[1] parallel  stats4    stats     graphics  grDevices utils     datasets

[8] methods   base

other attached packages:

[1] DESeq2_1.18.1              SummarizedExperiment_1.8.0

[3] DelayedArray_0.4.1         matrixStats_0.52.2

[5] Biobase_2.38.0             GenomicRanges_1.30.0

[7] GenomeInfoDb_1.14.0        IRanges_2.12.0

[9] S4Vectors_0.16.0           BiocGenerics_0.24.0

[11] derfinder_1.12.0           fortunes_1.5-4

loaded via a namespace (and not attached):

[1] RMySQL_0.10.13           bit64_0.9-7              splines_3.4.1

[4] foreach_1.4.3            GenomicFiles_1.14.0      Formula_1.2-2

[7] bumphunter_1.20.0        assertthat_0.2.0         latticeExtra_0.6-28

[10] doRNG_1.6.6              blob_1.1.0               BSgenome_1.46.0

[13] GenomeInfoDbData_0.99.1  Rsamtools_1.30.0         progress_1.1.2

[16] RSQLite_2.0              backports_1.1.1          lattice_0.20-35

[19] digest_0.6.12            RColorBrewer_1.1-2       XVector_0.18.0

[22] checkmate_1.8.5          qvalue_2.10.0            colorspace_1.3-2

[25] htmltools_0.3.6          Matrix_1.2-12            plyr_1.8.4

[28] devtools_1.13.4          XML_3.98-1.9             biomaRt_2.34.0

[31] genefilter_1.60.0        zlibbioc_1.24.0          xtable_1.8-2

[34] scales_0.5.0             BiocParallel_1.12.0      annotate_1.56.1

[37] htmlTable_1.9            tibble_1.3.4             pkgmaker_0.22

[40] ggplot2_2.2.1            withr_2.1.0              GenomicFeatures_1.30.0

[43] nnet_7.3-12              lazyeval_0.2.1           survival_2.41-3

[46] magrittr_1.5             memoise_1.1.0            foreign_0.8-69

[49] registry_0.3             tools_3.4.1              data.table_1.10.4-3

[52] prettyunits_1.0.2        stringr_1.2.0            locfit_1.5-9.1

[55] munsell_0.4.3            rngtools_1.2.4           cluster_2.0.6

[58] AnnotationDbi_1.40.0     Biostrings_2.46.0        compiler_3.4.1

[61] rlang_0.1.4              grid_3.4.1               RCurl_1.95-4.8

[64] rstudioapi_0.7           iterators_1.0.8          VariantAnnotation_1.24.1

[67] htmlwidgets_0.9          bitops_1.0-6             base64enc_0.1-3

[70] derfinderHelper_1.12.0   gtable_0.2.0             codetools_0.2-15

[73] DBI_0.7                  reshape2_1.4.2           R6_2.2.2

[76] GenomicAlignments_1.14.0 gridExtra_2.3            knitr_1.17

[79] rtracklayer_1.38.0       bit_1.1-12               Hmisc_4.0-3

[82] stringi_1.1.6            Rcpp_0.12.13             geneplotter_1.56.0

[85] rpart_4.1-11             acepack_1.4.1

modified 8 months ago by Leonardo Collado Torres610 • written 8 months ago by António Miguel de Jesus Domingues390
2
8 months ago by
United States

Hi,

Maybe the wording in the paper could have been clearer. Once you have a mean coverage vector (in reality a Rle object), we use a single global cutoff for finding the expressed regions. Just like in Figure 2A of the derfinder paper https://academic.oup.com/nar/article/45/2/e9/2953306

For choosing the global cutoff, you might want to make plots like in Supplementary Figure 4 of the derfinder paper. None of the derfinder functions "find" a cutoff for you to use, simply the default is 5.

As for your second question, if you data is already library-size adjusted (scaled) and you don't want regionMatrix() to do anything to it set totalMapped to NULL and leave L = 1. Sounds to me that your case is kind of similar to the one described in https://f1000research.com/articles/6-1558/v1. In particular, in equation 1 your BigWigs are already scaled, so you just need to calculate sum of coverage / read length (where read length could be 100). In recount's case, the sample bigwigs are not scaled.

Best,

Leonardo