universalmotif, question about enrich_motifs function and number of hits returned
1
0
Entering edit mode
@alexandreblais-23753
Last seen 7 weeks ago
France

Edited to include results with renamed shuffled sequences

Hello

We are using universalmotif and the "enrich_motif" function to compare different sets of sequences in their enrichment for certain transcription factor binding motifs. I have two related questions.

While testing the algorithm, I noticed that if I run enrich_motif with background sequences that were shuffled different RNG seed values, the number of motif hits in the TARGET group of sequences is slightly different. I do also see this slight variation in the background set (this was expected) but I was not expecting this to also happen in the target group of sequences since they are the same in each run.


# Get target sequences
original_full_seq <- getSeq(Hsapiens, original_full_GRanges)

# Generate 5 different shuffled versions by just changing the rng.seed value
original_full_seq_shuf1 <- shuffle_sequences(original_full_seq, k=2, rng.seed = 200)
original_full_seq_shuf2 <- shuffle_sequences(original_full_seq, k=2, rng.seed = 201)
original_full_seq_shuf3 <- shuffle_sequences(original_full_seq, k=2, rng.seed = 202)
original_full_seq_shuf4 <- shuffle_sequences(original_full_seq, k=2, rng.seed = 203)
original_full_seq_shuf5 <- shuffle_sequences(original_full_seq, k=2, rng.seed = 204)

# Get motifs from JASPAR (clustered set)
"JASPAR_2022_matrix_clustering_vertebrates_CORE_cluster_root_motifs.tf")

# Run enrichment analysis
Enrich_original_full_seq_shuf1 <-   enrich_motifs(  clustered_motif_db,
original_full_seq,
original_full_seq_shuf1,
max.p = 1,
max.q = 1,
max.e = 2000,
qval.method = "fdr",
threshold = my_tfbsHit_co,
threshold.type = "pvalue",
verbose = 1,
RC = TRUE,
use.freq = 1,
return.scan.results = FALSE,
nthreads = 1, motif_pvalue.k = 8,
use.gaps = TRUE, allow.nonfinite = FALSE, warn.NA = TRUE,
no.overlaps = FALSE, no.overlaps.by.strand = FALSE,
no.overlaps.strat = "score", respect.strand = FALSE)

# repeat the process, but using original_full_seq_shuf2 as the background
# ...and so on with all sets
# also repeat with an additional background set made by concatenating all five shuffled sets
##     this gives "with_shuf12345"
# also repeat, but after giving unique names to each of the shuffled sequences in the background set
##     this gives "with_shuf12345n"
# Combine the results for the top enriched motif and look at the results

>combo_135
motif target.hits target.seq.hits target.seq.count  bkg.hits bkg.seq.hits bkg.seq.count        Pval
<character>   <integer>       <integer>        <integer> <integer>    <integer>     <integer>   <numeric>
with_shuf1      cluster_135         627             228              757       200          147           757 2.43265e-52
with_shuf2      cluster_135         631             230              757       223          145           757 3.58700e-46
with_shuf3      cluster_135         581             215              757       186          139           757 2.02178e-48
with_shuf4      cluster_135         553             203              757       189          123           757 1.53964e-42
with_shuf5      cluster_135         564             207              757       184          129           757 5.52251e-46
with_shuf12345  cluster_135         561             206              757       910          407          3785 4.58477e-46
with_shuf12345n cluster_135         627             228              757      1035          707          3785 3.60288e-50

> sum(combo_135$bkg.hits[1:5]) [1] 982 > sum(combo_135$bkg.seq.hits[1:5])
[1] 683


I am happy to post a more complete reproducible example if needed. So my questions are:

-is it expected to obtain different target hits when the background sequences change?

-is it expected to obtain fewer background sequence hits (407 instead of the expected 683) and background hits (910 instead of 982) when the background sequences are more numerous and the shuffled set has duplicate names?

-is it expect to obtain MORE background sequence hits (707 instead of 683) and background hits (1035 instead of 982) when the sequences do not have duplicate names ?

I would like to understand because ultimately, my goal is to compare motif enrichment in two different target sets of sequences, and I am struggling a bit with deciding what I should use as background:

-set1 against shuffled set1 and set2 against shuffled set2 ?

OR

-set1 and set2 against concatenation of shuffled set1+set2 ?

Alex

> sessionInfo( )
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:

attached base packages:
[1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] magick_2.7.3                      ggtext_0.1.1                      universalmotif_1.10.2             circlize_0.4.13                   RColorBrewer_1.1-2
[6] ComplexHeatmap_2.8.0              EnhancedVolcano_1.10.0            beepr_1.3                         stringr_1.4.0                     reshape2_1.4.4
[11] regioneR_1.24.0                   RUVSeq_1.26.0                     EDASeq_2.26.1                     ShortRead_1.50.0                  BiocParallel_1.26.2
[16] gridExtra_2.3                     ggpubr_0.4.0                      pheatmap_1.0.12                   ggrepel_0.9.1                     ggfortify_0.4.14
[21] tidyr_1.1.4                       ggplot2_3.3.5                     dplyr_1.0.7                       magrittr_2.0.1                    BSgenome.Hsapiens.UCSC.hg38_1.4.3
[26] BSgenome_1.60.0                   rtracklayer_1.52.1                DESeq2_1.32.0                     edgeR_3.34.1                      limma_3.48.3
[31] GenomicAlignments_1.28.0          Rsamtools_2.8.0                   Biostrings_2.60.2                 XVector_0.32.0                    SummarizedExperiment_1.22.0
[36] Biobase_2.52.0                    MatrixGenerics_1.4.3              matrixStats_0.61.0                GenomicRanges_1.44.0              GenomeInfoDb_1.28.4
[41] IRanges_2.26.0                    S4Vectors_0.30.2                  BiocGenerics_0.38.0

loaded via a namespace (and not attached):
[1] backports_1.4.1        aroma.light_3.22.0     systemfonts_1.0.3      BiocFileCache_2.0.0    plyr_1.8.6             splines_4.1.1          digest_0.6.29          foreach_1.5.1
[9] fansi_1.0.2            memoise_2.0.1          cluster_2.1.2          doParallel_1.0.16      annotate_1.70.0        extrafont_0.17         R.utils_2.11.0         extrafontdb_1.0
[17] prettyunits_1.1.1      jpeg_0.1-9             colorspace_2.0-2       blob_1.2.2             rappdirs_0.3.3         textshaping_0.3.6      crayon_1.4.2           RCurl_1.98-1.5
[25] genefilter_1.74.1      iterators_1.0.13       survival_3.2-11        glue_1.6.0             gtable_0.3.0           zlibbioc_1.38.0        GetoptLong_1.0.5       DelayedArray_0.18.0
[33] proj4_1.0-10.1         car_3.0-12             Rttf2pt1_1.3.9         shape_1.4.6            maps_3.4.0             abind_1.4-5            scales_1.1.1           DBI_1.1.2
[41] rstatix_0.7.0          Rcpp_1.0.7             gridtext_0.1.4         xtable_1.8-4           progress_1.2.2         clue_0.3-60            bit_4.0.4              httr_1.4.2
[49] ellipsis_0.3.2         farver_2.1.0           pkgconfig_2.0.3        XML_3.99-0.8           R.methodsS3_1.8.1      dbplyr_2.1.1           locfit_1.5-9.4         utf8_1.2.2
[57] tidyselect_1.1.1       rlang_0.4.12           AnnotationDbi_1.54.1   munsell_0.5.0          tools_4.1.1            cachem_1.0.6           generics_0.1.1         RSQLite_2.2.9
[65] audio_0.1-10           broom_0.7.11           fastmap_1.1.0          ragg_1.2.1             yaml_2.2.1             bit64_4.0.5            purrr_0.3.4            KEGGREST_1.32.0
[73] ash_1.0-15             ggrastr_1.0.1          R.oo_1.24.0            xml2_1.3.3             biomaRt_2.48.3         compiler_4.1.1         beeswarm_0.4.0         filelock_1.0.2
[81] curl_4.3.2             png_0.1-7              ggsignif_0.6.3         tibble_3.1.6           geneplotter_1.70.0     stringi_1.7.6          GenomicFeatures_1.44.2 ggalt_0.4.0
[89] lattice_0.20-44        Matrix_1.3-4           vctrs_0.3.8            pillar_1.6.4           lifecycle_1.0.1        GlobalOptions_0.1.2    bitops_1.0-7           R6_2.5.1
[97] BiocIO_1.2.0           latticeExtra_0.6-29    hwriter_1.3.2          KernSmooth_2.23-20     vipor_0.4.5            codetools_0.2-18       MASS_7.3-54            assertthat_0.2.1
[105] rjson_0.2.20           withr_2.4.3            GenomeInfoDbData_1.2.6 hms_1.1.1              carData_3.0-5          Cairo_1.5-14           ggbeeswarm_0.6.0       restfulr_0.0.13

universalmotif scanning enrich_motifs PWM • 195 views
1
Entering edit mode
btremblay ▴ 60
@btremblay-15941
Last seen 12 weeks ago
Spain

Hi,

After doing some digging I've isolated the issue to being related to the threshold calculation which you request with the threshold.type = "pvalue" option. This is because in versions 1.10 and earlier of the universalmotif package, this calculation was actually an approximation involving randomly subsetting the motif. If you look at motif_pvalue() (the underlying function which calculates the threshold for enrich_motifs()) it has an rng.seed option to control this randomness. I did not think to make this option accessible from enrich_motifs(), but since the default is to simply make a call to sample() you could use set.seed() before calling enrich_motifs() to effectively achieve the same thing. This randomness means there will be slight differences in what hits pass the motif thresholds if run repeatedly without controlling the RNG state. (If you set verbose=3 you will see the computed threshold value being printed.)

If you are interested, starting in version 1.12 I have implemented the dynamic score/p-value calculation method used by the MEME suite (and some others). This calculation method is much faster, more precise and does not involve any randomness (or a need to set motif_pvalue.k). (There is no need to do anything to get this, it is the new default way of calculating p-values.) Hopefully upgrading Bioconductor versions is an option for you, but if not then I would simply recommend setting a seed manually with set.seed().

As an aside, your mention of duplicate names has made me aware of a bug in enrich_motifs() -- the target.seq.hits column will not actually count sequences with duplicated names. This won't affect the enrichment calculation in versions 1.10 and earlier (or using the default settings of version 1.12 and later), but I will push a patch within the next couple of days to correct this. Due to Bioconductor policy this update will not be applicable to the 1.10 series, as only the current release of Bioconductor (which includes 1.12) can receive updates.

Please let me know if you still have any questions regarding this.

1
Entering edit mode

Thank you for the detailed reply. I had indeed noticed the varying thresholds with changing background sets, when using verbose output (verbose=4 is needed to see these values).

I have updated Bioconductor and universalmotif and now the results are more aligned with my initial expectations:

> combo_135[,c(1, 3:9)]
DataFrame with 7 rows and 8 columns
motif target.hits target.seq.hits target.seq.count  bkg.hits bkg.seq.hits bkg.seq.count        Pval
<character>   <integer>       <integer>        <integer> <integer>    <integer>     <integer>   <numeric>
with_shuf1      cluster_135         583             216              757       183          139           757 9.73772e-50
with_shuf2      cluster_135         583             216              757       196          134           757 1.03302e-45
with_shuf3      cluster_135         583             216              757       186          139           757 8.77975e-49
with_shuf4      cluster_135         583             216              757       200          126           757 1.56868e-44
with_shuf5      cluster_135         583             216              757       196          135           757 1.03302e-45
with_shuf12345  cluster_135         583             216              757       961          419          3785 6.40135e-47
with_shuf12345n cluster_135         583             216              757       961          673          3785 6.40135e-47

> sum(combo_135$bkg.hits[1:5]) [1] 961 > sum(combo_135$bkg.seq.hits[1:5])
[1] 673


I think this solves my issue; I am relieved that the number of target hits no longer changes with background.

Alex