Edited to include results with renamed shuffled sequences
Hello
We are using universalmotif and the "enrich_motif" function to compare different sets of sequences in their enrichment for certain transcription factor binding motifs. I have two related questions.
While testing the algorithm, I noticed that if I run enrich_motif with background sequences that were shuffled different RNG seed values, the number of motif hits in the TARGET group of sequences is slightly different. I do also see this slight variation in the background set (this was expected) but I was not expecting this to also happen in the target group of sequences since they are the same in each run.
# Get target sequences
original_full_seq <- getSeq(Hsapiens, original_full_GRanges)
# Generate 5 different shuffled versions by just changing the rng.seed value
original_full_seq_shuf1 <- shuffle_sequences(original_full_seq, k=2, rng.seed = 200)
original_full_seq_shuf2 <- shuffle_sequences(original_full_seq, k=2, rng.seed = 201)
original_full_seq_shuf3 <- shuffle_sequences(original_full_seq, k=2, rng.seed = 202)
original_full_seq_shuf4 <- shuffle_sequences(original_full_seq, k=2, rng.seed = 203)
original_full_seq_shuf5 <- shuffle_sequences(original_full_seq, k=2, rng.seed = 204)
# Get motifs from JASPAR (clustered set)
download.file( "https://jaspar.genereg.net/static/clustering/2022/vertebrates/CORE/interactive_trees/JASPAR_2022_matrix_clustering_vertebrates_CORE_cluster_root_motifs.tf",
"JASPAR_2022_matrix_clustering_vertebrates_CORE_cluster_root_motifs.tf")
clustered_motif_db <- read_transfac("JASPAR_2022_matrix_clustering_vertebrates_CORE_cluster_root_motifs.tf")
# Run enrichment analysis
Enrich_original_full_seq_shuf1 <- enrich_motifs( clustered_motif_db,
original_full_seq,
original_full_seq_shuf1,
max.p = 1,
max.q = 1,
max.e = 2000,
qval.method = "fdr",
threshold = my_tfbsHit_co,
threshold.type = "pvalue",
verbose = 1,
RC = TRUE,
use.freq = 1,
return.scan.results = FALSE,
nthreads = 1, motif_pvalue.k = 8,
use.gaps = TRUE, allow.nonfinite = FALSE, warn.NA = TRUE,
no.overlaps = FALSE, no.overlaps.by.strand = FALSE,
no.overlaps.strat = "score", respect.strand = FALSE)
# repeat the process, but using original_full_seq_shuf2 as the background
# ...and so on with all sets
# also repeat with an additional background set made by concatenating all five shuffled sets
## this gives "with_shuf12345"
# also repeat, but after giving unique names to each of the shuffled sequences in the background set
## this gives "with_shuf12345n"
# Combine the results for the top enriched motif and look at the results
>combo_135
motif target.hits target.seq.hits target.seq.count bkg.hits bkg.seq.hits bkg.seq.count Pval
<character> <integer> <integer> <integer> <integer> <integer> <integer> <numeric>
with_shuf1 cluster_135 627 228 757 200 147 757 2.43265e-52
with_shuf2 cluster_135 631 230 757 223 145 757 3.58700e-46
with_shuf3 cluster_135 581 215 757 186 139 757 2.02178e-48
with_shuf4 cluster_135 553 203 757 189 123 757 1.53964e-42
with_shuf5 cluster_135 564 207 757 184 129 757 5.52251e-46
with_shuf12345 cluster_135 561 206 757 910 407 3785 4.58477e-46
with_shuf12345n cluster_135 627 228 757 1035 707 3785 3.60288e-50
> sum(combo_135$bkg.hits[1:5])
[1] 982
> sum(combo_135$bkg.seq.hits[1:5])
[1] 683
I am happy to post a more complete reproducible example if needed. So my questions are:
-is it expected to obtain different target hits when the background sequences change?
-is it expected to obtain fewer background sequence hits (407 instead of the expected 683) and background hits (910 instead of 982) when the background sequences are more numerous and the shuffled set has duplicate names?
-is it expect to obtain MORE background sequence hits (707 instead of 683) and background hits (1035 instead of 982) when the sequences do not have duplicate names ?
I would like to understand because ultimately, my goal is to compare motif enrichment in two different target sets of sequences, and I am struggling a bit with deciding what I should use as background:
-set1 against shuffled set1 and set2 against shuffled set2 ?
OR
-set1 and set2 against concatenation of shuffled set1+set2 ?
Thanks in advance for your help and advice.
Alex
> sessionInfo( )
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252 LC_NUMERIC=C LC_TIME=English_Canada.1252
attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] magick_2.7.3 ggtext_0.1.1 universalmotif_1.10.2 circlize_0.4.13 RColorBrewer_1.1-2
[6] ComplexHeatmap_2.8.0 EnhancedVolcano_1.10.0 beepr_1.3 stringr_1.4.0 reshape2_1.4.4
[11] regioneR_1.24.0 RUVSeq_1.26.0 EDASeq_2.26.1 ShortRead_1.50.0 BiocParallel_1.26.2
[16] gridExtra_2.3 ggpubr_0.4.0 pheatmap_1.0.12 ggrepel_0.9.1 ggfortify_0.4.14
[21] tidyr_1.1.4 ggplot2_3.3.5 dplyr_1.0.7 magrittr_2.0.1 BSgenome.Hsapiens.UCSC.hg38_1.4.3
[26] BSgenome_1.60.0 rtracklayer_1.52.1 DESeq2_1.32.0 edgeR_3.34.1 limma_3.48.3
[31] GenomicAlignments_1.28.0 Rsamtools_2.8.0 Biostrings_2.60.2 XVector_0.32.0 SummarizedExperiment_1.22.0
[36] Biobase_2.52.0 MatrixGenerics_1.4.3 matrixStats_0.61.0 GenomicRanges_1.44.0 GenomeInfoDb_1.28.4
[41] IRanges_2.26.0 S4Vectors_0.30.2 BiocGenerics_0.38.0
loaded via a namespace (and not attached):
[1] backports_1.4.1 aroma.light_3.22.0 systemfonts_1.0.3 BiocFileCache_2.0.0 plyr_1.8.6 splines_4.1.1 digest_0.6.29 foreach_1.5.1
[9] fansi_1.0.2 memoise_2.0.1 cluster_2.1.2 doParallel_1.0.16 annotate_1.70.0 extrafont_0.17 R.utils_2.11.0 extrafontdb_1.0
[17] prettyunits_1.1.1 jpeg_0.1-9 colorspace_2.0-2 blob_1.2.2 rappdirs_0.3.3 textshaping_0.3.6 crayon_1.4.2 RCurl_1.98-1.5
[25] genefilter_1.74.1 iterators_1.0.13 survival_3.2-11 glue_1.6.0 gtable_0.3.0 zlibbioc_1.38.0 GetoptLong_1.0.5 DelayedArray_0.18.0
[33] proj4_1.0-10.1 car_3.0-12 Rttf2pt1_1.3.9 shape_1.4.6 maps_3.4.0 abind_1.4-5 scales_1.1.1 DBI_1.1.2
[41] rstatix_0.7.0 Rcpp_1.0.7 gridtext_0.1.4 xtable_1.8-4 progress_1.2.2 clue_0.3-60 bit_4.0.4 httr_1.4.2
[49] ellipsis_0.3.2 farver_2.1.0 pkgconfig_2.0.3 XML_3.99-0.8 R.methodsS3_1.8.1 dbplyr_2.1.1 locfit_1.5-9.4 utf8_1.2.2
[57] tidyselect_1.1.1 rlang_0.4.12 AnnotationDbi_1.54.1 munsell_0.5.0 tools_4.1.1 cachem_1.0.6 generics_0.1.1 RSQLite_2.2.9
[65] audio_0.1-10 broom_0.7.11 fastmap_1.1.0 ragg_1.2.1 yaml_2.2.1 bit64_4.0.5 purrr_0.3.4 KEGGREST_1.32.0
[73] ash_1.0-15 ggrastr_1.0.1 R.oo_1.24.0 xml2_1.3.3 biomaRt_2.48.3 compiler_4.1.1 beeswarm_0.4.0 filelock_1.0.2
[81] curl_4.3.2 png_0.1-7 ggsignif_0.6.3 tibble_3.1.6 geneplotter_1.70.0 stringi_1.7.6 GenomicFeatures_1.44.2 ggalt_0.4.0
[89] lattice_0.20-44 Matrix_1.3-4 vctrs_0.3.8 pillar_1.6.4 lifecycle_1.0.1 GlobalOptions_0.1.2 bitops_1.0-7 R6_2.5.1
[97] BiocIO_1.2.0 latticeExtra_0.6-29 hwriter_1.3.2 KernSmooth_2.23-20 vipor_0.4.5 codetools_0.2-18 MASS_7.3-54 assertthat_0.2.1
[105] rjson_0.2.20 withr_2.4.3 GenomeInfoDbData_1.2.6 hms_1.1.1 carData_3.0-5 Cairo_1.5-14 ggbeeswarm_0.6.0 restfulr_0.0.13
Thank you for the detailed reply. I had indeed noticed the varying thresholds with changing background sets, when using verbose output (verbose=4 is needed to see these values).
I have updated Bioconductor and universalmotif and now the results are more aligned with my initial expectations:
I think this solves my issue; I am relieved that the number of target hits no longer changes with background.
Alex