Hi,
I noticed that in seqPattern plots, the row positions of the sequences seem to change in plots that have fewer matches. The plots below show motif matches for the same input sequences; from left to right, the motifs increase in complexity. The dots (motif matches) in each plot should therefore be a subset of the dots, at the same positions, in any plot to its left. However, the section above the top dashed line looks different across plots, while the section below the bottom dashed line looks consistent (the dashed lines were placed arbitrarily, as references only). Based on the biology of the sequences, the CACACT plot is what I would expect. I have only seen this behavior with this particular motif, which has many more sequences with no matching site at all than the other motifs do.
I tried every version of R/Bioconductor available on our cluster, and the result is the same. Any suggestions would be appreciated.
Input sequences download
Script:
library(seqPattern)
library(Biostrings)
w <- 800   # plot width
h <- 2500  # plot height
# Motifs to plot, in the same order as the run log below
motifs <- c("YGGYMACACT", "GGTCACACT", "TCACACT", "CACACT")
# Draw one density map per motif, with identical settings for each
plot_pattern <- function(fa, nm) {
  for (motif in motifs) {
    plotPatternDensityMap(regionsSeq = fa, patterns = motif, color = "gray",
                          outFile = paste0(nm, "Density_ohler1_", motif),
                          plotWidth = w, plotHeight = h,
                          addReferenceLine = FALSE, plotScale = FALSE, cexAxis = 6,
                          xTicksAt = c(1, 150, 300), xTicks = c("-150", "TSS", "150"),
                          addPatternLabel = FALSE)
  }
}
fa_list <- list.files(pattern = "\\.fa$")  # the list has only 1 .fa file in this case
print(fa_list)
for (nm in fa_list) {
  fa <- readDNAStringSet(nm, format = "fasta")
  plot_pattern(fa, nm)
}
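The subset expectation described above can also be checked without any plotting. The following is a minimal sketch and was not part of the original run: it uses `vmatchPattern` from Biostrings on three made-up toy sequences (hypothetical data standing in for the FASTA file), and the helper name `match_keys` is an assumption for illustration. Matches are keyed by end position because the three fixed motifs share the CACACT suffix. YGGYMACACT is left out on purpose: its M can be A as well as C, so its matches need not all contain CACACT, and it would need `fixed = FALSE` to honour the IUPAC codes.

```r
library(Biostrings)

# Toy sequences (hypothetical) standing in for the real FASTA input.
seqs <- DNAStringSet(c(s1 = "AAGGTCACACTAA",
                       s2 = "TTTCACACTTTT",
                       s3 = "ACGTACGTACGT"))

# Fixed motifs from least to most specific; all end in CACACT.
motifs <- c("CACACT", "TCACACT", "GGTCACACT")

# Hypothetical helper: one "sequenceIndex:endPosition" key per match.
match_keys <- function(motif, seqs) {
  hits <- vmatchPattern(motif, seqs)
  ends <- endIndex(hits)  # list with one entry per sequence (empty if no match)
  paste(rep(seq_along(ends), lengths(ends)), unlist(ends), sep = ":")
}

keys <- lapply(motifs, match_keys, seqs = seqs)
names(keys) <- motifs

# Each more specific motif should only hit where the less specific one does.
is_subset <- c(all(keys[["TCACACT"]] %in% keys[["CACACT"]]),
               all(keys[["GGTCACACT"]] %in% keys[["TCACACT"]]))
print(is_subset)
```

On the real data, `seqs` would be replaced by the `readDNAStringSet()` result. If the check passes there, the discrepancy between plots would more likely come from how the rows are drawn than from the matches themselves, though that remains to be confirmed.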
Run log and sessionInfo:
Currently Loaded Modules:
1) gcc/7.3.0-centos_7 2) r/3.5.0
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
colnames, colSums, dirname, do.call, duplicated, eval, evalq,
Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
rowSums, sapply, setdiff, sort, table, tapply, union, unique,
unsplit, which, which.max, which.min
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Loading required package: XVector
Attaching package: ‘Biostrings’
The following object is masked from ‘package:base’:
strsplit
[1] "intq_mid_spgnchr_g1_g2BH_Up_g2BH_DOWN_no_break.fa"
Getting oligonucleotide occurrence matrix...
Calculating density...
->YGGYMACACT
Plotting...
->YGGYMACACT
Getting oligonucleotide occurrence matrix...
Calculating density...
->GGTCACACT
Plotting...
->GGTCACACT
Getting oligonucleotide occurrence matrix...
Calculating density...
->TCACACT
Plotting...
->TCACACT
Getting oligonucleotide occurrence matrix...
Calculating density...
->CACACT
Plotting...
->CACACT
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /usr/lib64/libblas.so.3.4.2
LAPACK: /scg/apps/software/r/3.5.0/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] Biostrings_2.50.2 XVector_0.22.0 IRanges_2.16.0
[4] S4Vectors_0.20.1 BiocGenerics_0.28.0 seqPattern_1.14.0
loaded via a namespace (and not attached):
[1] zlibbioc_1.28.0 compiler_3.5.0 GenomicRanges_1.34.0
[4] GenomeInfoDbData_1.2.0 RCurl_1.95-4.12 KernSmooth_2.23-15
[7] plotrix_3.7-5 GenomeInfoDb_1.18.2 bitops_1.0-6