I apologise if some version of this question has already been asked in some way- I could not find any issues that were similar to my own.
I have a choice between taking data from the datasource package 'MotifDb' and 'JASPAR2018':
MotifDb has a much larger wealth of data:
library (MotifDb) library("BSgenome.Mmusculus.UCSC.mm9") tf_motif_hoco=subset(MotifDb, organism=='Mmusculus') length(tf_motif_hoco)
1411
JASPAR database is much smaller:
library(JASPAR2018)
opts = list(species=10090)
jaspar_data = getMatrixSet(JASPAR2018, opts)
length(jaspar_data)
116
Hence in order to maximise the amount of data that I can access, I'd prefer to go with the MotifDb option.
However when I convert the PFM to a PWM the conversion looks slightly different between 2 corresponding searches from the databases.
See below how I take the same PFM from the JASPAR 2018 part of MotifDb and the actual JASPAR2018 database and compare:
library (MotifDb) library(TFBSTools) tf_motif_hoco=subset(MotifDb, organism=='Mmusculus' & dataSource=='jaspar2018') pfm=tf_motif_hoco[[1]] #take first search result- ARNT
pfm
1 2 3 4 5 6 A 0.2 0.95 0 0 0 0 C 0.8 0.00 1 0 0 0 G 0.0 0.05 0 1 0 1 T 0.0 0.00 0 0 1 0pfm_rounded=round(pfm*1000) pfm_rounded=apply(pfm_rounded,2,as.integer) rownames(pfm_rounded)=c('A','C','G','T') pwm=toPWM(pfm_rounded)
And now the corresponding process in JASPAR:
library(JASPAR2018) opts = list(species=10090)
pfm <- getMatrixSet(JASPAR2018, opts)
pfm
[,1] [,2] [,3] [,4] [,5] [,6]
A 4 19 0 0 0 0
C 16 0 20 0 0 0
G 0 1 0 20 0 20
T 0 0 0 0 20 0
You can see that the above printed pfms are the same except for the fact that the values in the one that came from MotifDB are 1/20 that of Jaspar.
I made a mistake when originally writing a script that took all of the PFMs from MotifDB and implementing further analysis- before converting them to PWMs I multiplied the values by 1000 rather than 20... Hence when I compare the PWMs from the different sources are different.... The values are similar but not equal... e.g.
JASPAR2018
[,1] [,2] [,3] [,4] [,5] [,6]
A -0.3081223 1.884523 -4.700440 -4.700440 -4.700440 -4.700440
C 1.6394103 -4.700440 1.957772 -4.700440 -4.700440 -4.700440
G -4.7004397 -2.115477 -4.700440 1.957772 -4.700440 1.957772
T -4.7004397 -4.700440 -4.700440 -4.700440 1.957772 -4.700440
MotifDb
1 2 3 4 5 6
A -0.3216398 1.925149 -10.288866 -10.288866 -10.288866 -10.288866
C 1.6772788 -10.288866 1.999135 -10.288866 -10.288866 -10.288866
G -10.2888661 -2.317323 -10.288866 1.999135 -10.288866 1.999135
T -10.2888661 -10.288866 -10.288866 -10.288866 1.999135 -10.288866
The script took a really long time to run the first time so I would prefer not to have to run this script again and just use the output that was found the first time...
Also as a wider issue- can I be sure that all PFMs from MotifDb are only need to be multiplied by 20 in order to be convertible to a PWM?
I tried to get hold of a pair of PFMs from HOCOMOCO and from the HOCOMOCO section of MotifDb in order to make the same comparison as above but HOCOMOCO has updated to version 11 while the corresponding PFMs from MotifDb have not and hence they are not comparable...