Identifying TF binding sites (Motifdb + TxDb.Hsapiens.UCSC.hg19.knownGene)
3
0
Entering edit mode
@lisa-hopcroft-5130
Last seen 6.2 years ago
University of Glasgow, Glasgow

Dear all,

I would like to generate a network between human TFs and their predicted targets using sequence motif information.  I have been following the workflow described here (https://www.bioconductor.org/help/workflows/gene-regulation-tfbs/) with additional methodology here (https://support.bioconductor.org/p/68495/#68597).  See MWE at end of this email.

However, I’m a bit confused about some of the results that I’m getting.  When I look for targets of MYC (a transcription factor known to regulate thousands of genes) with this method, I get 2553 unique gene IDs.  Seems reasonable.

But when I do the same thing with TP53 (also known to transcriptionally regulate thousands of genes) and I only get 4?  Seems not reasonable.

I have also tried to increase the upstream region to 5kb in case TP53 motif binding sites are particularly distant, but then the numbers of targets identified for TP53 and MYC are 12 and 7009 respectively.

Can anyone spot an obvious mistake?  Or am I missing a biological issue specific to TP53?

Lisa

=====================================================

library(MotifDb)
library(GenomicFeatures)
library(org.Hs.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

getTFtranscripts <- function(pwm, txdb, genome,
upstream=2000, downstream=200, min.score="80%")
{
prom <- suppressWarnings(
promoters(txdb, upstream=upstream, downstream=downstream,
columns=c("tx_name", "gene_id")))
prom <- trim(prom)
genome <- getBSgenome(genome)
prom_seqs <- getSeq(genome, prom)
hits_per_prom <- suppressWarnings(sapply(
seq_along(prom_seqs),
function(i) countPWM(pwm, prom_seqs[[i]], min.score=min.score)
))
ans <- mcols(prom)[hits_per_prom != 0L, ]
ans[ , "gene_id"] <- as.character(ans[ , "gene_id"])
as.data.frame(ans)
}

TP53.jaspar <- query(query(MotifDb,"TP53"),"JASPAR_CORE")[[1]]
TP53.targets <- getTFtranscripts(TP53.jaspar,
TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19",
upstream=1000, downstream=250,
min.score="85%")

print( TP53.targets )
print( length( unique(TP53.targets$gene_id[!is.na(TP53.targets$gene_id)] ) ) )

MYC.jaspar <- query(query(MotifDb,"MYC"),"JASPAR_CORE")[[1]]
MYC.targets <- getTFtranscripts(MYC.jaspar,
TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19",
upstream=1000, downstream=250,
min.score="85%")

print( MYC.targets )
print( length( unique(MYC.targets$gene_id[!is.na(MYC.targets$gene_id)] ) ) )

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 BSgenome.Hsapiens.UCSC.hg19_1.4.0       BSgenome_1.38.0
[4] rtracklayer_1.30.1                      org.Hs.eg.db_3.2.3                      RSQLite_1.0.0
[7] DBI_0.3.1                               GenomicFeatures_1.22.0                  AnnotationDbi_1.32.0
[10] Biobase_2.30.0                          GenomicRanges_1.22.0                    GenomeInfoDb_1.6.0
[13] MotifDb_1.12.0                          Biostrings_2.38.0                       XVector_0.10.0
[16] IRanges_2.4.1                           S4Vectors_0.8.0                         BiocGenerics_0.16.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0            GenomicAlignments_1.6.1    BiocParallel_1.4.0         tools_3.2.2
[5] SummarizedExperiment_1.0.0 lambda.r_1.1.7             futile.logger_1.4.1        futile.options_1.0.0
[9] bitops_1.0-6               biomaRt_2.26.0             RCurl_1.95-4.7             Rsamtools_1.22.0
[13] XML_3.98-1.3

motifdb JASPAR transcription factor binding site • 1.3k views
1
Entering edit mode
pshannon ▴ 90
@pshannon-6931
Last seen 6.3 years ago
United States

Hi LIsa,

This is a very well-posed question!   Thank you.

I contributed the MotifDb package a couple of years ago, but I am not intimate with the details of TF binding sites and their distribution.  However, I notice that the JASPAR-CORE pwm for Myc is just 12 bases wide but TP53 is 20.  In my naive view, that seems overly long.

So I just ran an experiment, and came up with 1696 hits for TP53:

TP53.jaspar <- query(query(MotifDb,"TP53"),"JASPAR_CORE")[[1]][, 2:13]

This shortened motif captures the apparent core of the pattern,  at positions 6-9

>  TP53.jaspar
2         3         4         5 6 7 8 9        10         11         12         13
A 0.17647059 0.2352941 0.2941176 0.7647059 0 1 0 0 0.0000000 0.00000000 0.00000000 0.05882353
C 0.41176471 0.0000000 0.0000000 0.0000000 1 0 0 0 0.6470588 0.94117647 0.94117647 0.00000000
G 0.35294118 0.7647059 0.7058824 0.2352941 0 0 0 1 0.0000000 0.00000000 0.00000000 0.88235294
T 0.05882353 0.0000000 0.0000000 0.0000000 0 0 1 0 0.3529412 0.05882353 0.05882353 0.05882353

Making no other change to your code, this retrieves 1696 hits in 685 genes.

We should solicit the opinion of a TF binding expert to find out if my trimming is defensible.

Hope this helps,

- Paul

0
Entering edit mode

Thanks for your response Paul, I'm glad my question was clear.

Ahh, didn't clock the length of the motif.  That would explain it.  I will consult with the people at JASPAR on that one (and other experts locally here), but would also appreciate any answers with respect to trimming the motif in this thread also.

1
Entering edit mode
Ou, Jianhong ★ 1.3k
@ou-jianhong-4539
Last seen 1 day ago
United States

Hi Lisa,

The motif you retrieved from MotifDb should be position frequency matrix. Will it also affect the search? I changed you code like this:

library(MotifDb)
library(GenomicFeatures)
library(org.Hs.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(motifStack)

getTFtranscripts <- function(pwm, txdb, genome,
upstream=2000, downstream=200, min.score="80%",
ICcutoff=.4)
{
prom <- suppressWarnings(
promoters(txdb, upstream=upstream, downstream=downstream,
columns=c("tx_name", "gene_id")))
prom <- trim(prom)
genome <- getBSgenome(genome)
prom_seqs <- getSeq(genome, prom)
pfm <- new("pfm", mat=pwm, name=deparse(substitute(pwm)))
pfm <- trimMotif(pfm, t=ICcutoff)
pwm <- pfm2pwm(pfm)
hits_per_prom <- suppressWarnings(sapply(
seq_along(prom_seqs),
function(i) countPWM(pwm, prom_seqs[[i]], min.score=min.score)
))
ans <- mcols(prom)[hits_per_prom != 0L, ]
ans[ , "gene_id"] <- as.character(ans[ , "gene_id"])
as.data.frame(ans)
}

TP53.jaspar <- query(query(MotifDb,"TP53"),"JASPAR_CORE")[[1]]
TP53.targets <- getTFtranscripts(TP53.jaspar,
TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19",
upstream=1000, downstream=250,
min.score="85%")

print( TP53.targets )
print( length( unique(TP53.targets$gene_id[!is.na(TP53.targets$gene_id)] ) ) )

MYC.jaspar <- query(query(MotifDb,"MYC"),"JASPAR_CORE")[[1]]
MYC.targets <- getTFtranscripts(MYC.jaspar,
TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19",
upstream=1000, downstream=250,
min.score="85%")
print( MYC.targets )
print( length( unique(MYC.targets$gene_id[!is.na(MYC.targets$gene_id)] ) ) )


0
Entering edit mode

Fantastic, thank you for the reply Jianhong.  The addition of the trimMotif() function certainly makes a huge difference (I now get reasonable numbers of targets for TP53) and solves the problem identified by Paul in his post.

I will contact the people at JASPAR in any case to make sure that this is an appropriate thing to do with their motifs.

0
Entering edit mode
@lisa-hopcroft-5130
Last seen 6.2 years ago
University of Glasgow, Glasgow

Thanks to Paul and Jianhong for contributing to this thread.  I have been in touch with the JASPAR resource.  They have recommended not to trim the motifs using trimMotif() as suggested by Jianhong: "The information content is highly related to binding strength, and you would basically be trimming away important features. Thereby you are losing what you are trying to predict."

I am currently in discussion with another researcher who has done exactly the task that I'm looking to do.  If I come to any conclusions about this, I'll try to remember to post my findings here.

0
Entering edit mode

It is interesting. I checked the original paper for the TP53 motif. The preferred consensus is the palindrome GGACATGCCCGGGCATGTCC. If we trim the motif (CCGGACATGCCCGGGCATGT) downloaded from JASPAR by information content greater than .04, It will trim the low information position at 1 and 2 and the motif will be like GGACATGCCCGGGCATGT. Acutely, I can not understand why the motif in JASPAR is start from two CC and we can not trim it?