Question

Identifying TF binding sites (Motifdb + TxDb.Hsapiens.UCSC.hg19.knownGene)

0

Entering edit mode

Lisa Hopcroft ▴ 40

@lisa-hopcroft-5130

Last seen 10.1 years ago

University of Glasgow, Glasgow

Dear all,

I would like to generate a network between human TFs and their predicted targets using sequence motif information. I have been following the workflow described here (https://www.bioconductor.org/help/workflows/gene-regulation-tfbs/) with additional methodology here (https://support.bioconductor.org/p/68495/#68597). See MWE at end of this email.

However, I’m a bit confused about some of the results that I’m getting. When I look for targets of MYC (a transcription factor known to regulate thousands of genes) with this method, I get 2553 unique gene IDs. Seems reasonable.

But when I do the same thing with TP53 (also known to transcriptionally regulate thousands of genes) and I only get 4? Seems not reasonable.

I have also tried to increase the upstream region to 5kb in case TP53 motif binding sites are particularly distant, but then the numbers of targets identified for TP53 and MYC are 12 and 7009 respectively.

Can anyone spot an obvious mistake? Or am I missing a biological issue specific to TP53?

Thanks in advance,
Lisa

=====================================================

library(MotifDb)
library(GenomicFeatures)
library(org.Hs.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

getTFtranscripts <- function(pwm, txdb, genome,
                             upstream=2000, downstream=200, min.score="80%")
{
  prom <- suppressWarnings(
    promoters(txdb, upstream=upstream, downstream=downstream,
              columns=c("tx_name", "gene_id")))
  prom <- trim(prom)
  genome <- getBSgenome(genome)
  prom_seqs <- getSeq(genome, prom)
  hits_per_prom <- suppressWarnings(sapply(
    seq_along(prom_seqs),
    function(i) countPWM(pwm, prom_seqs[[i]], min.score=min.score)
  ))
  ans <- mcols(prom)[hits_per_prom != 0L, ]
  ans[ , "gene_id"] <- as.character(ans[ , "gene_id"])
  as.data.frame(ans)
}

TP53.jaspar <- query(query(MotifDb,"TP53"),"JASPAR_CORE")[[1]]
TP53.targets <- getTFtranscripts(TP53.jaspar,
                                 TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19",
                                 upstream=1000, downstream=250,
                                 min.score="85%")

print( TP53.targets )
print( length( unique(TP53.targets$gene_id[!is.na(TP53.targets$gene_id)] ) ) )

MYC.jaspar <- query(query(MotifDb,"MYC"),"JASPAR_CORE")[[1]]
MYC.targets <- getTFtranscripts(MYC.jaspar,
                                TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19",
                                upstream=1000, downstream=250,
                                min.score="85%")

print( MYC.targets )
print( length( unique(MYC.targets$gene_id[!is.na(MYC.targets$gene_id)] ) ) )

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 BSgenome.Hsapiens.UCSC.hg19_1.4.0       BSgenome_1.38.0
[4] rtracklayer_1.30.1                      org.Hs.eg.db_3.2.3                      RSQLite_1.0.0
[7] DBI_0.3.1                               GenomicFeatures_1.22.0                  AnnotationDbi_1.32.0
[10] Biobase_2.30.0                          GenomicRanges_1.22.0                    GenomeInfoDb_1.6.0
[13] MotifDb_1.12.0                          Biostrings_2.38.0                       XVector_0.10.0
[16] IRanges_2.4.1                           S4Vectors_0.8.0                         BiocGenerics_0.16.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0            GenomicAlignments_1.6.1    BiocParallel_1.4.0         tools_3.2.2
[5] SummarizedExperiment_1.0.0 lambda.r_1.1.7             futile.logger_1.4.1        futile.options_1.0.0
[9] bitops_1.0-6               biomaRt_2.26.0             RCurl_1.95-4.7             Rsamtools_1.22.0
[13] XML_3.98-1.3

motifdb JASPAR transcription factor binding site • 2.7k views

ADD COMMENT • link 10.2 years ago Lisa Hopcroft ▴ 40

score 1 · Answer 1 · 2015-10-30

Hi LIsa,

This is a very well-posed question! Thank you.

I contributed the MotifDb package a couple of years ago, but I am not intimate with the details of TF binding sites and their distribution. However, I notice that the JASPAR-CORE pwm for Myc is just 12 bases wide but TP53 is 20. In my naive view, that seems overly long.

So I just ran an experiment, and came up with 1696 hits for TP53:

TP53.jaspar <- query(query(MotifDb,"TP53"),"JASPAR_CORE")[[1]][, 2:13]

This shortened motif captures the apparent core of the pattern, at positions 6-9

>  TP53.jaspar
           2         3         4         5 6 7 8 9        10         11         12         13
A 0.17647059 0.2352941 0.2941176 0.7647059 0 1 0 0 0.0000000 0.00000000 0.00000000 0.05882353
C 0.41176471 0.0000000 0.0000000 0.0000000 1 0 0 0 0.6470588 0.94117647 0.94117647 0.00000000
G 0.35294118 0.7647059 0.7058824 0.2352941 0 0 0 1 0.0000000 0.00000000 0.00000000 0.88235294
T 0.05882353 0.0000000 0.0000000 0.0000000 0 0 1 0 0.3529412 0.05882353 0.05882353 0.05882353

Making no other change to your code, this retrieves 1696 hits in 685 genes.

We should solicit the opinion of a TF binding expert to find out if my trimming is defensible.

Hope this helps,

- Paul

score 1 · Answer 2 · 2015-10-30

Hi Lisa,

The motif you retrieved from MotifDb should be position frequency matrix. Will it also affect the search? I changed you code like this:

library(MotifDb)
library(GenomicFeatures)
library(org.Hs.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(motifStack)

getTFtranscripts <- function(pwm, txdb, genome,
                             upstream=2000, downstream=200, min.score="80%",
                             ICcutoff=.4)
{
  prom <- suppressWarnings(
    promoters(txdb, upstream=upstream, downstream=downstream,
              columns=c("tx_name", "gene_id")))
  prom <- trim(prom)
  genome <- getBSgenome(genome)
  prom_seqs <- getSeq(genome, prom)
  pfm <- new("pfm", mat=pwm, name=deparse(substitute(pwm)))
  pfm <- trimMotif(pfm, t=ICcutoff)
  pwm <- pfm2pwm(pfm)
  hits_per_prom <- suppressWarnings(sapply(
    seq_along(prom_seqs),
    function(i) countPWM(pwm, prom_seqs[[i]], min.score=min.score)
  ))
  ans <- mcols(prom)[hits_per_prom != 0L, ]
  ans[ , "gene_id"] <- as.character(ans[ , "gene_id"])
  as.data.frame(ans)
}

TP53.jaspar <- query(query(MotifDb,"TP53"),"JASPAR_CORE")[[1]]
TP53.targets <- getTFtranscripts(TP53.jaspar,
                                 TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19",
                                 upstream=1000, downstream=250,
                                 min.score="85%")

print( TP53.targets )
print( length( unique(TP53.targets$gene_id[!is.na(TP53.targets$gene_id)] ) ) )

MYC.jaspar <- query(query(MotifDb,"MYC"),"JASPAR_CORE")[[1]]
MYC.targets <- getTFtranscripts(MYC.jaspar,
                                TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19",
                                upstream=1000, downstream=250,
                                min.score="85%")
print( MYC.targets )
print( length( unique(MYC.targets$gene_id[!is.na(MYC.targets$gene_id)] ) ) )

score 0 · Answer 3 · 2015-11-04

0

Entering edit mode

Lisa Hopcroft ▴ 40

@lisa-hopcroft-5130

Last seen 10.1 years ago

University of Glasgow, Glasgow

Thanks to Paul and Jianhong for contributing to this thread. I have been in touch with the JASPAR resource. They have recommended not to trim the motifs using trimMotif() as suggested by Jianhong: "The information content is highly related to binding strength, and you would basically be trimming away important features. Thereby you are losing what you are trying to predict."

I am currently in discussion with another researcher who has done exactly the task that I'm looking to do. If I come to any conclusions about this, I'll try to remember to post my findings here.

ADD COMMENT • link 10.2 years ago Lisa Hopcroft ▴ 40

0

Entering edit mode

It is interesting. I checked the original paper for the TP53 motif. The preferred consensus is the palindrome GGACATGCCCGGGCATGTCC. If we trim the motif (CCGGACATGCCCGGGCATGT) downloaded from JASPAR by information content greater than .04, It will trim the low information position at 1 and 2 and the motif will be like GGACATGCCCGGGCATGT. Acutely, I can not understand why the motif in JASPAR is start from two CC and we can not trim it?

ADD REPLY • link 10.2 years ago Ou, Jianhong ★ 1.3k