Question: Identifying TF binding sites (Motifdb + TxDb.Hsapiens.UCSC.hg19.knownGene)
0
gravatar for Lisa Hopcroft
3.8 years ago by
University of Glasgow, Glasgow
Lisa Hopcroft40 wrote:

Dear all,

I would like to generate a network between human TFs and their predicted targets using sequence motif information.  I have been following the workflow described here (https://www.bioconductor.org/help/workflows/gene-regulation-tfbs/) with additional methodology here (https://support.bioconductor.org/p/68495/#68597).  See MWE at end of this email.

However, I’m a bit confused about some of the results that I’m getting.  When I look for targets of MYC (a transcription factor known to regulate thousands of genes) with this method, I get 2553 unique gene IDs.  Seems reasonable.

But when I do the same thing with TP53 (also known to transcriptionally regulate thousands of genes) and I only get 4?  Seems not reasonable.

I have also tried to increase the upstream region to 5kb in case TP53 motif binding sites are particularly distant, but then the numbers of targets identified for TP53 and MYC are 12 and 7009 respectively.

Can anyone spot an obvious mistake?  Or am I missing a biological issue specific to TP53?

Thanks in advance,
Lisa


=====================================================

library(MotifDb)
library(GenomicFeatures)
library(org.Hs.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

getTFtranscripts <- function(pwm, txdb, genome,
                             upstream=2000, downstream=200, min.score="80%")
{
  prom <- suppressWarnings(
    promoters(txdb, upstream=upstream, downstream=downstream,
              columns=c("tx_name", "gene_id")))
  prom <- trim(prom)
  genome <- getBSgenome(genome)
  prom_seqs <- getSeq(genome, prom)
  hits_per_prom <- suppressWarnings(sapply(
    seq_along(prom_seqs),
    function(i) countPWM(pwm, prom_seqs[[i]], min.score=min.score)
  ))
  ans <- mcols(prom)[hits_per_prom != 0L, ]
  ans[ , "gene_id"] <- as.character(ans[ , "gene_id"])
  as.data.frame(ans)
}

TP53.jaspar <- query(query(MotifDb,"TP53"),"JASPAR_CORE")[[1]]
TP53.targets <- getTFtranscripts(TP53.jaspar,
                                 TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19",
                                 upstream=1000, downstream=250,
                                 min.score="85%")

print( TP53.targets )
print( length( unique(TP53.targets$gene_id[!is.na(TP53.targets$gene_id)] ) ) )

MYC.jaspar <- query(query(MotifDb,"MYC"),"JASPAR_CORE")[[1]]
MYC.targets <- getTFtranscripts(MYC.jaspar,
                                TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19",
                                upstream=1000, downstream=250,
                                min.score="85%")

print( MYC.targets )
print( length( unique(MYC.targets$gene_id[!is.na(MYC.targets$gene_id)] ) ) )

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 BSgenome.Hsapiens.UCSC.hg19_1.4.0       BSgenome_1.38.0                        
 [4] rtracklayer_1.30.1                      org.Hs.eg.db_3.2.3                      RSQLite_1.0.0                          
 [7] DBI_0.3.1                               GenomicFeatures_1.22.0                  AnnotationDbi_1.32.0                   
[10] Biobase_2.30.0                          GenomicRanges_1.22.0                    GenomeInfoDb_1.6.0                     
[13] MotifDb_1.12.0                          Biostrings_2.38.0                       XVector_0.10.0                         
[16] IRanges_2.4.1                           S4Vectors_0.8.0                         BiocGenerics_0.16.0                    

loaded via a namespace (and not attached):
 [1] zlibbioc_1.16.0            GenomicAlignments_1.6.1    BiocParallel_1.4.0         tools_3.2.2               
 [5] SummarizedExperiment_1.0.0 lambda.r_1.1.7             futile.logger_1.4.1        futile.options_1.0.0      
 [9] bitops_1.0-6               biomaRt_2.26.0             RCurl_1.95-4.7             Rsamtools_1.22.0          
[13] XML_3.98-1.3  

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Lisa Hopcroft40
Answer: Identifying TF binding sites (Motifdb + TxDb.Hsapiens.UCSC.hg19.knownGene)
1
gravatar for pshannon
3.8 years ago by
pshannon90
United States
pshannon90 wrote:

Hi LIsa,

This is a very well-posed question!   Thank you.   

I contributed the MotifDb package a couple of years ago, but I am not intimate with the details of TF binding sites and their distribution.  However, I notice that the JASPAR-CORE pwm for Myc is just 12 bases wide but TP53 is 20.  In my naive view, that seems overly long. 

So I just ran an experiment, and came up with 1696 hits for TP53:

TP53.jaspar <- query(query(MotifDb,"TP53"),"JASPAR_CORE")[[1]][, 2:13]

This shortened motif captures the apparent core of the pattern,  at positions 6-9

>  TP53.jaspar
           2         3         4         5 6 7 8 9        10         11         12         13
A 0.17647059 0.2352941 0.2941176 0.7647059 0 1 0 0 0.0000000 0.00000000 0.00000000 0.05882353
C 0.41176471 0.0000000 0.0000000 0.0000000 1 0 0 0 0.6470588 0.94117647 0.94117647 0.00000000
G 0.35294118 0.7647059 0.7058824 0.2352941 0 0 0 1 0.0000000 0.00000000 0.00000000 0.88235294
T 0.05882353 0.0000000 0.0000000 0.0000000 0 0 1 0 0.3529412 0.05882353 0.05882353 0.05882353

Making no other change to your code, this retrieves 1696 hits in 685 genes.

We should solicit the opinion of a TF binding expert to find out if my trimming is defensible.

Hope this helps,

 - Paul

 

ADD COMMENTlink written 3.8 years ago by pshannon90

Thanks for your response Paul, I'm glad my question was clear.

Ahh, didn't clock the length of the motif.  That would explain it.  I will consult with the people at JASPAR on that one (and other experts locally here), but would also appreciate any answers with respect to trimming the motif in this thread also.

 

ADD REPLYlink written 3.8 years ago by Lisa Hopcroft40
Answer: Identifying TF binding sites (Motifdb + TxDb.Hsapiens.UCSC.hg19.knownGene)
1
gravatar for Ou, Jianhong
3.8 years ago by
Ou, Jianhong1.1k
United States
Ou, Jianhong1.1k wrote:

Hi Lisa,

The motif you retrieved from MotifDb should be position frequency matrix. Will it also affect the search? I changed you code like this:

library(MotifDb)
library(GenomicFeatures)
library(org.Hs.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(motifStack)

getTFtranscripts <- function(pwm, txdb, genome,
                             upstream=2000, downstream=200, min.score="80%",
                             ICcutoff=.4)
{
  prom <- suppressWarnings(
    promoters(txdb, upstream=upstream, downstream=downstream,
              columns=c("tx_name", "gene_id")))
  prom <- trim(prom)
  genome <- getBSgenome(genome)
  prom_seqs <- getSeq(genome, prom)
  pfm <- new("pfm", mat=pwm, name=deparse(substitute(pwm)))
  pfm <- trimMotif(pfm, t=ICcutoff)
  pwm <- pfm2pwm(pfm)
  hits_per_prom <- suppressWarnings(sapply(
    seq_along(prom_seqs),
    function(i) countPWM(pwm, prom_seqs[[i]], min.score=min.score)
  ))
  ans <- mcols(prom)[hits_per_prom != 0L, ]
  ans[ , "gene_id"] <- as.character(ans[ , "gene_id"])
  as.data.frame(ans)
}

TP53.jaspar <- query(query(MotifDb,"TP53"),"JASPAR_CORE")[[1]]
TP53.targets <- getTFtranscripts(TP53.jaspar,
                                 TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19",
                                 upstream=1000, downstream=250,
                                 min.score="85%")

print( TP53.targets )
print( length( unique(TP53.targets$gene_id[!is.na(TP53.targets$gene_id)] ) ) )

MYC.jaspar <- query(query(MotifDb,"MYC"),"JASPAR_CORE")[[1]]
MYC.targets <- getTFtranscripts(MYC.jaspar,
                                TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19",
                                upstream=1000, downstream=250,
                                min.score="85%")
print( MYC.targets )
print( length( unique(MYC.targets$gene_id[!is.na(MYC.targets$gene_id)] ) ) )

 

 
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Ou, Jianhong1.1k

Fantastic, thank you for the reply Jianhong.  The addition of the trimMotif() function certainly makes a huge difference (I now get reasonable numbers of targets for TP53) and solves the problem identified by Paul in his post.

I will contact the people at JASPAR in any case to make sure that this is an appropriate thing to do with their motifs.

ADD REPLYlink written 3.8 years ago by Lisa Hopcroft40
Answer: Identifying TF binding sites (Motifdb + TxDb.Hsapiens.UCSC.hg19.knownGene)
0
gravatar for Lisa Hopcroft
3.8 years ago by
University of Glasgow, Glasgow
Lisa Hopcroft40 wrote:

Thanks to Paul and Jianhong for contributing to this thread.  I have been in touch with the JASPAR resource.  They have recommended not to trim the motifs using trimMotif() as suggested by Jianhong: "The information content is highly related to binding strength, and you would basically be trimming away important features. Thereby you are losing what you are trying to predict."

I am currently in discussion with another researcher who has done exactly the task that I'm looking to do.  If I come to any conclusions about this, I'll try to remember to post my findings here.

 

ADD COMMENTlink written 3.8 years ago by Lisa Hopcroft40

It is interesting. I checked the original paper for the TP53 motif. The preferred consensus is the palindrome GGACATGCCCGGGCATGTCC. If we trim the motif (CCGGACATGCCCGGGCATGT) downloaded from JASPAR by information content greater than .04, It will trim the low information position at 1 and 2 and the motif will be like GGACATGCCCGGGCATGT. Acutely, I can not understand why the motif in JASPAR is start from two CC and we can not trim it?

ADD REPLYlink written 3.8 years ago by Ou, Jianhong1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 305 users visited in the last hour