A basic question about retrieve the absolute genome position from the output of of function matchPWM .
1
0
Entering edit mode
xyang2 ▴ 120
@xyang2-4387
Last seen 3.4 years ago
Dear all, I am learning how to "Finding Candidate Binding Sites for Known Transcription Factors via Sequence Matching" from the Bioconductor online help file at http://www.bioconductor.org/help/workflows/gene-regulation-tfbs/ When I ran the following codes, I can not match the results back to the UCSC genome browser results. ||>|orfs <- as.character(mget("RHOA", org.Hs.egSYMBOL2EG)) ||||>|| pfm.grhl <- query(MotifDb,"GRHL")[[1]] > pcm.grhl <- round(100 * pfm.grhl) > chromosomal.loc <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene") [orfs] > promoter.seqs <- getPromoterSeq(chromosomal.loc, Hsapiens, upstream=1000, downstream=0) > promoter.seqs <- unlist(promoter.seqs) > matchPWM(pcm.grhl, promoter.seqs[[1]], "80%") Views on a 1000-letter DNAString subject subject: AGCCCAGGTCAGGTTAGAGACCACTGGTATTGTT...GGCACTCGGAGGCGCGCACGTCGT TCCCCGCCCT views: start end width [1] 347 354 8 [AAGCCTCA] > chromosomal.loc GRangesList of length 1: $387 GRanges with 2 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] chr3 [49396579, 49449428] - | 15517 uc010hku.3 [2] chr3 [49396579, 49449526] - | 15518 uc003cwu.3 --- seqlengths: chr1 chr2 ... chrUn_gl000249 249250621 243199373 ... 38502 As|||49396579-1000+347=49395926, I use it as an absolute position.| However, when I visit the UCSC genome browser, for the region chr3:||||49395926||-||||49395934.|| I got [GTGCTCCTC], instead of [|||AAGCCTCA] or similar.| So, could anyone tell me how to calculate the absolute genome position from the output of of function matchPWM ?. Thanks in advance. Holly | [[alternative HTML version deleted]]
• 2.6k views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 13 months ago
United States
Hi Holly, On Sat, Oct 27, 2012 at 7:23 PM, Holly <xyang2 at="" uchicago.edu=""> wrote: > Dear all, > I am learning how to "Finding Candidate Binding Sites for Known > Transcription Factors via Sequence Matching" from the Bioconductor > online help file at > http://www.bioconductor.org/help/workflows/gene-regulation-tfbs/ > > When I ran the following codes, I can not match the results back to the > UCSC genome browser results. > > ||>|orfs <- as.character(mget("RHOA", org.Hs.egSYMBOL2EG)) > ||||>|| pfm.grhl <- query(MotifDb,"GRHL")[[1]] >> pcm.grhl <- round(100 * pfm.grhl) >> chromosomal.loc <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene") [orfs] >> promoter.seqs <- getPromoterSeq(chromosomal.loc, Hsapiens, upstream=1000, downstream=0) >> promoter.seqs <- unlist(promoter.seqs) >> matchPWM(pcm.grhl, promoter.seqs[[1]], "80%") > Views on a 1000-letter DNAString subject > subject: AGCCCAGGTCAGGTTAGAGACCACTGGTATTGTT...GGCACTCGGAGGCGCGCACGTC GTTCCCCGCCCT > views: > start end width > [1] 347 354 8 [AAGCCTCA] > >> chromosomal.loc > GRangesList of length 1: > $387 > GRanges with 2 ranges and 2 metadata columns: > seqnames ranges strand | tx_id tx_name > <rle> <iranges> <rle> | <integer> <character> > [1] chr3 [49396579, 49449428] - | 15517 uc010hku.3 > [2] chr3 [49396579, 49449526] - | 15518 uc003cwu.3 > > --- > seqlengths: > chr1 chr2 ... chrUn_gl000249 > 249250621 243199373 ... 38502 > > As|||49396579-1000+347=49395926, I use it as an absolute position.| > However, when I visit the UCSC genome browser, for the region chr3:||||49395926||-||||49395934.|| > I got [GTGCTCCTC], instead of [|||AAGCCTCA] or similar.| > > So, could anyone tell me how to calculate the absolute genome position from the output of of function matchPWM ?. Beware of the evil strand. This transcript is on the negative strand, so the 1000bp upstream of the first transcript you've listed starts at its end, namely `(end+1):(end+999)` HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT
0
Entering edit mode
Hi, Steve, Thank you. 49449428+999-354 =49450073 49449428+999-347= 49450080 By searching chr3:49450073-49450080 I got [GGTGAGGC], which is reversely complementary to [AAGCCTCA] as expected. > reverseComplement(DNAString("GGTGAGGC")) 8-letter "DNAString" instance seq: CAGCCTCA More question is, to screen the "promoter" region, what should I do? Should I focus on the "upstream" of "positive" strand? Does it equally to search the "downstream" of "negative" strand, if the program reports sequence of only negative strain? Thanks again, Holly On 10/27/2012 6:19 PM, Steve Lianoglou wrote: > Hi Holly, > > On Sat, Oct 27, 2012 at 7:23 PM, Holly <xyang2 at="" uchicago.edu=""> wrote: >> Dear all, >> I am learning how to "Finding Candidate Binding Sites for Known >> Transcription Factors via Sequence Matching" from the Bioconductor >> online help file at >> http://www.bioconductor.org/help/workflows/gene-regulation-tfbs/ >> >> When I ran the following codes, I can not match the results back to the >> UCSC genome browser results. >> >> ||>|orfs <- as.character(mget("RHOA", org.Hs.egSYMBOL2EG)) >> ||||>|| pfm.grhl <- query(MotifDb,"GRHL")[[1]] >>> pcm.grhl <- round(100 * pfm.grhl) >>> chromosomal.loc <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene") [orfs] >>> promoter.seqs <- getPromoterSeq(chromosomal.loc, Hsapiens, upstream=1000, downstream=0) >>> promoter.seqs <- unlist(promoter.seqs) >>> matchPWM(pcm.grhl, promoter.seqs[[1]], "80%") >> Views on a 1000-letter DNAString subject >> subject: AGCCCAGGTCAGGTTAGAGACCACTGGTATTGTT...GGCACTCGGAGGCGCGCACGT CGTTCCCCGCCCT >> views: >> start end width >> [1] 347 354 8 [AAGCCTCA] >> >>> chromosomal.loc >> GRangesList of length 1: >> $387 >> GRanges with 2 ranges and 2 metadata columns: >> seqnames ranges strand | tx_id tx_name >> <rle> <iranges> <rle> | <integer> <character> >> [1] chr3 [49396579, 49449428] - | 15517 uc010hku.3 >> [2] chr3 [49396579, 49449526] - | 15518 uc003cwu.3 >> >> --- >> seqlengths: >> chr1 chr2 ... chrUn_gl000249 >> 249250621 243199373 ... 38502 >> >> As|||49396579-1000+347=49395926, I use it as an absolute position.| >> However, when I visit the UCSC genome browser, for the region chr3:||||49395926||-||||49395934.|| >> I got [GTGCTCCTC], instead of [|||AAGCCTCA] or similar.| >> >> So, could anyone tell me how to calculate the absolute genome position from the output of of function matchPWM ?. > Beware of the evil strand. > > This transcript is on the negative strand, so the 1000bp upstream of > the first transcript you've listed starts at its end, namely > `(end+1):(end+999)` > > HTH, > -steve >
ADD REPLY
0
Entering edit mode
Hi, On Sat, Oct 27, 2012 at 10:54 PM, Holly <xyang2 at="" uchicago.edu=""> wrote: > Hi, Steve, > Thank you. No problem. [snip] > More question is, to screen the "promoter" region, what should I do? > Should I focus on the "upstream" of "positive" strand? Does it equally to > search the "downstream" of "negative" strand, if the program reports > sequence of only negative strain? Sorry, I don't follow your question. The promoter of a gene/transcript is "defined" as some region *around* the transcription start site. I say "around" instead of "upstream" because people often/occasionally/might-be-inclined-to include sequence up to and including the first intron of a gene when looking for regulatory motifs. That having been said, a standard/run-of-the-mill analysis might start by simply focusing on sequence *upstream* from the TSS. You don't go "downtream" of a negative strand ... the region I pointed you to look at for your sequence motif is still *upstream* of the gene, even though you are moving "to the rightt" on the sequence. You don't orient upstream and downstream by something being left or right of a position ... these terms are relative to the strand of transcription. So for features located on the positive strand, you "look left" for "upstream. Genes annotated on the negative strand, their upstream region is "to the right" I'm not sure if that what you were asking, but I hope it's helpful and made some sense.same. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
Hi, Steve, I should pay attention to the arrows near the gene RHOA, e.g. chr3:49,396,579-49,449,428. Yes, left and right are relatively. "people often/occasionally/might-be-inclined-to include sequence up to and including the first intron of a gene when looking for regulatory motifs" - If you have citation for this, will be even more helpful. Thanks again, Holly
ADD REPLY
0
Entering edit mode
Hi Holly, On Sat, Oct 27, 2012 at 11:46 PM, Holly <xyang2 at="" uchicago.edu=""> wrote: > Hi, Steve, > I should pay attention to the arrows near the gene RHOA, e.g. > chr3:49,396,579-49,449,428. > Yes, left and right are relatively. > > > "people often/occasionally/might-be-inclined-to include > sequence up to and including the first intron of a gene when looking > for regulatory motifs" - If you have citation for this, will be even more > helpful. You can start here: https://www.google.com/search?q=transcription+regulation+first+intron HTH :-) -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
Steve, As the motif I am interested is [AAGTTTAA]. Then I got a result on the negative strand as chr3:49450073-49450080, [GGTGAGGC], which is reversely complementary to [AAGCCTCA]. Should I compare the reversely complementary hit to the motif? If a hit is on a positive strand, then to compare the hit directly to the motif? Say, a literature review/database gives the motif by default for a positive strand? Thanks a lot, Holly
ADD REPLY
0
Entering edit mode
Hi, On Sun, Oct 28, 2012 at 12:01 AM, Holly <xyang2 at="" uchicago.edu=""> wrote: > Steve, > > As the motif I am interested is [AAGTTTAA]. > Then I got a result on the negative strand as chr3:49450073-49450080, > [GGTGAGGC], which is reversely complementary to [AAGCCTCA]. > > Should I compare the reversely complementary hit to the motif? Well -- think about your protein of interest as it binds to the DNA. Do you think it's important which strand the protein is binding to? Does your protein only bind single stranded DNA? > If a hit is on a positive strand, then to compare the hit directly to the > motif? I'm not sure what you're asking here, but again -- I'm guessing the biological scenario you are trying to model is when the DNA is paired w/ its cognate/opposite strand and you are scanning for some transcription factor binding site ... so I guess it's worth asking yourself if the strand that "the correct" sequence motif is on is important here. > Say, a literature review/database gives the motif by default for a positive > strand? I'm not actually sure what the consensus (if there is one) for which strand to use in order to report the motif in such databases. -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
Hi Holly, See comments and suggestions below. On Oct 27, 2012, at 7:54 PM, Holly wrote: > Hi, Steve, > Thank you. > > 49449428+999-354 =49450073 > 49449428+999-347= 49450080 > By searching > > chr3:49450073-49450080 I got [GGTGAGGC], which is reversely complementary to [AAGCCTCA] as expected. > >> reverseComplement(DNAString("GGTGAGGC")) > 8-letter "DNAString" instance > seq: CAGCCTCA > > More question is, to screen the "promoter" region, what should I do? > Should I focus on the "upstream" of "positive" strand? Does it equally to search the "downstream" of "negative" strand, if the program reports sequence of only negative strain? getPromoterSeq is strand-aware in two ways. For gene transcripts annotated to the minus strand: 1) It interprets 'upstream' as 'increasing chromosomal coordinates'. 2) It returns the reverse complement of the extracted sequence, which is what you probably want. In the following, I specify an upstream length of just 10, to make experimentation easier: library(org.Hs.eg.db) library(MotifDb) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(BSgenome.Hsapiens.UCSC.hg19) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene geneID <-as.character (get ("RHOA", org.Hs.egSYMBOL2EG)) loc <- transcriptsBy(txdb, by="gene") [geneID][[1]][1] GRanges with 1 range and 2 metadata columns: seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] chr3 [49396579, 49449428] - | 15517 uc010hku.3 getPromoterSeq(loc, Hsapiens, upstream=10, downstream=0) A DNAStringSet instance of length 1 width seq 10 TCCCCGCCCT If you point the UCSC genome browser, using genome=hg19, to chr3:49449428:49449437, click the 'complement bases' arrow in the top left, you will see, reading from right to left, and starting at chr3:49449437, the same sequence returned by getPromoterSeq: TCCCCGCCCT. Another handy trick is to ask getPromoterSeq to return 20 bases upstream. This is a long enough sequence to use UCSC BLAT, which will take you right to the proper location in the genome browser. http://genome.ucsc.edu/cgi-bin/hgBlat?command=start. BLAT will try to match your input sequence on both strands, and report which one it found. I believe that all of this demonstrates that getPromoterSeq and its upstream and downstream arguments are relative to the gene transcript's TSS, and are corrected to return a "gene's-eye" view of the sequence. For a minus strand gene, upstream is in increasing genome coordinates, and the returned sequence has been reverse complemented. For a plus strand gene, upstream is decreasing genome coordinates, and the returned sequence can be read off the genome from left to right. You don't have to worry about strand, nor compensate for it. Another thing to keep in mind here is something suggested by Steve a couple of weeks ago: it may be better to think in terms of proximal regulatory regions, a full 5000 bp upstream AND downstream of the transcription start site. This is especially applicable to metazoans, and to mammals in particular. I hope this helps, - Paul > Thanks again, > Holly > > On 10/27/2012 6:19 PM, Steve Lianoglou wrote: > >> Hi Holly, >> >> On Sat, Oct 27, 2012 at 7:23 PM, Holly <xyang2 at="" uchicago.edu=""> wrote: >>> Dear all, >>> I am learning how to "Finding Candidate Binding Sites for Known >>> Transcription Factors via Sequence Matching" from the Bioconductor >>> online help file at >>> http://www.bioconductor.org/help/workflows/gene-regulation-tfbs/ >>> >>> When I ran the following codes, I can not match the results back to the >>> UCSC genome browser results. >>> >>> ||>|orfs <- as.character(mget("RHOA", org.Hs.egSYMBOL2EG)) >>> ||||>|| pfm.grhl <- query(MotifDb,"GRHL")[[1]] >>>> pcm.grhl <- round(100 * pfm.grhl) >>>> chromosomal.loc <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene") [orfs] >>>> promoter.seqs <- getPromoterSeq(chromosomal.loc, Hsapiens, upstream=1000, downstream=0) >>>> promoter.seqs <- unlist(promoter.seqs) >>>> matchPWM(pcm.grhl, promoter.seqs[[1]], "80%") >>> Views on a 1000-letter DNAString subject >>> subject: AGCCCAGGTCAGGTTAGAGACCACTGGTATTGTT...GGCACTCGGAGGCGCGCACG TCGTTCCCCGCCCT >>> views: >>> start end width >>> [1] 347 354 8 [AAGCCTCA] >>> >>>> chromosomal.loc >>> GRangesList of length 1: >>> $387 >>> GRanges with 2 ranges and 2 metadata columns: >>> seqnames ranges strand | tx_id tx_name >>> <rle> <iranges> <rle> | <integer> <character> >>> [1] chr3 [49396579, 49449428] - | 15517 uc010hku.3 >>> [2] chr3 [49396579, 49449526] - | 15518 uc003cwu.3 >>> >>> --- >>> seqlengths: >>> chr1 chr2 ... chrUn_gl000249 >>> 249250621 243199373 ... 38502 >>> >>> As|||49396579-1000+347=49395926, I use it as an absolute position.| >>> However, when I visit the UCSC genome browser, for the region chr3:||||49395926||-||||49395934.|| >>> I got [GTGCTCCTC], instead of [|||AAGCCTCA] or similar.| >>> >>> So, could anyone tell me how to calculate the absolute genome position from the output of of function matchPWM ?. >> Beware of the evil strand. >> >> This transcript is on the negative strand, so the 1000bp upstream of >> the first transcript you've listed starts at its end, namely >> `(end+1):(end+999)` >> >> HTH, >> -steve >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Paul, Your demon is much clear for a beginner. Thanks a lot. Talking back for searching putative binding site, when you say " Another thing to keep in mind here is something suggested by Steve a couple of weeks ago: it may be better to think in terms of proximal regulatory regions, a full 5000 bp upstream AND downstream of the transcription start site. This is especially applicable to metazoans, and to mammals in particular." How to get the sequence of, e.g. "100bp downstream of the transcription start site" using Bioconductor packages and R code? I appreciate if you can provide such demo code as well. As I understand, let*TSS* =(*transcription start site*), *TTS* =(*transcription* termination site getPromoterSeq(loc, Hsapiens, upstream=0, downstream=10) , we get the sequence TTS:TTS+10 getPromoterSeq(loc, Hsapiens, upstream=10, downstream=0) , we get the sequence TSS-10:TSS Now, we want to look at TSS-10:TSS+10, if I understand you correctly. Thank you, Holly On 10/27/2012 11:56 PM, Paul Shannon wrote: > Hi Holly, > > See comments and suggestions below. > > > On Oct 27, 2012, at 7:54 PM, Holly wrote: > >> >Hi, Steve, >> >Thank you. >> > >> >49449428+999-354 =49450073 >> >49449428+999-347= 49450080 >> >By searching >> > >> >chr3:49450073-49450080 I got [GGTGAGGC], which is reversely complementary to [AAGCCTCA] as expected. >> > >>> >>reverseComplement(DNAString("GGTGAGGC")) >> > 8-letter "DNAString" instance >> >seq: CAGCCTCA >> > >> >More question is, to screen the "promoter" region, what should I do? >> >Should I focus on the "upstream" of "positive" strand? Does it equally to search the "downstream" of "negative" strand, if the program reports sequence of only negative strain? > getPromoterSeq is strand-aware in two ways. For gene transcripts annotated to the minus strand: > > 1) It interprets 'upstream' as 'increasing chromosomal coordinates'. > 2) It returns the reverse complement of the extracted sequence, which is what you probably want. > > In the following, I specify an upstream length of just 10, to make experimentation easier: > > library(org.Hs.eg.db) > library(MotifDb) > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > library(BSgenome.Hsapiens.UCSC.hg19) > > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > geneID <-as.character (get ("RHOA", org.Hs.egSYMBOL2EG)) > loc <- transcriptsBy(txdb, by="gene") [geneID][[1]][1] > GRanges with 1 range and 2 metadata columns: > seqnames ranges strand | tx_id tx_name > <rle> <iranges> <rle> | <integer> <character> > [1] chr3 [49396579, 49449428] - | 15517 uc010hku.3 > > getPromoterSeq(loc, Hsapiens, upstream=10, downstream=0) > A DNAStringSet instance of length 1 > width seq > 10 TCCCCGCCCT > > If you point the UCSC genome browser, using genome=hg19, to chr3:49449428:49449437, click the 'complement bases' arrow in the top left, you will see, reading from right to left, and starting at chr3:49449437, the same sequence returned by getPromoterSeq: TCCCCGCCCT. > > Another handy trick is to ask getPromoterSeq to return 20 bases upstream. This is a long enough sequence to use UCSC BLAT, which will take you right to the proper location in the genome browser.http://genome.ucsc.edu/cgi-bin/hgBlat?command=start BLAT will try to match your input sequence on both strands, and report which one it found. > > I believe that all of this demonstrates that getPromoterSeq and its upstream and downstream arguments are relative to the gene transcript's TSS, and are corrected to return a "gene's-eye" view of the sequence. For a minus strand gene, upstream is in increasing genome coordinates, and the returned sequence has been reverse complemented. For a plus strand gene, upstream is decreasing genome coordinates, and the returned sequence can be read off the genome from left to right. You don't have to worry about strand, nor compensate for it. > > Another thing to keep in mind here is something suggested by Steve a couple of weeks ago: it may be better to think in terms of proximal regulatory regions, a full 5000 bp upstream AND downstream of the transcription start site. This is especially applicable to metazoans, and to mammals in particular. > > I hope this helps, > > - Paul > > >> >Thanks again, >> >Holly >> > >> >On 10/27/2012 6:19 PM, Steve Lianoglou wrote: >> > >>> >>Hi Holly, >>> >> >>> >>On Sat, Oct 27, 2012 at 7:23 PM, Holly<xyang2@uchicago.edu> wrote: >>>> >>>Dear all, >>>> >>>I am learning how to "Finding Candidate Binding Sites for Known >>>> >>>Transcription Factors via Sequence Matching" from the Bioconductor >>>> >>>online help file at >>>> >>>http://www.bioconductor.org/help/workflows/gene-regulation- tfbs/ >>>> >>> >>>> >>>When I ran the following codes, I can not match the results back to the >>>> >>>UCSC genome browser results. >>>> >>> >>>> >>>||>|orfs <- as.character(mget("RHOA", org.Hs.egSYMBOL2EG)) >>>> >>>||||>|| pfm.grhl <- query(MotifDb,"GRHL")[[1]] >>>>> >>>>pcm.grhl <- round(100 * pfm.grhl) >>>>> >>>>chromosomal.loc <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene") [orfs] >>>>> >>>>promoter.seqs <- getPromoterSeq(chromosomal.loc, Hsapiens, upstream=1000, downstream=0) >>>>> >>>>promoter.seqs <- unlist(promoter.seqs) >>>>> >>>>matchPWM(pcm.grhl, promoter.seqs[[1]], "80%") >>>> >>> Views on a 1000-letter DNAString subject >>>> >>>subject: AGCCCAGGTCAGGTTAGAGACCACTGGTATTGTT...GGCACTCGGAGGCGCG CACGTCGTTCCCCGCCCT >>>> >>>views: >>>> >>> start end width >>>> >>>[1] 347 354 8 [AAGCCTCA] >>>> >>> >>>>> >>>>chromosomal.loc >>>> >>>GRangesList of length 1: >>>> >>>$387 >>>> >>>GRanges with 2 ranges and 2 metadata columns: >>>> >>> seqnames ranges strand | tx_id tx_name >>>> >>> <rle> <iranges> <rle> | <integer> <character> >>>> >>> [1] chr3 [49396579, 49449428] - | 15517 uc010hku.3 >>>> >>> [2] chr3 [49396579, 49449526] - | 15518 uc003cwu.3 >>>> >>> >>>> >>>--- >>>> >>>seqlengths: >>>> >>> chr1 chr2 ... chrUn_gl000249 >>>> >>> 249250621 243199373 ... 38502 >>>> >>> >>>> >>>As|||49396579-1000+347=49395926, I use it as an absolute position.| >>>> >>>However, when I visit the UCSC genome browser, for the region chr3:||||49395926||-||||49395934.|| >>>> >>> I got [GTGCTCCTC], instead of [|||AAGCCTCA] or similar.| >>>> >>> >>>> >>>So, could anyone tell me how to calculate the absolute genome position from the output of of function matchPWM ?. >>> >>Beware of the evil strand. >>> >> >>> >>This transcript is on the negative strand, so the 1000bp upstream of >>> >>the first transcript you've listed starts at its end, namely >>> >>`(end+1):(end+999)` >>> >> >>> >>HTH, >>> >>-steve >>> >> >> > >> >_______________________________________________ >> >Bioconductor mailing list >> >Bioconductor@r-project.org >> >https://stat.ethz.ch/mailman/listinfo/bioconductor >> >Search the archives:http://news.gmane.org/gmane.science.biology.in formatics.conductor [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi Holly, Proposed simple solution inline below? - Paul On Oct 28, 2012, at 10:42 AM, Holly wrote: > Paul, > Your demon is much clear for a beginner. Thanks a lot. > Talking back for searching putative binding site, when you say " > Another thing to keep in mind here is something suggested by Steve a couple of weeks ago: it may be better to think in terms of proximal regulatory regions, a full 5000 bp upstream AND downstream of the transcription start site. This is especially applicable to metazoans, and to mammals in particular." > > How to get the sequence of, e.g. "100bp downstream of the transcription start site" using Bioconductor packages and R code? > I appreciate if you can provide such demo code as well. > > As I understand, let > TSS =(transcription start site), TTS =(transcription termination site > getPromoterSeq(loc, Hsapiens, upstream=0, downstream=10) , we get the sequence TTS:TTS+10 > getPromoterSeq(loc, Hsapiens, upstream=10, downstream=0) , we get the sequence TSS-10:TSS > Now, we want to look at TSS-10:TSS+10, if I understand you correctly. Just one small modification to the example code is needed: getPromoterSeq(loc, Hsapiens, upstream=10, downstream=10) > > Thank you, > Holly > > On 10/27/2012 11:56 PM, Paul Shannon wrote: >> Hi Holly, >> >> See comments and suggestions below. >> >> >> On Oct 27, 2012, at 7:54 PM, Holly wrote: >> >> >>> > >>> Hi, Steve, >>> >>> > >>> Thank you. >>> >>> > >>> > >>> 49449428+999-354 =49450073 >>> >>> > >>> 49449428+999-347= 49450080 >>> >>> > >>> By searching >>> >>> > >>> > >>> chr3:49450073-49450080 I got [GGTGAGGC], which is reversely complementary to [AAGCCTCA] as expected. >>> >>> > >>>> >> >>>> reverseComplement(DNAString("GGTGAGGC")) >>>> >>> > >>> 8-letter "DNAString" instance >>> >>> > >>> seq: CAGCCTCA >>> >>> > >>> > >>> More question is, to screen the "promoter" region, what should I do? >>> >>> > >>> Should I focus on the "upstream" of "positive" strand? Does it equally to search the "downstream" of "negative" strand, if the program reports sequence of only negative strain? >>> >> getPromoterSeq is strand-aware in two ways. For gene transcripts annotated to the minus strand: >> >> 1) It interprets 'upstream' as 'increasing chromosomal coordinates'. >> 2) It returns the reverse complement of the extracted sequence, which is what you probably want. >> >> In the following, I specify an upstream length of just 10, to make experimentation easier: >> >> library(org.Hs.eg.db) >> library(MotifDb) >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> library(BSgenome.Hsapiens.UCSC.hg19) >> >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >> geneID <-as.character (get ("RHOA", org.Hs.egSYMBOL2EG)) >> loc <- transcriptsBy(txdb, by="gene") [geneID][[1]][1] >> GRanges with 1 range and 2 metadata columns: >> seqnames ranges strand | tx_id tx_name >> <rle> <iranges> <rle> | <integer> <character> >> [1] chr3 [49396579, 49449428] - | 15517 uc010hku.3 >> >> getPromoterSeq(loc, Hsapiens, upstream=10, downstream=0) >> A DNAStringSet instance of length 1 >> width seq >> 10 TCCCCGCCCT >> >> If you point the UCSC genome browser, using genome=hg19, to chr3:49449428:49449437, click the 'complement bases' arrow in the top left, you will see, reading from right to left, and starting at chr3:49449437, the same sequence returned by getPromoterSeq: TCCCCGCCCT. >> >> Another handy trick is to ask getPromoterSeq to return 20 bases upstream. This is a long enough sequence to use UCSC BLAT, which will take you right to the proper location in the genome browser. >> http://genome.ucsc.edu/cgi-bin/hgBlat?command=start >> . BLAT will try to match your input sequence on both strands, and report which one it found. >> >> I believe that all of this demonstrates that getPromoterSeq and its upstream and downstream arguments are relative to the gene transcript's TSS, and are corrected to return a "gene's-eye" view of the sequence. For a minus strand gene, upstream is in increasing genome coordinates, and the returned sequence has been reverse complemented. For a plus strand gene, upstream is decreasing genome coordinates, and the returned sequence can be read off the genome from left to right. You don't have to worry about strand, nor compensate for it. >> >> Another thing to keep in mind here is something suggested by Steve a couple of weeks ago: it may be better to think in terms of proximal regulatory regions, a full 5000 bp upstream AND downstream of the transcription start site. This is especially applicable to metazoans, and to mammals in particular. >> >> I hope this helps, >> >> - Paul >> >> >> >>> > >>> Thanks again, >>> >>> > >>> Holly >>> >>> > >>> > >>> On 10/27/2012 6:19 PM, Steve Lianoglou wrote: >>> >>> > >>>> >> >>>> Hi Holly, >>>> >>>> >> >>>> >> On Sat, Oct 27, 2012 at 7:23 PM, Holly <xyang2 at="" uchicago.edu=""> >>>> wrote: >>>> >>>>> >>> >>>>> Dear all, >>>>> >>>>> >>> >>>>> I am learning how to "Finding Candidate Binding Sites for Known >>>>> >>>>> >>> >>>>> Transcription Factors via Sequence Matching" from the Bioconductor >>>>> >>>>> >>> >>>>> online help file at >>>>> >>>>> >>> http://www.bioconductor.org/help/workflows/gene-regulation- tfbs/ >>>>> >>> >>>>> >>> >>>>> When I ran the following codes, I can not match the results back to the >>>>> >>>>> >>> >>>>> UCSC genome browser results. >>>>> >>>>> >>> >>>>> >>> >>>>> ||>|orfs <- as.character(mget("RHOA", org.Hs.egSYMBOL2EG)) >>>>> >>>>> >>> >>>>> ||||>|| pfm.grhl <- query(MotifDb,"GRHL")[[1]] >>>>> >>>>>> >>>> >>>>>> pcm.grhl <- round(100 * pfm.grhl) >>>>>> >>>>>> >>>> >>>>>> chromosomal.loc <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene") [orfs] >>>>>> >>>>>> >>>> >>>>>> promoter.seqs <- getPromoterSeq(chromosomal.loc, Hsapiens, upstream=1000, downstream=0) >>>>>> >>>>>> >>>> >>>>>> promoter.seqs <- unlist(promoter.seqs) >>>>>> >>>>>> >>>> >>>>>> matchPWM(pcm.grhl, promoter.seqs[[1]], "80%") >>>>>> >>>>> >>> >>>>> Views on a 1000-letter DNAString subject >>>>> >>>>> >>> >>>>> subject: AGCCCAGGTCAGGTTAGAGACCACTGGTATTGTT...GGCACTCGGAGGCGCGCA CGTCGTTCCCCGCCCT >>>>> >>>>> >>> >>>>> views: >>>>> >>>>> >>> >>>>> start end width >>>>> >>>>> >>> >>>>> [1] 347 354 8 [AAGCCTCA] >>>>> >>>>> >>> >>>>>> >>>> >>>>>> chromosomal.loc >>>>>> >>>>> >>> >>>>> GRangesList of length 1: >>>>> >>>>> >>> >>>>> $387 >>>>> >>>>> >>> >>>>> GRanges with 2 ranges and 2 metadata columns: >>>>> >>>>> >>> >>>>> seqnames ranges strand | tx_id tx_name >>>>> >>>>> >>> >>>>> <rle> <iranges> <rle> | <integer> <character> >>>>> >>>>> >>> >>>>> [1] chr3 [49396579, 49449428] - | 15517 uc010hku.3 >>>>> >>>>> >>> >>>>> [2] chr3 [49396579, 49449526] - | 15518 uc003cwu.3 >>>>> >>>>> >>> >>>>> >>> >>>>> --- >>>>> >>>>> >>> >>>>> seqlengths: >>>>> >>>>> >>> >>>>> chr1 chr2 ... chrUn_gl000249 >>>>> >>>>> >>> >>>>> 249250621 243199373 ... 38502 >>>>> >>>>> >>> >>>>> >>> >>>>> As|||49396579-1000+347=49395926, I use it as an absolute position.| >>>>> >>>>> >>> >>>>> However, when I visit the UCSC genome browser, for the region chr3:||||49395926||-||||49395934.|| >>>>> >>>>> >>> >>>>> I got [GTGCTCCTC], instead of [|||AAGCCTCA] or similar.| >>>>> >>>>> >>> >>>>> >>> >>>>> So, could anyone tell me how to calculate the absolute genome position from the output of of function matchPWM ?. >>>>> >>>> >> >>>> Beware of the evil strand. >>>> >>>> >> >>>> >> >>>> This transcript is on the negative strand, so the 1000bp upstream of >>>> >>>> >> >>>> the first transcript you've listed starts at its end, namely >>>> >>>> >> >>>> `(end+1):(end+999)` >>>> >>>> >> >>>> >> >>>> HTH, >>>> >>>> >> >>>> -steve >>>> >>>> >> >>> > >>> > >>> _______________________________________________ >>> >>> > >>> Bioconductor mailing list >>> >>> > Bioconductor at r-project.org >>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY

Login before adding your answer.

Traffic: 754 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6