get promoter regions (bed format) of genes hg19
1
2
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 9.6 years ago
Dear all, I'm trying to obtain promoter regions of a list of genes. This is my granges example: htr_genes_gr = GRanges( seqnames = rep("chr1", 4), ranges = IRanges(start = c(131025, 761586, 762988, 860260), end = c(134836, 794826, 794826, 879955))) I wanted to use promoters(htr_genes_gr, upstream=2000) to get the upstream regions. However, I get something, but Im afraid that it does not match with the USCS genome browser. So, I take the output and view it in IGV with hg19 loaded. But the promoter seems shifted, i.e. it does not start where the gene begins. When I look at the IRanges web page, there is an indication that the genome version might be hg18. I am not sure if this is the problem. Anyways, does anyone have a suggestion how I can get the bed file of the promoter regions. I need the promoter regions of all genes in the end. Thanks a lot! -- output of sessionInfo(): sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base -- Sent via the guest posting facility at bioconductor.org.
IRanges IRanges • 5.9k views
ADD COMMENT
1
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States
Hi Ninni, A more complete description of the promoters() function is on the man page in the GenomicRanges package. You can find this by specifying the full method name: ?'promoters,GenomicRanges-method' The GRanges or GRangesList you supply as 'x' is assumed to contain gene ranges. Here is a snip from the man page stating that the assumption is that start(x) is the TSS: > ? ?promoters? returns an object of the same type and length as > ?x? containing promoter ranges. Promoter ranges extend around > the transcription start site (TSS) which is defined as > ?start(x)?. The ?upsteam? and ?downstream? arguments define > the number of nucleotides in the 5' and 3' direction, > respectively. The full range is defined as, > > (start(x) - upstream) to (start(x) + downstream - 1). So promoters() simply returns the region upstream and/or downstream of the range you supply as 'x'. If you have a generic range and want to determine if it falls within a gene you can use findOverlaps() on a TranscriptDb and your ranges. library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene genes <- transcriptsBy(txdb, "gene") hits <- findOverlaps(htr_genes_gr, genes) 'queryHits' is an index of 'htr_genes_gr' ranges that maps to the 'subjectHits' index of 'genes' ranges. Here we see that both range 2 and 4 in 'htr_genes_gr' each map to 2 different genes. >> hits > Hits of length 6 > queryLength: 4 > subjectLength: 23459 > queryHits subjectHits > <integer> <integer> > 1 1 19185 > 2 2 17448 > 3 2 20105 > 4 3 17448 > 5 4 5026 > 6 4 8270 Isolate the genes that the ranges in 'htr_genes_gr' hit: mygenes <- genes[subjectHits(hits)] Call promoters() on the gene ranges of interest. promoters(mygenes, upstream=2000) Valerie On 01/22/2014 01:12 AM, Ninni Nahm [guest] wrote: > > Dear all, > > I'm trying to obtain promoter regions of a list of genes. > This is my granges example: > > htr_genes_gr = GRanges( seqnames = rep("chr1", 4), > ranges = IRanges(start = c(131025, 761586, 762988, 860260), end = c(134836, 794826, 794826, 879955))) > > I wanted to use promoters(htr_genes_gr, upstream=2000) > to get the upstream regions. However, I get something, but Im afraid that it does not match with the USCS genome browser. So, I take the output and view it in IGV with hg19 loaded. But the promoter seems shifted, i.e. it does not start where the gene begins. When I look at the IRanges web page, there is an indication that the genome version might be hg18. I am not sure if this is the problem. Anyways, does anyone have a suggestion how I can get the bed file of the promoter regions. I need the promoter regions of all genes in the end. > Thanks a lot! > > > > -- output of sessionInfo(): > > sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Valerie Obenchain Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B155 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: vobencha at fhcrc.org Phone: (206) 667-3158 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Thanks, Val, for the details. Now I have what is either a shameless plug or a request for explanation. matchGenes() in _bumphunter_ is similar, although definitely not the same. I am not sure how to translate between the results, except to note that matchGenes uses nearest() with either select="arbitrary", by default, or select="all" -- rather than findOverlaps(). It is much faster than what you did, in effect because your transcriptsBy call is "compiled in". At any rate, the user just does library(bumphunter) # this uses nearest(select="arbitrary") matchGenes(htr_genes_gr, promoterDist=2000) or # this uses nearest(select="all") matchGenes(htr_genes_gr, promoterDist=2000, all=TRUE) The results are annotated based on the matching transcript's TSS, but neither answer has length 6 in this example. Maybe someone can elucidate the other differences (and perhaps some similarities). On Jan 22, 2014, at 11:56 AM, Valerie Obenchain wrote: > Hi Ninni, > > A more complete description of the promoters() function is on the man page in the GenomicRanges package. You can find this by specifying the full method name: > > ?'promoters,GenomicRanges-method' > > The GRanges or GRangesList you supply as 'x' is assumed to contain gene ranges. Here is a snip from the man page stating that the assumption is that start(x) is the TSS: > >> ? ?promoters? returns an object of the same type and length as >> ?x? containing promoter ranges. Promoter ranges extend around >> the transcription start site (TSS) which is defined as >> ?start(x)?. The ?upsteam? and ?downstream? arguments define >> the number of nucleotides in the 5' and 3' direction, >> respectively. The full range is defined as, >> >> (start(x) - upstream) to (start(x) + downstream - 1). > > So promoters() simply returns the region upstream and/or downstream of the range you supply as 'x'. If you have a generic range and want to determine if it falls within a gene you can use findOverlaps() on a TranscriptDb and your ranges. > > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > genes <- transcriptsBy(txdb, "gene") > hits <- findOverlaps(htr_genes_gr, genes) > > 'queryHits' is an index of 'htr_genes_gr' ranges that maps to the 'subjectHits' index of 'genes' ranges. Here we see that both range 2 and 4 in 'htr_genes_gr' each map to 2 different genes. > >>> hits >> Hits of length 6 >> queryLength: 4 >> subjectLength: 23459 >> queryHits subjectHits >> <integer> <integer> >> 1 1 19185 >> 2 2 17448 >> 3 2 20105 >> 4 3 17448 >> 5 4 5026 >> 6 4 8270 > > Isolate the genes that the ranges in 'htr_genes_gr' hit: > > mygenes <- genes[subjectHits(hits)] > > Call promoters() on the gene ranges of interest. > > promoters(mygenes, upstream=2000) > > > Valerie > > On 01/22/2014 01:12 AM, Ninni Nahm [guest] wrote: >> >> Dear all, >> >> I'm trying to obtain promoter regions of a list of genes. >> This is my granges example: >> >> htr_genes_gr = GRanges( seqnames = rep("chr1", 4), >> ranges = IRanges(start = c(131025, 761586, 762988, 860260), end = c(134836, 794826, 794826, 879955))) >> >> I wanted to use promoters(htr_genes_gr, upstream=2000) >> to get the upstream regions. However, I get something, but Im afraid that it does not match with the USCS genome browser. So, I take the output and view it in IGV with hg19 loaded. But the promoter seems shifted, i.e. it does not start where the gene begins. When I look at the IRanges web page, there is an indication that the genome version might be hg18. I am not sure if this is the problem. Anyways, does anyone have a suggestion how I can get the bed file of the promoter regions. I need the promoter regions of all genes in the end. >> Thanks a lot! >> >> >> >> -- output of sessionInfo(): >> >> sessionInfo() >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- > Valerie Obenchain > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B155 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: vobencha at fhcrc.org > Phone: (206) 667-3158 > Fax: (206) 667-1319 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY

Login before adding your answer.

Traffic: 986 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6