getting gene ranges from a TranscriptDb (GenomicFeatures package)

0

Entering edit mode

Elizabeth Purdom ▴ 210

@elizabeth-purdom-2486

Last seen 2.4 years ago

USA/ Berkeley/UC Berkeley

Hello, I would like to get the range of genes from a TranscriptDb object. Ideally, I would like to do: txdb <- loadFeatures(system.file("extdata", "UCSC_knownGene_sample.sqlite", package="GenomicFeatures")) genes(txdb) And have this return the entire genomic range of the gene (much like transcripts() returns the genomic range of the transcript, including the introns). However, this does not exist. So I have made a little function to do this: genes<-function(txdb){ txByGene<-transcriptsBy(txdb,"gene") geneStarts<-sapply(start(txByGene),min) geneEnds<-sapply(end(txByGene),max) geneChr<-sapply(seqnames(txByGene),unique) geneStrand<-sapply(strand(txByGene),unique) GRanges(seqnames=geneChr,strand=geneStrand,ranges=IRanges(geneStarts,g eneEnds),gene_name=names(txByGene)) } However, this has run me into all kinds of problems because the same gene name can be used for transcripts on different chromosomes, namely the X and Y chromosome (which makes me wonder about the safety of the assumption that gene names are unique -- does this count as unique?). Before I go further down this route, is there not a simple way to get this information? Thanks, Elizabeth -- Elizabeth Purdom Assistant Professor Department of Statistics UC, Berkeley Office: 433 Evans Hall (510) 642-6154 (office) (510) 642-7892 (fax) Mailing: 367 Evans Hall Berkeley, CA 94720-3860 epurdom at stat.berkeley.edu

GO TranscriptDb GO TranscriptDb • 1.6k views

ADD COMMENT • link updated 13.7 years ago by Sean Davis 21k • written 13.7 years ago by Elizabeth Purdom ▴ 210

0

Entering edit mode

Sean Davis 21k

@sean-davis-490

Last seen 4 months ago

United States

On Tue, Sep 14, 2010 at 5:43 PM, Elizabeth Purdom <epurdom@stat.berkeley.edu> wrote: > Hello, > > I would like to get the range of genes from a TranscriptDb object. Ideally, > I would like to do: > > txdb <- loadFeatures(system.file("extdata", "UCSC_knownGene_sample.sqlite", > package="GenomicFeatures")) > genes(txdb) > > And have this return the entire genomic range of the gene (much like > transcripts() returns the genomic range of the transcript, including the > introns). However, this does not exist. So I have made a little function to > do this: > > genes<-function(txdb){ > txByGene<-transcriptsBy(txdb,"gene") > geneStarts<-sapply(start(txByGene),min) > geneEnds<-sapply(end(txByGene),max) > geneChr<-sapply(seqnames(txByGene),unique) > geneStrand<-sapply(strand(txByGene),unique) > GRanges(seqnames=geneChr,strand=geneStrand,ranges=IRanges(geneStart s,geneEnds),gene_name=names(txByGene)) > > } > > However, this has run me into all kinds of problems because the same gene > name can be used for transcripts on different chromosomes, namely the X and > Y chromosome (which makes me wonder about the safety of the assumption that > gene names are unique -- does this count as unique?). > http://en.wikipedia.org/wiki/Pseudoautosomal_region > > Before I go further down this route, is there not a simple way to get this > information? > > Stating the obvious, probably, but this is not a question with a simple answer. You may have to settle on some sort of heuristic (choose a random region, the longest region, the one containing the most transcripts, etc.) to sort out such cases or simply ignore them altogether (throw them away). Unfortunately, these oddities are present in the human genome (and other expanded genomes) and it is not possible to totally avoid them. Sean [[alternative HTML version deleted]]

ADD COMMENT • link 13.7 years ago Sean Davis 21k

0

Entering edit mode

> > > Before I go further down this route, is there not a simple way to > get this information? > > > Stating the obvious, probably, but this is not a question with a > simple answer. You may have to settle on some sort of heuristic > (choose a random region, the longest region, the one containing the > most transcripts, etc.) to sort out such cases or simply ignore them > altogether (throw them away). Unfortunately, these oddities are > present in the human genome (and other expanded genomes) and it is not > possible to totally avoid them. > I understand the possible pitfalls (and the biology for why they would happen), but at the same time, it is still a pretty standard question that many databases give some answer to. Generally, I understand gene boundaries as the largest region that contains the transcripts annotated to be in that gene. This is the case for Ensembl, I believe (where you can definitely query the database and get out gene start and stop points for a gene id). If there are multiple chromosomes for the same gene id, these could be returned as separate entries. I would like to have a database query that accomplishes this, like the transcripts and exons functions, because my naive implementation (in addition to the problems with the multiple chrX and chrY mappings) is extremely slow. The reason I want this is that I want to pull up all reads from a bam file associated with this gene, regardless of the transcript. If I give per transcript information, I get reads coming up multiple times, because the transcripts from the same gene substantially overlap and the reads are pulled up per transcript. This might be useful in some setting, but generally I'm going to want to then post-process the reads per gene to deal with the multiple transcript problem (yes I know genes can overlap too, but this is a problem of smaller magnitude, and I think I can deal with that issue separately). Moreover, the number of genes is much smaller than the number of transcripts, and I want to reduce the number of queries of the bam file, as well as the number of reads I pull into memory at a time. Thanks, Elizabeth [[alternative HTML version deleted]]

ADD REPLY • link 13.7 years ago Elizabeth Purdom ▴ 210

0

Entering edit mode

On Tue, Sep 14, 2010 at 7:53 PM, Elizabeth Purdom <epurdom at="" stat.berkeley.edu=""> wrote: > >> >> >> ? ? Before I go further down this route, is there not a simple way to >> ? ? get this information? >> >> >> Stating the obvious, probably, but this is not a question with a >> simple answer. ?You may have to settle on some sort of heuristic >> (choose a random region, the longest region, the one containing the >> most transcripts, etc.) to sort out such cases or simply ignore them >> altogether (throw them away). ?Unfortunately, these oddities are >> present in the human genome (and other expanded genomes) and it is not >> possible to totally avoid them. >> > I understand the possible pitfalls (and the biology for why they would > happen), but at the same time, it is still a pretty standard question > that many databases give some answer to. Generally, I understand gene > boundaries as the largest region that contains the transcripts annotated > to be in that gene. This is the case for Ensembl, I believe (where you > can definitely query the database and get out gene start and stop points You could use org.Hs.eg.db for example and acquire the CHRLOC and CHRLOCEND data for each Gene ID. Note, however > mk = mappedkeys(org.Hs.egCHRLOC) > table(sapply(mget(mk, org.Hs.egCHRLOC), length)) 1 2 3 4 5 6 7 8 9 10 12 13 14 17626 2753 622 196 72 56 65 55 14 20 2 1 11 15 16 17 18 19 24 6 5 2 2 3 1 you will need to define a policy for dealing with the multiplicities, e.g. > which(sapply(mget(mk, org.Hs.egCHRLOC), length)==16) 259197 5696 6891 8859 9374 6510 14049 16604 20056 20884 > get("5696", org.Hs.egSYMBOL) [1] "PSMB8" > get("5696", org.Hs.egCHRLOC) 6 6 6_apd_hap1 6_apd_hap1 6_cox_hap2 6_cox_hap2 -32808496 -32808496 -4095502 -4095502 -4253026 -4253026 6_dbb_hap3 6_dbb_hap3 6_mann_hap4 6_mann_hap4 6_mcf_hap5 6_mcf_hap5 -4089876 -4089876 -4265694 -4265694 -4145372 -4145372 6_qbl_hap6 6_qbl_hap6 6_ssto_hap7 6_ssto_hap7 -4040605 -4040605 -4239268 -4239268 > get("5696", org.Hs.egCHRLOCEND) 6 6 6_apd_hap1 6_apd_hap1 6_cox_hap2 6_cox_hap2 -32812712 -32811816 -4099716 -4098820 -4257240 -4256344 6_dbb_hap3 6_dbb_hap3 6_mann_hap4 6_mann_hap4 6_mcf_hap5 6_mcf_hap5 -4094090 -4093194 -4269908 -4269012 -4149579 -4148693 6_qbl_hap6 6_qbl_hap6 6_ssto_hap7 6_ssto_hap7 -4044819 -4043923 -4243485 -4242589 it is not hard to think of code that will implement the policy you allude to. i think it is unwise for bioconductor core packages to devise implementations of such policies -- these policies need to be explicitly selected and justified by the investigators using them > for a gene id). If there are multiple chromosomes for the same gene id, > these could be returned as separate entries. I would like to have a > database query that accomplishes this, like the transcripts and exons > functions, because my naive implementation (in addition to the problems > with the multiple chrX and chrY mappings) is extremely slow. Perhaps it is slow, but it is clear what is being done, and it is a one-off computation. You get the GRanges structure you want, and save it until versions change, and that's that. Now if we look at the different approaches suggested so far, we see > library(GenomicFeatures) > hg19 = makeTranscriptDbFromUCSC("hg19") Download the knownGene table ... OK Download the knownToLocusLink table ... OK Extract the 'transcripts' data frame ... OK Extract the 'splicings' data frame ... OK Download and preprocess the 'chrominfo' data frame ... OK Prepare the 'metadata' data frame ... OK Make the TranscriptDb object ... OK > transcriptsBy(hg19, "gene")$"100" GRanges with 2 ranges and 2 elementMetadata values seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] chr20 [43248164, 43280376] - | 70158 uc002xmj.2 [2] chr20 [43248164, 43280376] - | 70159 uc010ggt.2 seqlengths chr1 chr2 ... chr18_gl000207_random 249250621 243199373 ... 4262 > get("100", org.Hs.egUCSCKG) [1] "uc002xmj.2" "uc010ggt.2" > get("100", org.Hs.egCHRLOC) 20 -43248163 > get("100", org.Hs.egCHRLOCEND) 20 -43280376 For this example, the transcript table and the org.Hs.egCHRLOC data seem consistent. i believe that programming the gene boundary GRanges that you want from the org.Hs.eg will be easier than the ranges manipulations you find slow, and multiplicity handling can be factored out from range manipulation. > > The reason I want this is that I want to pull up all reads from a bam > file associated with this gene, regardless of the transcript. If I give > per transcript information, I get reads coming up multiple times, > because the transcripts from the same gene substantially overlap and the > reads are pulled up per transcript. This might be useful in some > setting, but generally I'm going to want to then post-process the reads > per gene to deal with the multiple transcript problem (yes I know genes > can overlap too, but this is a problem of smaller magnitude, and I think > I can deal with that issue separately). Moreover, the number of genes is > much smaller than the number of transcripts, and I want to reduce the > number of queries of the bam file, as well as the number of reads I pull > into memory at a time. > > Thanks, > Elizabeth > > > > > > ? ? ? ?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 13.7 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

On Tue, Sep 14, 2010 at 7:53 PM, Elizabeth Purdom <epurdom@stat.berkeley.edu> wrote: > > > > > > > Before I go further down this route, is there not a simple way to > > get this information? > > > > > > Stating the obvious, probably, but this is not a question with a > > simple answer. You may have to settle on some sort of heuristic > > (choose a random region, the longest region, the one containing the > > most transcripts, etc.) to sort out such cases or simply ignore them > > altogether (throw them away). Unfortunately, these oddities are > > present in the human genome (and other expanded genomes) and it is not > > possible to totally avoid them. > > > I understand the possible pitfalls (and the biology for why they would > happen), but at the same time, it is still a pretty standard question > that many databases give some answer to. Generally, I understand gene > boundaries as the largest region that contains the transcripts annotated > to be in that gene. This is the case for Ensembl, I believe (where you > can definitely query the database and get out gene start and stop points > for a gene id). If there are multiple chromosomes for the same gene id, > these could be returned as separate entries. I would like to have a > database query that accomplishes this, Hi, Elizabeth. Just to answer your question directly.... It is possible to do this type of thing in SQL and connecting to the public mysql server at UCSC is possible using RMySQL or some other mysql client (see here: http://genome.ucsc.edu/FAQ/FAQdownloads.html#download29). I provide a few (very lightly tested) examples. Each runs in less than a second on our local UCSC mysql mirror. This little SQL grabs the min(start) and max(end) for each locuslink directly from the UCSC database. select value as locuslink,min(txStart) as s,max(txEnd) as e,chrom from knownGene kg join knownToLocusLink kl on kg.name=kl.name group by value,chrom; The next just counts the number of such records returned from above that end up with records on more than one chromosome (281 LocusLinks, hg18, knownGene) select locuslink,count(*) from (select value as locuslink,min(txStart) as s,max(txEnd) as e, max(txEnd)-min(txStart) as d,chrom from knownGene kg join knownToLocusLink kl on kg.name=kl.name group by value,chrom) t group by t.locuslink having count(*)>1; This final one pulls out the single longest LocusLink record for each LocusLink. Results will then be unique. select t.* from (select value as locuslink,min(txStart) as s,max(txEnd) as e, max(txEnd)-min(txStart) as d,chrom from knownGene kg join knownToLocusLink kl on kg.name=kl.name group by value,chrom) t group by t.locuslink having d=max(d); I use LocusLink here to match the ancient UCSC terminology, but, of course, we are talking about Entrez Gene IDs. Hope that helps a bit. Sean > like the transcripts and exons > functions, because my naive implementation (in addition to the problems > with the multiple chrX and chrY mappings) is extremely slow. > > The reason I want this is that I want to pull up all reads from a bam > file associated with this gene, regardless of the transcript. If I give > per transcript information, I get reads coming up multiple times, > because the transcripts from the same gene substantially overlap and the > reads are pulled up per transcript. This might be useful in some > setting, but generally I'm going to want to then post-process the reads > per gene to deal with the multiple transcript problem (yes I know genes > can overlap too, but this is a problem of smaller magnitude, and I think > I can deal with that issue separately). Moreover, the number of genes is > much smaller than the number of transcripts, and I want to reduce the > number of queries of the bam file, as well as the number of reads I pull > into memory at a time. > > Thanks, > Elizabeth > > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 13.7 years ago Sean Davis 21k

Login before adding your answer.