Genome position to miRNA or gene name

0

Entering edit mode

Daniel Brewer ★ 1.9k

@daniel-brewer-1791

Last seen 10.1 years ago

Hello, I have got a set of human genome locations that I have found using Biostrings and BSGenome alignment e.g. seqname start end strand patternID chr9 95978065 95978085 + TGAGGTAGTAGGTTGTATAGT chr11 121522487 121522507 - TGAGGTAGTAGGTTGTATAGT chr22 44887296 44887316 + TGAGGTAGTAGGTTGTATAGT chr22 44888235 44888256 + TGAGGTAGTAGGTTGTGTGGTT What I would like to know is whether this genome location is within a known miRNA or gene. What would the best way be to go about this? Many thanks Dan -- ************************************************************** Daniel Brewer, Ph.D. Institute of Cancer Research Molecular Carcinogenesis Email: daniel.brewer at icr.ac.uk ************************************************************** The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the a...{{dropped:2}}

miRNA Alignment GO Cancer BSgenome BSgenome miRNA Alignment GO Cancer BSgenome BSgenome • 1.7k views

ADD COMMENT • link updated 15.7 years ago by Martin Morgan 25k • written 15.7 years ago by Daniel Brewer ★ 1.9k

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 19 months ago

United States

Hi, > I have got a set of human genome locations that I have found using > Biostrings and BSGenome alignment e.g. > > seqname start end strand patternID > chr9 95978065 95978085 + TGAGGTAGTAGGTTGTATAGT > chr11 121522487 121522507 - TGAGGTAGTAGGTTGTATAGT > chr22 44887296 44887316 + TGAGGTAGTAGGTTGTATAGT > chr22 44888235 44888256 + TGAGGTAGTAGGTTGTGTGGTT > > What I would like to know is whether this genome location is within a > known miRNA or gene. What would the best way be to go about this? One way could be to grab the appropriate GTF file for Hsapiens here: ftp://ftp.ensembl.org/pub/current_gtf/ It's just a tab delimited file with genome annotations. You can just collect the lines for miRNA annotations and see if your positions fall in the bounds of the known/annotated miRNA's. For example, here's one: 18 miRNA exon 38162 38272 . + . gene_id "ENSG00000221441" ... You can also get miRNA data from miRBase (http://microrna.sanger.ac.uk/sequences/ ) perhaps it's a more complete set of data?). The data file I have from them doesn't have chromosome positions, but does have the stem- loop sequence, so that would require an intermediate alignment step before getting at what you're after. Probably not the best way, but a way nonetheless. If you're comfortable using biomaRt, I'm sure there's a way to pull down the same annotation info and do the comparison from there. Sorry ... no R code, but hopefully it's helpful nonetheless :-) -steve -- Steve Lianoglou Graduate Student: Physiology, Biophysics and Systems Biology Weill Medical College of Cornell University http://cbio.mskcc.org/~lianos

ADD COMMENT • link 15.7 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 11 weeks ago

United States

Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> writes: > Hi, > >> I have got a set of human genome locations that I have found using >> Biostrings and BSGenome alignment e.g. >> >> seqname start end strand patternID >> chr9 95978065 95978085 + TGAGGTAGTAGGTTGTATAGT >> chr11 121522487 121522507 - TGAGGTAGTAGGTTGTATAGT >> chr22 44887296 44887316 + TGAGGTAGTAGGTTGTATAGT >> chr22 44888235 44888256 + TGAGGTAGTAGGTTGTGTGGTT >> >> What I would like to know is whether this genome location is within a >> known miRNA or gene. What would the best way be to go about this? > > > One way could be to grab the appropriate GTF file for Hsapiens here: > > ftp://ftp.ensembl.org/pub/current_gtf/ > > It's just a tab delimited file with genome annotations. You can just > collect the lines for miRNA annotations and see if your positions fall > in the bounds of the known/annotated miRNA's. > > For example, here's one: > > 18 miRNA exon 38162 38272 . + . gene_id > "ENSG00000221441" ... > > You can also get miRNA data from miRBase > (http://microrna.sanger.ac.uk/sequences/ ) perhaps it's a more > complete set of data?). The data file I have from them doesn't have > chromosome positions, but does have the stem- > loop sequence, so that would require an intermediate alignment step > before getting at what you're after. > > Probably not the best way, but a way nonetheless. If you're > comfortable using biomaRt, I'm sure there's a way to pull down the > same annotation info and do the comparison from there. > > Sorry ... no R code, but hopefully it's helpful nonetheless :-) Very roughly, this fl <- "ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/genomes/hsa.gff" seems to be the Human miRNA data base from the Sanger. It parses in to an R data frame with gff <- read.table(fl) (hey, that's cool -- pulling the data directly from ftp!). The fourth and fifth columns are chromosomal positions; there is also information on chromosome (column 1) and strand (column 7). My strategy would be to loop over your aligned sequences by chromosome and strand, and for each subset construct an IRanges object (from the IRanges package) that contains the start and end position of all sequenes. Suppose we have the 'start' and 'end' of each sequence on chromosome 1 seqs <- IRanges(start, end) and the same for the gff data miRNAs <- with(gff[gff$V1 == "1" & gff$V7 == "+",], IRanges(V4, V5)) Then use 'overlap' from IRanges. Along the lines of overlap(miRNAs, seqs, multiple=FALSE) A little messy to keep track of everything, but should be computationally quite efficient, even for very large numbers of sequences. I think also that import.gff (in rtracklayer) and the alignment data structures in Biostrings are all based on the same classes as the IRanges, and that with a little bit of cleverness the software might do all the book-keeping for you. Martin > -steve > > -- > Steve Lianoglou > Graduate Student: Physiology, Biophysics and Systems Biology > Weill Medical College of Cornell University > > http://cbio.mskcc.org/~lianos > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793

ADD COMMENT • link 15.7 years ago Martin Morgan 25k

0

Entering edit mode

Hi, On Jan 22, 2009, at 12:27 AM, Martin Morgan wrote: > ... [lot's of snippage] ... > Very roughly, this > > fl <- "ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/genomes/hsa.gff > " > > seems to be the Human miRNA data base from the Sanger. It parses in to > an R data frame with > > gff <- read.table(fl) > > > (hey, that's cool -- pulling the data directly from ftp!). The fourth > and fifth columns are chromosomal positions; there is also information > on chromosome (column 1) and strand (column 7). I never paid close enough attention to what types of sources were expected here to realize you can do that ... that's sweet. > My strategy would be to loop over your aligned sequences by chromosome > and strand, and for each subset construct an IRanges object (from the > IRanges package) that contains the start and end position of all > sequenes. Suppose we have the 'start' and 'end' of each sequence on > chromosome 1 > > seqs <- IRanges(start, end) > > and the same for the gff data > > miRNAs <- with(gff[gff$V1 == "1" & gff$V7 == "+",], IRanges(V4, V5)) > > Then use 'overlap' from IRanges. Along the lines of > > overlap(miRNAs, seqs, multiple=FALSE) When I did a similar thing to what was asked for a while ago (essentially seeing where my own regions of the genome "collided" with annotated regions), I did pretty much what I described before. It involved subtracting my start sites from the annotated start sites and checking to see if the subtracted start/ends combined in one of several ways to give me a collision/hit. There was a fare share of tedium involved in that ... Thanks for pointing out how to do it the smart way by leveraging code already written in the packages, which lead to tripping over the Interval Tree data structure as well. This might have been the coolest thing I've seen all day ;-) Note to self is to look at the IRanges and related packages much more closely ... Thanks again, -steve -- Steve Lianoglou Graduate Student: Physiology, Biophysics and Systems Biology Weill Medical College of Cornell University http://cbio.mskcc.org/~lianos

ADD REPLY • link 15.7 years ago Steve Lianoglou ★ 13k

Login before adding your answer.