Genome position to miRNA or gene name
2
0
Entering edit mode
Daniel Brewer ★ 1.9k
@daniel-brewer-1791
Last seen 9.6 years ago
Hello, I have got a set of human genome locations that I have found using Biostrings and BSGenome alignment e.g. seqname start end strand patternID chr9 95978065 95978085 + TGAGGTAGTAGGTTGTATAGT chr11 121522487 121522507 - TGAGGTAGTAGGTTGTATAGT chr22 44887296 44887316 + TGAGGTAGTAGGTTGTATAGT chr22 44888235 44888256 + TGAGGTAGTAGGTTGTGTGGTT What I would like to know is whether this genome location is within a known miRNA or gene. What would the best way be to go about this? Many thanks Dan -- ************************************************************** Daniel Brewer, Ph.D. Institute of Cancer Research Molecular Carcinogenesis Email: daniel.brewer at icr.ac.uk ************************************************************** The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the a...{{dropped:2}}
miRNA Alignment GO Cancer BSgenome BSgenome miRNA Alignment GO Cancer BSgenome BSgenome • 1.5k views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 13 months ago
United States
Hi, > I have got a set of human genome locations that I have found using > Biostrings and BSGenome alignment e.g. > > seqname start end strand patternID > chr9 95978065 95978085 + TGAGGTAGTAGGTTGTATAGT > chr11 121522487 121522507 - TGAGGTAGTAGGTTGTATAGT > chr22 44887296 44887316 + TGAGGTAGTAGGTTGTATAGT > chr22 44888235 44888256 + TGAGGTAGTAGGTTGTGTGGTT > > What I would like to know is whether this genome location is within a > known miRNA or gene. What would the best way be to go about this? One way could be to grab the appropriate GTF file for Hsapiens here: ftp://ftp.ensembl.org/pub/current_gtf/ It's just a tab delimited file with genome annotations. You can just collect the lines for miRNA annotations and see if your positions fall in the bounds of the known/annotated miRNA's. For example, here's one: 18 miRNA exon 38162 38272 . + . gene_id "ENSG00000221441" ... You can also get miRNA data from miRBase (http://microrna.sanger.ac.uk/sequences/ ) perhaps it's a more complete set of data?). The data file I have from them doesn't have chromosome positions, but does have the stem- loop sequence, so that would require an intermediate alignment step before getting at what you're after. Probably not the best way, but a way nonetheless. If you're comfortable using biomaRt, I'm sure there's a way to pull down the same annotation info and do the comparison from there. Sorry ... no R code, but hopefully it's helpful nonetheless :-) -steve -- Steve Lianoglou Graduate Student: Physiology, Biophysics and Systems Biology Weill Medical College of Cornell University http://cbio.mskcc.org/~lianos
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 2 hours ago
United States
Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> writes: > Hi, > >> I have got a set of human genome locations that I have found using >> Biostrings and BSGenome alignment e.g. >> >> seqname start end strand patternID >> chr9 95978065 95978085 + TGAGGTAGTAGGTTGTATAGT >> chr11 121522487 121522507 - TGAGGTAGTAGGTTGTATAGT >> chr22 44887296 44887316 + TGAGGTAGTAGGTTGTATAGT >> chr22 44888235 44888256 + TGAGGTAGTAGGTTGTGTGGTT >> >> What I would like to know is whether this genome location is within a >> known miRNA or gene. What would the best way be to go about this? > > > One way could be to grab the appropriate GTF file for Hsapiens here: > > ftp://ftp.ensembl.org/pub/current_gtf/ > > It's just a tab delimited file with genome annotations. You can just > collect the lines for miRNA annotations and see if your positions fall > in the bounds of the known/annotated miRNA's. > > For example, here's one: > > 18 miRNA exon 38162 38272 . + . gene_id > "ENSG00000221441" ... > > You can also get miRNA data from miRBase > (http://microrna.sanger.ac.uk/sequences/ ) perhaps it's a more > complete set of data?). The data file I have from them doesn't have > chromosome positions, but does have the stem- > loop sequence, so that would require an intermediate alignment step > before getting at what you're after. > > Probably not the best way, but a way nonetheless. If you're > comfortable using biomaRt, I'm sure there's a way to pull down the > same annotation info and do the comparison from there. > > Sorry ... no R code, but hopefully it's helpful nonetheless :-) Very roughly, this fl <- "ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/genomes/hsa.gff" seems to be the Human miRNA data base from the Sanger. It parses in to an R data frame with gff <- read.table(fl) (hey, that's cool -- pulling the data directly from ftp!). The fourth and fifth columns are chromosomal positions; there is also information on chromosome (column 1) and strand (column 7). My strategy would be to loop over your aligned sequences by chromosome and strand, and for each subset construct an IRanges object (from the IRanges package) that contains the start and end position of all sequenes. Suppose we have the 'start' and 'end' of each sequence on chromosome 1 seqs <- IRanges(start, end) and the same for the gff data miRNAs <- with(gff[gff$V1 == "1" & gff$V7 == "+",], IRanges(V4, V5)) Then use 'overlap' from IRanges. Along the lines of overlap(miRNAs, seqs, multiple=FALSE) A little messy to keep track of everything, but should be computationally quite efficient, even for very large numbers of sequences. I think also that import.gff (in rtracklayer) and the alignment data structures in Biostrings are all based on the same classes as the IRanges, and that with a little bit of cleverness the software might do all the book-keeping for you. Martin > -steve > > -- > Steve Lianoglou > Graduate Student: Physiology, Biophysics and Systems Biology > Weill Medical College of Cornell University > > http://cbio.mskcc.org/~lianos > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
Hi, On Jan 22, 2009, at 12:27 AM, Martin Morgan wrote: > ... [lot's of snippage] ... > Very roughly, this > > fl <- "ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/genomes/hsa.gff > " > > seems to be the Human miRNA data base from the Sanger. It parses in to > an R data frame with > > gff <- read.table(fl) > > > (hey, that's cool -- pulling the data directly from ftp!). The fourth > and fifth columns are chromosomal positions; there is also information > on chromosome (column 1) and strand (column 7). I never paid close enough attention to what types of sources were expected here to realize you can do that ... that's sweet. > My strategy would be to loop over your aligned sequences by chromosome > and strand, and for each subset construct an IRanges object (from the > IRanges package) that contains the start and end position of all > sequenes. Suppose we have the 'start' and 'end' of each sequence on > chromosome 1 > > seqs <- IRanges(start, end) > > and the same for the gff data > > miRNAs <- with(gff[gff$V1 == "1" & gff$V7 == "+",], IRanges(V4, V5)) > > Then use 'overlap' from IRanges. Along the lines of > > overlap(miRNAs, seqs, multiple=FALSE) When I did a similar thing to what was asked for a while ago (essentially seeing where my own regions of the genome "collided" with annotated regions), I did pretty much what I described before. It involved subtracting my start sites from the annotated start sites and checking to see if the subtracted start/ends combined in one of several ways to give me a collision/hit. There was a fare share of tedium involved in that ... Thanks for pointing out how to do it the smart way by leveraging code already written in the packages, which lead to tripping over the Interval Tree data structure as well. This might have been the coolest thing I've seen all day ;-) Note to self is to look at the IRanges and related packages much more closely ... Thanks again, -steve -- Steve Lianoglou Graduate Student: Physiology, Biophysics and Systems Biology Weill Medical College of Cornell University http://cbio.mskcc.org/~lianos
ADD REPLY

Login before adding your answer.

Traffic: 821 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6