Hello,
I have got a set of human genome locations that I have found using
Biostrings and BSGenome alignment e.g.
seqname start end strand patternID
chr9 95978065 95978085 + TGAGGTAGTAGGTTGTATAGT
chr11 121522487 121522507 - TGAGGTAGTAGGTTGTATAGT
chr22 44887296 44887316 + TGAGGTAGTAGGTTGTATAGT
chr22 44888235 44888256 + TGAGGTAGTAGGTTGTGTGGTT
What I would like to know is whether this genome location is within a
known miRNA or gene. What would the best way be to go about this?
Many thanks
Dan
--
**************************************************************
Daniel Brewer, Ph.D.
Institute of Cancer Research
Molecular Carcinogenesis
Email: daniel.brewer at icr.ac.uk
**************************************************************
The Institute of Cancer Research: Royal Cancer Hospital, a charitable
Company Limited by Guarantee, Registered in England under Company No.
534147 with its Registered Office at 123 Old Brompton Road, London SW7
3RP.
This e-mail message is confidential and for use by the
a...{{dropped:2}}
Hi,
> I have got a set of human genome locations that I have found using
> Biostrings and BSGenome alignment e.g.
>
> seqname start end strand patternID
> chr9 95978065 95978085 + TGAGGTAGTAGGTTGTATAGT
> chr11 121522487 121522507 - TGAGGTAGTAGGTTGTATAGT
> chr22 44887296 44887316 + TGAGGTAGTAGGTTGTATAGT
> chr22 44888235 44888256 + TGAGGTAGTAGGTTGTGTGGTT
>
> What I would like to know is whether this genome location is within
a
> known miRNA or gene. What would the best way be to go about this?
One way could be to grab the appropriate GTF file for Hsapiens here:
ftp://ftp.ensembl.org/pub/current_gtf/
It's just a tab delimited file with genome annotations. You can just
collect the lines for miRNA annotations and see if your positions fall
in the bounds of the known/annotated miRNA's.
For example, here's one:
18 miRNA exon 38162 38272 . + . gene_id
"ENSG00000221441" ...
You can also get miRNA data from miRBase
(http://microrna.sanger.ac.uk/sequences/
) perhaps it's a more complete set of data?). The data file I have
from them doesn't have chromosome positions, but does have the stem-
loop sequence, so that would require an intermediate alignment step
before getting at what you're after.
Probably not the best way, but a way nonetheless. If you're
comfortable using biomaRt, I'm sure there's a way to pull down the
same annotation info and do the comparison from there.
Sorry ... no R code, but hopefully it's helpful nonetheless :-)
-steve
--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University
http://cbio.mskcc.org/~lianos
Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> writes:
> Hi,
>
>> I have got a set of human genome locations that I have found using
>> Biostrings and BSGenome alignment e.g.
>>
>> seqname start end strand patternID
>> chr9 95978065 95978085 + TGAGGTAGTAGGTTGTATAGT
>> chr11 121522487 121522507 -
TGAGGTAGTAGGTTGTATAGT
>> chr22 44887296 44887316 +
TGAGGTAGTAGGTTGTATAGT
>> chr22 44888235 44888256 +
TGAGGTAGTAGGTTGTGTGGTT
>>
>> What I would like to know is whether this genome location is within
a
>> known miRNA or gene. What would the best way be to go about this?
>
>
> One way could be to grab the appropriate GTF file for Hsapiens here:
>
> ftp://ftp.ensembl.org/pub/current_gtf/
>
> It's just a tab delimited file with genome annotations. You can just
> collect the lines for miRNA annotations and see if your positions
fall
> in the bounds of the known/annotated miRNA's.
>
> For example, here's one:
>
> 18 miRNA exon 38162 38272 . + . gene_id
> "ENSG00000221441" ...
>
> You can also get miRNA data from miRBase
> (http://microrna.sanger.ac.uk/sequences/ ) perhaps it's a more
> complete set of data?). The data file I have from them doesn't have
> chromosome positions, but does have the stem-
> loop sequence, so that would require an intermediate alignment step
> before getting at what you're after.
>
> Probably not the best way, but a way nonetheless. If you're
> comfortable using biomaRt, I'm sure there's a way to pull down the
> same annotation info and do the comparison from there.
>
> Sorry ... no R code, but hopefully it's helpful nonetheless :-)
Very roughly, this
fl <-
"ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/genomes/hsa.gff"
seems to be the Human miRNA data base from the Sanger. It parses in to
an R data frame with
gff <- read.table(fl)
(hey, that's cool -- pulling the data directly from ftp!). The fourth
and fifth columns are chromosomal positions; there is also information
on chromosome (column 1) and strand (column 7).
My strategy would be to loop over your aligned sequences by chromosome
and strand, and for each subset construct an IRanges object (from the
IRanges package) that contains the start and end position of all
sequenes. Suppose we have the 'start' and 'end' of each sequence on
chromosome 1
seqs <- IRanges(start, end)
and the same for the gff data
miRNAs <- with(gff[gff$V1 == "1" & gff$V7 == "+",], IRanges(V4, V5))
Then use 'overlap' from IRanges. Along the lines of
overlap(miRNAs, seqs, multiple=FALSE)
A little messy to keep track of everything, but should be
computationally quite efficient, even for very large numbers of
sequences. I think also that import.gff (in rtracklayer) and the
alignment data structures in Biostrings are all based on the same
classes as the IRanges, and that with a little bit of cleverness the
software might do all the book-keeping for you.
Martin
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Physiology, Biophysics and Systems Biology
> Weill Medical College of Cornell University
>
> http://cbio.mskcc.org/~lianos
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793
Hi,
On Jan 22, 2009, at 12:27 AM, Martin Morgan wrote:
> ... [lot's of snippage] ...
> Very roughly, this
>
> fl <-
"ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/genomes/hsa.gff
> "
>
> seems to be the Human miRNA data base from the Sanger. It parses in
to
> an R data frame with
>
> gff <- read.table(fl)
>
>
> (hey, that's cool -- pulling the data directly from ftp!). The
fourth
> and fifth columns are chromosomal positions; there is also
information
> on chromosome (column 1) and strand (column 7).
I never paid close enough attention to what types of sources were
expected here to realize you can do that ... that's sweet.
> My strategy would be to loop over your aligned sequences by
chromosome
> and strand, and for each subset construct an IRanges object (from
the
> IRanges package) that contains the start and end position of all
> sequenes. Suppose we have the 'start' and 'end' of each sequence on
> chromosome 1
>
> seqs <- IRanges(start, end)
>
> and the same for the gff data
>
> miRNAs <- with(gff[gff$V1 == "1" & gff$V7 == "+",], IRanges(V4,
V5))
>
> Then use 'overlap' from IRanges. Along the lines of
>
> overlap(miRNAs, seqs, multiple=FALSE)
When I did a similar thing to what was asked for a while ago
(essentially seeing where my own regions of the genome "collided" with
annotated regions), I did pretty much what I described before.
It involved subtracting my start sites from the annotated start sites
and checking to see if the subtracted start/ends combined in one of
several ways to give me a collision/hit. There was a fare share of
tedium involved in that ...
Thanks for pointing out how to do it the smart way by leveraging code
already written in the packages, which lead to tripping over the
Interval Tree data structure as well. This might have been the coolest
thing I've seen all day ;-)
Note to self is to look at the IRanges and related packages much more
closely ...
Thanks again,
-steve
--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University
http://cbio.mskcc.org/~lianos