Question about using Biostrings & BSgenome

J.delasHeras@ed.ac.uk ★ 1.9k

@jdelasherasedacuk-1189

Last seen 8.7 years ago

United Kingdom

I haven't yet used either of these packages, but it looks like something I may want to look at.

I was wondering if I can use these packages together with something like 'BSgenome.Hsapiens.UCSC.hg18' to extract sequences around every TSS, for instance. I have a couple of different oligo array designs, both in human and mouse, and I would like to subset probes according to a number of criteria, such as "promoter", "intergenic", etc. I'm not yet familiar with these packages, but I suspect they will provide all the tools I need to extract and "play" with genomic sequences.

Am I right?

Does anybody have some examples to help me get a better overview, beyond those in the vignettes?

Thanks.

Jose

--
Dr. Jose I. de las Heras
The Wellcome Trust Centre for Cell Biology
Institute for Cell & Molecular Biology
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
Email: J.delasHeras at ed.ac.uk
Phone: +44 (0)131 6513374
Fax: +44 (0)131 6507360
--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

oligo • 1.2k views

ADD COMMENT • link updated 15.6 years ago by Joern Toedling ▴ 730 • written 15.6 years ago by J.delasHeras@ed.ac.uk ★ 1.9k

score 0 · Answer 1 · 2008-09-17

Joern Toedling ▴ 730

@joern-toedling-1244

Last seen 9.6 years ago

Hello,

Biostrings and BSgenome can certainly be used to retrieve genomic sequences. For instance, here's a very basic function I have used many times to retrieve the sequence of short genome segments on either strand of budding yeast.

    getYeastSeq <- function(chr, start, end, strand="+"){
      ## one segment per call
      stopifnot(length(chr)==1, length(start)==1, length(end)==1)
      require("BSgenome.Scerevisiae.UCSC.sacCer1")
      strand <- match.arg(strand, c("+","-"))
      ## chromosome 17 is the mitochondrial genome, named "chrM" in the package
      thisSeq <- gsub("[[:space:]]","", as.character(getSeq(Scerevisiae,
                      gsub("17","M",paste("chr",chr,sep="")),
                      start=start, end=end)))
      ## getSeq was asked for the "+" strand; flip for the "-" strand
      if (strand=="-")
        thisSeq <- as.character(reverseComplement(DNAString(thisSeq)))
      return(thisSeq)
    }#getYeastSeq

    getYeastSeq(chr=2, start=200000, end=200020) ## test

Biostrings offers many utility functions to work with DNA sequences. And you can always convert the sequences into character vectors and use basic R operations on those. I'm not sure what other games you have in mind when you say "play", but I guess a more precise question, such as whether you can do XYZ with Biostrings or any other Bioconductor package, will result in a more informative answer.

Regards,
Joern

--
Joern Toedling
EMBL - European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD
United Kingdom
Phone +44(0)1223 492566
Email toedling at ebi.ac.uk
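As a rough illustration of the "convert to character vectors and use basic R operations" remark above, here is a minimal sketch in base R only. The sequence `seq_plus` and the helper names `revcomp` and `find_site` are made up for this example; in practice the string would come from something like `as.character(getSeq(...))`, and Biostrings' own `reverseComplement()` and `matchPattern()` would probably be the better tools for anything large.

```r
# Hypothetical sequence for illustration; not taken from any genome.
seq_plus <- "TTGAATTCAGGCCTAGAATTCA"

# Reverse complement using base R only: complement with chartr(), then reverse.
revcomp <- function(x) {
  comp <- chartr("ACGTacgt", "TGCAtgca", x)
  paste(rev(strsplit(comp, "", fixed = TRUE)[[1]]), collapse = "")
}

# 1-based start positions of all exact, non-overlapping matches of `site` in `x`.
find_site <- function(x, site) {
  hits <- gregexpr(site, x, fixed = TRUE)[[1]]
  if (hits[1] == -1L) integer(0) else as.integer(hits)
}

eco_ri <- "GAATTC"            # EcoRI recognition site (palindromic)
find_site(seq_plus, eco_ri)   # start positions of EcoRI sites on the "+" strand
revcomp(seq_plus)             # the same region read from the "-" strand
```

Because the EcoRI site is its own reverse complement, searching one strand is enough here; for a non-palindromic motif one would search both `seq_plus` and `revcomp(seq_plus)`.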

ADD COMMENT • link 15.6 years ago Joern Toedling ▴ 730

Hi Joern,

that was useful, thank you! I have some new homework to do now. :-)

As for what I'm after exactly... it'll be various things at various times, but I can give you one very specific example right now.

I have a human promoter array in my hands (and soon a mouse one). Each probeset covers a region of around 2.2kb upstream and 0.5kb downstream of the TSS. In reality, though, some genes have multiple TSSs; sometimes they are close together, sometimes far apart. Also, a probeset may be longer than the expected 2.7kb, for instance when two genes running in opposite directions start within a short region.

I want to dissect all this out. I want to find all the genes and all the TSSs, and create "my own" probesets (from the probes available to me on the array) based on these TSSs and covering a region that I also define myself. For example, I may choose to create probesets covering just +/-400bp around the TSS, and others perhaps covering the 1kb region located -1000 to -2000bp from the TSS. Later on I may have other requirements, depending on my findings and whatever I may be looking for.

So I need to locate the TSSs. Then, for each gene with multiple TSSs, I have to decide which ones are too close together to make any significant difference to my results, so that I can treat them as one, and which ones are far enough apart that I should treat them as distinct (different promoter regions for a single gene). I would do that based on the TSS locations (and orientation), so it seems simple enough.

Then, with those locations, I can search the array annotation and figure out which probes are located within the subareas I want. I can do that based on positions alone, but I'd like to have the actual sequences (not just the probes, but the whole region), because in some cases I am looking for particular motifs, or even something as simple as restriction sites.
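The procedure described above (merge nearby TSSs, define strand-aware windows, pick the probes inside them) can be sketched in base R as follows. Everything here is an assumption made for illustration: the tables `tss` and `probes` contain invented coordinates, the function names are hypothetical, the 500bp merge distance is arbitrary, and taking the most 5' TSS as the representative of a cluster is just one possible choice. With Bioconductor ranges objects, functions such as `reduce()` and `promoters()` in GenomicRanges would likely cover much of this.

```r
# Hypothetical TSS annotation; coordinates are invented.
tss <- data.frame(
  gene   = c("geneA", "geneA", "geneA", "geneB"),
  chr    = "chr1",
  strand = c("+", "+", "+", "-"),
  pos    = c(10000, 10150, 14000, 52000),
  stringsAsFactors = FALSE
)

# Give TSSs of the same gene the same cluster id when each lies within
# `max_gap` bp of the previous one (assumes one chromosome per gene).
cluster_tss <- function(tss, max_gap = 500) {
  stopifnot(nrow(tss) >= 1)
  tss <- tss[order(tss$gene, tss$pos), ]
  n <- nrow(tss)
  gene_changes <- tss$gene[-1] != tss$gene[-n]
  far_apart    <- diff(tss$pos) > max_gap
  tss$cluster  <- cumsum(c(TRUE, gene_changes | far_apart))
  tss
}

# Keep one TSS per cluster: the most 5' one with respect to the strand.
representative_tss <- function(clustered) {
  reps <- lapply(split(clustered, clustered$cluster), function(d) {
    i <- if (d$strand[1] == "+") which.min(d$pos) else which.max(d$pos)
    d[i, ]
  })
  do.call(rbind, reps)
}

# Genomic window for offsets `from`..`to` relative to the TSS, where
# negative offsets are upstream in the direction of transcription.
tss_window <- function(pos, strand, from, to) {
  plus <- strand == "+"
  data.frame(start = ifelse(plus, pos + from, pos - to),
             end   = ifelse(plus, pos + to,   pos - from))
}

# Probes lying entirely inside one window.
probes_in_window <- function(probes, chr, start, end) {
  probes[probes$chr == chr & probes$start >= start & probes$end <= end, ]
}

clustered <- cluster_tss(tss, max_gap = 500)
reps      <- representative_tss(clustered)
win       <- tss_window(reps$pos, reps$strand, from = -2000, to = -1000)

# Hypothetical probe table; coordinates are invented.
probes <- data.frame(id    = c("p1", "p2", "p3"),
                     chr   = "chr1",
                     start = c(8100, 9500, 53500),
                     end   = c(8149, 9549, 53549),
                     stringsAsFactors = FALSE)
probes_in_window(probes, "chr1", win$start[1], win$end[1])
```

Note that for the "-" strand gene the -2000 to -1000 window lies at higher genomic coordinates than the TSS, which is why the window function needs the strand.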
For promoter arrays this won't apply, but I also have tiling arrays for a couple of human chromosomes, and in this case I'll find it interesting to separate probesets by exons, introns... Sometimes I want to consider a region of x bp around the 5' end of the transcript and another around the 3' end.

I already have some annotation provided, but I think it's probably easier to look it up myself (from the probe locations and their given sequence), and that way create the annotation I find useful for my purposes, than to adapt whatever was given to me. Especially as it seems (on paper) a relatively simple procedure that can now be achieved entirely from R.

I will probably come up with more detailed questions once I start applying these tools to my problems.

Jose
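For the tiling-array case, a minimal base R sketch of labelling probe positions as exon or intron might look like the following. The exon table, the probe midpoints, and the function name `classify_position` are all hypothetical, and the sketch handles a single transcript only; with real annotation, an overlap function such as `findOverlaps()` from the Bioconductor ranges packages would probably be the more practical route.

```r
# Hypothetical exon coordinates of one transcript; values are invented.
exons <- data.frame(start = c(1000, 2000, 3500),
                    end   = c(1200, 2300, 4000))

# Label each position as "exon", "intron", or "outside" the transcript span.
classify_position <- function(pos, exons) {
  tx_start <- min(exons$start)
  tx_end   <- max(exons$end)
  in_exon  <- vapply(pos,
                     function(p) any(p >= exons$start & p <= exons$end),
                     logical(1))
  ifelse(pos < tx_start | pos > tx_end, "outside",
         ifelse(in_exon, "exon", "intron"))
}

probe_mid <- c(500, 1100, 1500, 2100, 3000, 3900, 4500)  # invented probe midpoints
classify_position(probe_mid, exons)
```

Using probe midpoints is a simplification; probes straddling an exon boundary would need an explicit rule.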

ADD REPLY • link 15.6 years ago J.delasHeras@ed.ac.uk ★ 1.9k