BSgenomes vs ENSEMBL


Guido Hooiveld ★ 4.0k

@guido-hooiveld-2020


Wageningen University, Wageningen, the …

Dear list,

I am a novice with genome builds and therefore have some basic questions.

My ultimate goal is to identify the exact locations in the mouse genome of several 'fixed' sequences. For example: how many times does the specific sequence "aaggggaaaaggtca", a putative transcription factor binding site, occur in the mouse genome and, more importantly, which genes are closest to each match? After searching the archive I concluded that the Biostrings and BSgenome packages can likely do what I am after:
http://thread.gmane.org/gmane.science.biology.informatics.conductor/17471

I understand that the mouse genome in BSgenome.Mmusculus.UCSC.mm9 is built from data made available by UCSC. I also noticed that the UCSC mm9 assembly is also known as NCBI Build 37. However, my co-worker always uses ENSEMBL to find information on genes, and apparently ENSEMBL uses the same assembly (i.e. NCBI m37 mouse). Therefore:
- Am I correct that UCSC and ENSEMBL use the same, identical genome assembly?
- So only the annotation of the genome differs between UCSC and ENSEMBL?
- As a result, can I use the BSgenome mm9 package to identify the genomic locations of a specific sequence, and then annotate those locations with ENSEMBL to identify the gene(s) closest to each match? And what would be the best way of doing this? biomaRt?

Thanks,
Guido

Guido Hooiveld, PhD
Nutrition, Metabolism & Genomics Group
Division of Human Nutrition
Wageningen University

Transcription Annotation BSgenome annotate Biostrings BSgenome Transcription Annotation • 2.3k views

updated 15.8 years ago by Hervé Pagès 16k • written 15.8 years ago by Guido Hooiveld ★ 4.0k


Hervé Pagès 16k

@herve-pages-1542


Seattle, WA, United States

Hi Guido,

My understanding is that UCSC generally does not assemble genomes itself but obtains them from whoever assembled them, such as NCBI in the case of hg18 (NCBI Build 36.1) or mm9 (NCBI Build 37). See this table:
http://genome.ucsc.edu/FAQ/FAQreleases#release1

The "RELEASE NAME" column tells you who assembled the genome. As you can see, all the genomes provided by UCSC were assembled by someone else (except the old genomes hg1 to hg8). If ENSEMBL states that it uses the "NCBI m37" assembly, one can be reasonably confident that this is the same assembly as mm9 from UCSC. If so, the chromosome sequences should be strictly identical.

As for the annotations, yes, I would expect them to differ between UCSC and ENSEMBL, but someone more familiar with this topic would need to confirm this.

So yes, you could in principle (1) use BSgenome.Mmusculus.UCSC.mm9 to find the locations of your short sequences and then (2) annotate them with ENSEMBL annotations. I don't know what the best way of doing (2) would be, though.

For (1) there are several options available in Biostrings, depending on the "size" of the problem (i.e. how many short sequences you need to match or align, how long they are, and how big the reference genome is) and on whether you want exact matching, a few mismatches only, or indels too:
- See pairwiseAlignment() for aligning a small number of short patterns against a small genome. It implements Smith-Waterman and Needleman-Wunsch alignment, so replacements (aka mismatches) and indels are fully supported.
- See matchPattern() for exact and inexact matching (a small number of mismatches only, no indels) of a small number of short patterns against a small or big genome.
- See matchPDict() for doing the same thing as matchPattern() (with some restrictions, though) when you have a lot (thousands or millions) of short patterns to match against a small or big genome. See this recent post on this list for some hints on how to use matchPDict():
  https://stat.ethz.ch/pipermail/bioconductor/2008-October/024629.html

Cheers,
H.

written 15.8 years ago by Hervé Pagès 16k
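To make step (1) from the answer above a little more concrete, here is a minimal sketch of what exact matching, or matching with a few mismatches and no indels, amounts to when both strands are searched. It is written in plain Python purely as an illustration and is not the Biostrings API; in R the thread's suggestion is matchPattern(). The function names and the toy "chromosome" are hypothetical. Searching the reverse complement as well is an assumption about what is wanted here, on the reasoning that a binding site can sit on either strand.

```python
# Illustrative only: plain Python, not the Biostrings API.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper- or lower-case DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_hits(pattern, subject, max_mismatch=0):
    """1-based start positions where pattern matches subject with at most
    max_mismatch substitutions (no indels)."""
    pattern, subject = pattern.upper(), subject.upper()
    width = len(pattern)
    starts = []
    for i in range(len(subject) - width + 1):
        window = subject[i:i + width]
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= max_mismatch:
            starts.append(i + 1)
    return starts


def find_both_strands(pattern, subject, max_mismatch=0):
    """Hits on the + strand, plus hits of the reverse complement (- strand)."""
    hits = [(s, "+") for s in find_hits(pattern, subject, max_mismatch)]
    rc = reverse_complement(pattern)
    if rc != pattern.upper():  # a palindromic site would otherwise be counted twice
        hits += [(s, "-") for s in find_hits(rc, subject, max_mismatch)]
    return sorted(hits)


site = "aaggggaaaaggtca"
# Hypothetical toy "chromosome": one forward copy and one reverse-complement copy.
toy_chromosome = "TTT" + "AAGGGGAAAAGGTCA" + "CCCC" + "TGACCTTTTCCCCTT" + "GG"
hits = find_both_strands(site, toy_chromosome)
print(hits)
```

The sliding-window scan is quadratic and only suitable for toy input; for 18 sites against a whole genome the dedicated Biostrings functions named above are the sensible route.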


Thanks Herve and Sean for answering so quickly. However, before actually starting there is one thing I would like to know. The vignette of the BSgenome package says:

[quote]
5 Masking the chromosome sequences
Starting with Bioconductor 2.2, some BSgenome data packages provide built-in masks for the chromosome sequences. For example, each chromosome in BSgenome.Hsapiens.UCSC.hg18 has 3 masks on it: the mask of assembly gaps, the mask of repeat regions that were determined by the RepeatMasker software, and the mask of repeat regions that were determined by the Tandem Repeats Finder software (where only repeats with period less than or equal to 12 were kept).
[/quote]

Therefore, when doing a genome-wide scan, is it best to use the UNMASKED sequences (the default), or would enabling masking give better, more biologically relevant results? Again, I have a set of 18 (exact) sequences of 15-17 bp (putative TF binding sites) whose locations in the mouse genome I would like to find.

Thanks,
Guido

written 15.8 years ago by Guido Hooiveld ★ 4.0k


On Sat, Oct 18, 2008 at 3:48 PM, Hooiveld, Guido <guido.hooiveld at wur.nl> wrote:
> Therefore, when doing a genome-wide scan, is it best to use the UNMASKED
> sequences (=default), or would enabling masking provide better, more
> biologically-relevant results? Again, I have a set of 18 (exact)
> sequences of 15-17bp [=putative TF binding sites], of which I would like
> to find their location in the mouse genome.

It depends ENTIRELY on the question you are trying to answer, what you are going to do with the locations, the experimental design, etc. There is no general "right" answer.

Sean

written 15.8 years ago by Sean Davis 21k
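Since the choice between masked and unmasked scanning depends on the question being asked, one way to postpone that choice is to scan the unmasked sequence and record, for each hit, whether it falls in a masked region. The sketch below is plain Python with hypothetical coordinates, not the BSgenome masking API. It treats a hit as masked if it overlaps a masked interval at all, which is an assumption about how strictly one wants to filter.

```python
def overlaps_any(start, end, intervals):
    """True if the closed interval [start, end] overlaps any (m_start, m_end)."""
    return any(start <= m_end and end >= m_start for m_start, m_end in intervals)


# Hypothetical hits (start, end) on one chromosome, 1-based and inclusive.
hits = [(120, 134), (560, 574), (900, 914)]
# Hypothetical repeat-masked intervals on the same chromosome.
repeat_mask = [(500, 600), (910, 950)]

# Keep every hit, but remember whether it touches a masked region.
flagged = [(start, end, overlaps_any(start, end, repeat_mask)) for start, end in hits]

# The "masked scan" view can then be derived later without rescanning.
unmasked_only = [(start, end) for start, end, in_mask in flagged if not in_mask]
print(flagged)
print(unmasked_only)
```

Keeping the flag rather than discarding hits up front means both views of the results remain available when it comes to interpreting them.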


Sean Davis 21k

@sean-davis-490


United States

On Fri, Oct 17, 2008 at 5:16 PM, Hooiveld, Guido <guido.hooiveld at wur.nl> wrote:
> - Am i correct; in other words, UCSC and ENSEMBL use the same, identical
> genome assembly?
> - Thus only the annotation of the genome differs between UCSC and
> ENSEMBL?
> - As a result, I can use the BSgenome mm9 package to identify the locations
> at the genome of a specific sequence, which I then can annotate using
> ENSEMBL to identify the gene(s) that are closest to a match? And what
> would be the best way of doing this? BiomaRt?

Everything you said above is correct. And biomaRt would be a good choice.

Sean

written 15.8 years ago by Sean Davis 21k
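Once hit coordinates and a gene table are in hand, the "closest gene" step is a small interval computation. The following is a minimal sketch in plain Python, not biomaRt code. The gene identifiers and coordinates are hypothetical, and distance is measured to the gene body (0 when the hit overlaps the gene), which is an assumption; measuring to the transcription start site would be an equally reasonable choice. One practical detail it illustrates is chromosome naming: UCSC uses names such as "chr1" and "chrM", whereas Ensembl uses "1" and "MT", so names need converting before the two sources are compared.

```python
def ucsc_to_ensembl(chrom):
    """Convert a UCSC-style chromosome name to Ensembl style."""
    if chrom == "chrM":
        return "MT"
    return chrom[3:] if chrom.startswith("chr") else chrom


def gap(hit_start, hit_end, gene_start, gene_end):
    """Distance in bp between two closed intervals; 0 if they overlap."""
    if hit_end < gene_start:
        return gene_start - hit_end
    if gene_end < hit_start:
        return hit_start - gene_end
    return 0


def nearest_gene(hit, genes):
    """Return (gene_id, distance) for the closest gene on the same
    chromosome, or None if that chromosome has no genes in the table."""
    chrom = ucsc_to_ensembl(hit["chrom"])
    same_chrom = [g for g in genes if g["chrom"] == chrom]
    if not same_chrom:
        return None
    best = min(
        same_chrom,
        key=lambda g: gap(hit["start"], hit["end"], g["start"], g["end"]),
    )
    return best["id"], gap(hit["start"], hit["end"], best["start"], best["end"])


# Hypothetical gene table with Ensembl-style chromosome names.
genes = [
    {"id": "geneA", "chrom": "1", "start": 2000, "end": 5000},
    {"id": "geneB", "chrom": "1", "start": 100, "end": 400},
    {"id": "geneC", "chrom": "2", "start": 900, "end": 1100},
]
# Hypothetical hit with a UCSC-style chromosome name.
hit = {"chrom": "chr1", "start": 1000, "end": 1014}
print(nearest_gene(hit, genes))
```

For this to be meaningful with real data, the gene coordinates must come from an Ensembl release built on the same assembly (NCBI m37) as the genome that was scanned; which release that is should be checked rather than assumed.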
