chromosome name match among vcf, txdb, BSgenome


sun ▴ 100


Hi All,

I am going to use "coding <- predictCoding(vcf, txdb, seqSource=Athaliana)" to detect coding SNPs. The problem is that the chromosome names are not consistent among the VCF, txdb and BSgenome objects. In the VCF, the chromosome names are "Chr*"; in the txdb, they are "Chr"; but in the BSgenome, they are "chr*".

I know I can use renameSeqlevels() to adjust the seqlevels (chromosome names) of the VCF object to match those of the txdb annotation. But how can I adjust the chromosome names of the BSgenome or TranscriptDb object?

Thanks,
Rebecca

Annotation BSgenome • 2.3k views

ADD COMMENT • link updated 12.2 years ago by Hervé Pagès 16k • written 12.2 years ago by sun ▴ 100


Tim Triche ★ 4.2k


Don't forget SNPlocs, where it is 'ch' :-D

ADD COMMENT • link 12.2 years ago Tim Triche ★ 4.2k


Hervé Pagès 16k


Hi Rebecca,

In BioC 2.11 (released yesterday), you can rename the chromosomes of a TranscriptDb object, so you could rename the chromosomes of your VCF and TranscriptDb objects to match the names of the BSgenome object.

E.g. for the TranscriptDb object:

  seqlevels(txdb) <- sub("^C", "c", seqlevels(txdb))

Note that renaming the chromosomes of a TranscriptDb object is a new feature and is not fully implemented yet. For example, if you use select() on the object you'll still get the original names (those stored in the db), and if you try to specify a chromosome name thru the 'vals' arg of the transcripts(), exons() and cds() extractors, you still need to use the original names. This will be addressed soon.

Our plan is to also support renaming of the chromosomes of BSgenome and SNPlocs objects very soon.

Also, an additional level of convenience will be provided via the seqnameStyle() getter and setter, so you'll be able to quickly rename with something like:

  seqnameStyle(x) <- "UCSC"

or

  seqnameStyle(vcf) <- seqnameStyle(txdb) <- seqnameStyle(genome)

This will work on almost any 'x' object that contains chromosome names (GRanges, GRangesList, GappedAlignments, TranscriptDb, VCF, BSgenome, SNPlocs, etc...)

Cheers,
H.
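To illustrate the renaming idea on plain character vectors, here is a minimal base-R sketch. The chromosome names below are hypothetical stand-ins modelled on the question (they are not read from any real VCF, TranscriptDb or BSgenome object), and to_genome_style() is a made-up helper name. It lower-cases only the leading "C" so that the names follow the "chr*" style reported for the BSgenome object.

```r
# Hypothetical seqlevels modelled on the question; with real objects these
# would come from seqlevels(vcf), seqlevels(txdb) and the BSgenome object.
vcf_levels    <- c("Chr1", "Chr2", "Chr3", "Chr4", "Chr5")
txdb_levels   <- c("Chr1", "Chr2", "Chr3", "Chr4", "Chr5", "ChrC", "ChrM")
genome_levels <- c("chr1", "chr2", "chr3", "chr4", "chr5", "chrC", "chrM")

# Hypothetical helper: lower-case only the leading "C", so "ChrC" becomes
# "chrC" rather than "chrc".
to_genome_style <- function(x) sub("^C", "c", x)

vcf_renamed  <- to_genome_style(vcf_levels)
txdb_renamed <- to_genome_style(txdb_levels)

# Sanity check before assigning anything back: every renamed level should
# exist among the genome's names.
print(all(vcf_renamed %in% genome_levels))
print(all(txdb_renamed %in% genome_levels))
```

On a real TranscriptDb object, the same helper could presumably be applied through the setter shown in the reply, i.e. seqlevels(txdb) <- to_genome_style(seqlevels(txdb)). Checking membership first makes a one-off regular expression less likely to silently produce names that still do not match.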

ADD COMMENT • link 12.2 years ago Hervé Pagès 16k


This is a terrific addition, thanks so much Herve for implementing it.

ADD REPLY • link 12.2 years ago Tim Triche ★ 4.2k


On 10/04/2012 03:13 PM, Tim Triche, Jr. wrote:
> This is a terrific addition, thanks so much Herve for implementing it.

Glad you like it Tim. Thanks!

H.

ADD REPLY • link 12.2 years ago Hervé Pagès 16k


This sounds awesome.

For this, it may be worthwhile to be able to specify it universally, through an option or something. I expect most of us will choose one style and stick to it.

I hope style will also imply an ordering chr1 < ... < chr10
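The ordering chr1 < ... < chr10 mentioned above can be sketched in base R as follows. This is only an illustration under stated assumptions: natural_chrom_order() is a hypothetical helper (not a Bioconductor function), the input names are made up, and the sketch assumes a "chr" prefix with arabic numbering, placing X and Y after the numbered chromosomes and M/MT last.

```r
# Hypothetical chromosome names in arbitrary order.
chroms <- c("chr10", "chr2", "chrM", "chr1", "chrX", "chrY", "chr21")

# Hypothetical helper returning an ordering permutation:
# numbered chromosomes first (numerically), then X, Y, M/MT, then the rest.
natural_chrom_order <- function(x, prefix = "^chr") {
  suffix <- sub(prefix, "", x, ignore.case = TRUE)
  is_num <- grepl("^[0-9]+$", suffix)
  num <- rep(NA_integer_, length(suffix))
  num[is_num] <- as.integer(suffix[is_num])
  # Position of the non-numeric suffix among the known special chromosomes.
  special <- match(toupper(suffix), c("X", "Y", "M", "MT"))
  group <- ifelse(is_num, 1L, ifelse(!is.na(special), 2L, 3L))
  key <- ifelse(is_num, num, ifelse(!is.na(special), special, 0L))
  # Ties (e.g. unknown scaffolds) fall back to alphabetical order.
  order(group, key, suffix)
}

sorted <- chroms[natural_chrom_order(chroms)]
print(sorted)
# chr1 chr2 chr10 chr21 chrX chrY chrM
```

Note that the sketch does not handle roman-numeral naming (chrI, chrII, ...), where "X" would be a numbered chromosome rather than a sex chromosome.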

ADD REPLY • link 12.2 years ago Kasper Daniel Hansen ★ 6.5k


Hi Kasper,

On 10/04/2012 02:40 PM, Kasper Daniel Hansen wrote:
> This sounds awesome.
>
> For this, it may be worthwhile to be able to specify it universally,
> through an option or something. I expect most of us will choose one
> style and stick to it.

Yes why not, sounds worth exploring. For the purpose of troubleshooting and code sharing thru the mailing list, people will need to remember to show us what global settings they have though (unfortunately, this doesn't show up in the sessionInfo()). Nothing new here, we already have this situation with the stringsAsFactors=FALSE global option that alters the semantics of some functions (fortunately, very few people seem to be using this non-default setting). Some people argue that global options should not alter the value returned by core functions, but only affect cosmetic things like the width of the display, the primary & secondary prompt character, etc... and I tend to agree with that.

> I hope style will also imply an ordering chr1 < ... < chr10

We are not planning to support re-ordering of the chromosomes for things like TranscriptDb or BSgenome objects at the moment but we are trying to make sure that those objects are generated with the "main chromosomes" always coming first and in the natural order i.e.

  chr1 < chr2 < ... < chr10
  chrI < chrII < ... < chrX

followed by the sex chromosomes (which should not be present if roman numbers are in use), followed by the mitochondrial chromosome if any. After that, it's a mess: there is a bunch of stuff that varies from one genome build to the other (see hg18 vs hg19), even from one genome provider to the other for the *same* genome build. See for example hg19 vs GRCh37.p10 where, except for chrM, GRCh37.p10 is a superset of hg19 but they use very different naming conventions so it's hard to map the sequence names between the 2 assemblies.

But would there be much to be gained? The typical situation where inconsistent naming styles are hurting the user is when a binary operation like findOverlaps() requires that the 2 input objects are based on the same reference genome. Right now, findOverlaps() rejects the objects if they don't use the same naming style, but if they do, and if the chromosome lengths stored in each object are the same (when they are stored of course, which is not required), then it works as expected. The chromosomes don't need to be stored in the same order.

BTW, it's worth mentioning that comparing the chromosome names and lengths is not a guarantee that the chromosomes are coming from the same reference genome. The chromosomes can have the same name and length, but come from 2 different assemblies, have different DNA sequences, and therefore the annotations provided for the 2 assemblies are different. Let's keep in mind that, by making it easy to alter chromosome names, we also make it easy for the user to pass objects that are not based on the same reference genomes to tools like findOverlaps(), and to silently get a result that doesn't make sense.

Finally, and FWIW, there is a universal/unique ID for DNA sequences, which is the RefSeq ID. Storing this ID (as an additional field) in the little Seqinfo table contained in our objects sounds like maybe it could help? I don't know how many objects would actually end up with IDs instead of NAs in that field, but, for example, it would not be too hard to add those IDs to the BSgenome and SNPlocs packages we maintain. Not so sure for other objects like TranscriptDb objects or GappedAlignments objects though...

Cheers,
H.
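The kind of name-and-length comparison described above can be approximated on plain named vectors. The following base-R sketch is a rough analogue only, not the check that findOverlaps() actually performs: check_seq_compat() is a hypothetical helper, the lengths are invented, and NA stands for a length that is not stored.

```r
# Hypothetical chromosome lengths; NA_integer_ means "length not stored".
# With real objects these vectors would come from seqlengths(x).
reads_lengths <- c(chr1 = 1000L, chr2 = 800L, chrM = 150L)
genes_lengths <- c(chrM = 150L, chr2 = NA_integer_, chr1 = 1000L)

# Hypothetical helper, not part of any Bioconductor package.
check_seq_compat <- function(a, b) {
  shared <- intersect(names(a), names(b))
  if (length(shared) == 0L) {
    return(list(ok = FALSE, reason = "no chromosome names in common"))
  }
  # Compare lengths by name, so storage order does not matter.
  la <- a[shared]
  lb <- b[shared]
  both_known <- !is.na(la) & !is.na(lb)
  conflicts <- shared[both_known & la != lb]
  if (length(conflicts) > 0L) {
    return(list(ok = FALSE,
                reason = paste("length mismatch on:",
                               paste(conflicts, collapse = ", "))))
  }
  list(ok = TRUE, reason = "names and known lengths agree")
}

res_ok    <- check_seq_compat(reads_lengths, genes_lengths)       # same style
res_style <- check_seq_compat(c(Chr1 = 1000L), c(chr1 = 1000L))  # Chr vs chr
res_len   <- check_seq_compat(c(chr1 = 1000L), c(chr1 = 999L))   # lengths differ

print(res_ok$reason)
print(res_style$reason)
print(res_len$reason)
```

As the reply points out, passing a check like this does not prove that the two objects are based on the same assembly; it only catches mismatched naming styles and conflicting lengths.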

ADD REPLY • link 12.2 years ago Hervé Pagès 16k
