variantAnnotation: alternative GENETIC_CODE, and circular chromosomes?

0

Entering edit mode

Janet Young ▴ 740

@janet-young-2360

Last seen 6.3 years ago

Fred Hutchinson Cancer Research Center,…

Hi there, (I think it'll probably be Valerie looking at this question - hi Valerie), I'm just beginning to look at using VariantAnnotation to annotate some SNPs I've called on some yeast data (sacCer3). I can see this will be a really useful package for me - thanks! I can see that chrM (mitochiondrial) SNPs are currently not included in the output of predictCoding, and then using locateVariants, all of chrM SNPs get annotated as intergenic/NA (with a warning, that we ignore circular chromosomes). I can understand why that is - circular chromosomes, and a different genetic code make it trickier. Fair enough. I'm wondering what the prospects are regarding chrM SNPs in the future - any plans to include those later? I'm also wondering whether I can use some hacks to get chrM SNPs annotated. Two questions/potential issues related to that I wanted to ask you guys about: 1. are alternative codon tables already supported anywhere in Bioconductor? Using "?GENETIC_CODE" it looks like this is defined in Biostrings, and it looks like only the standard nuclear code is defined. Are the various alternative genetic codes defined anywhere? For this project, I'm interested in the yeast mitochondrial code, and for another I'm interested in the fly mitochondrial code. It'd be great if we could have all the codes available (I've got another project looking at ciliate nuclear sequences, for example - not working with translations yet, but maybe later...) With a little work, I'll be able to save flat files from NCBI (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi), and read those in and transform them to a character vector that looks like GENETIC_CODE. But I realise it might be something useful to have encoded more centrally, so thought I'd ask. 2. What issues should I think about for the circular chromosomes? I'm thinking of a slightly hacky solution where I ignore any annotated ORFs that wrap around from the end of the chromosome to the beginning, and then just treating it as a linear chromosome. Actually, in my case (using sacCer3) there are no ORFs spanning the break in the circular chromosome, so I don't think I'll miss any annotations. Turns out the same is true for human (hg19 knownGene annotations), so maybe the circular chromosome issue isn't such a big issue after all? It seems like that should work, but any thoughts from you - you've thought about these questions a lot more than I have? Looking forward to hearing any thoughts you have. I know sometimes people just ignore the chrM SNPs, but it'd be nice to take a slightly more comprehensive approach if possible. thanks in advance for any input you have, Janet ------------------------------------------------------------------- Dr. Janet Young Malik lab http://research.fhcrc.org/malik/en.html Fred Hutchinson Cancer Research Center 1100 Fairview Avenue N., A2-025, P.O. Box 19024, Seattle, WA 98109-1024, USA. tel: (206) 667 4512 email: jayoung ...at... fhcrc.org ------------------------------------------------------------------- [[alternative HTML version deleted]]

VariantAnnotation Cancer Yeast annotate Biostrings VariantAnnotation VariantAnnotation • 2.8k views

ADD COMMENT • link updated 12.0 years ago by Hervé Pagès 16k • written 12.0 years ago by Janet Young ▴ 740

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 14 hours ago

Seattle, WA, United States

Hi Janet, On 02/03/2014 07:47 PM, Janet Young wrote: > Hi there, (I think it'll probably be Valerie looking at this question - hi Valerie), > > I'm just beginning to look at using VariantAnnotation to annotate some SNPs I've called on some yeast data (sacCer3). I can see this will be a really useful package for me - thanks! > > I can see that chrM (mitochiondrial) SNPs are currently not included in the output of predictCoding, and then using locateVariants, all of chrM SNPs get annotated as intergenic/NA (with a warning, that we ignore circular chromosomes). I can understand why that is - circular chromosomes, and a different genetic code make it trickier. Fair enough. > > I'm wondering what the prospects are regarding chrM SNPs in the future - any plans to include those later? > > I'm also wondering whether I can use some hacks to get chrM SNPs annotated. Two questions/potential issues related to that I wanted to ask you guys about: > > 1. are alternative codon tables already supported anywhere in Bioconductor? Using "?GENETIC_CODE" it looks like this is defined in Biostrings, and it looks like only the standard nuclear code is defined. Are the various alternative genetic codes defined anywhere? For this project, I'm interested in the yeast mitochondrial code, and for another I'm interested in the fly mitochondrial code. It'd be great if we could have all the codes available (I've got another project looking at ciliate nuclear sequences, for example - not working with translations yet, but maybe later...) > > With a little work, I'll be able to save flat files from NCBI (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi), and read those in and transform them to a character vector that looks like GENETIC_CODE. But I realise it might be something useful to have encoded more centrally, so thought I'd ask. What a timely question! I'll let Val answer the questions about support of mitochiondrial DNA in predictCoding() but I can answer that particular one. Last week I added a bunch of non standard genetic codes to Biostrings (2.31.12). To get the genetic code for Yeast Mitochondrial, do: > getGeneticCode("SGC2") TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT CTC CTA CTG "F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "W" "W" "T" "T" "T" "T" CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG ACT ACC ACA ACG "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "M" "M" "T" "T" "T" "T" AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG GAT GAC GAA GAG "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" "A" "A" "D" "D" "E" "E" GGT GGC GGA GGG "G" "G" "G" "G" Its format is the same as for GENETIC_CODE. See ?GENETIC_CODE for the details. I also added the 'genetic.code' arg to translate() so you can supply an alternate genetic code to use for translation. See ?translate for the details. Please let me know if you find any issues, have questions, or want to suggest improvements to these new features. Thanks, H. > > 2. What issues should I think about for the circular chromosomes? I'm thinking of a slightly hacky solution where I ignore any annotated ORFs that wrap around from the end of the chromosome to the beginning, and then just treating it as a linear chromosome. Actually, in my case (using sacCer3) there are no ORFs spanning the break in the circular chromosome, so I don't think I'll miss any annotations. Turns out the same is true for human (hg19 knownGene annotations), so maybe the circular chromosome issue isn't such a big issue after all? > > It seems like that should work, but any thoughts from you - you've thought about these questions a lot more than I have? > > Looking forward to hearing any thoughts you have. I know sometimes people just ignore the chrM SNPs, but it'd be nice to take a slightly more comprehensive approach if possible. > > thanks in advance for any input you have, > > Janet > > > ------------------------------------------------------------------- > > Dr. Janet Young > > Malik lab > http://research.fhcrc.org/malik/en.html > > Fred Hutchinson Cancer Research Center > 1100 Fairview Avenue N., A2-025, > P.O. Box 19024, Seattle, WA 98109-1024, USA. > > tel: (206) 667 4512 > email: jayoung ...at... fhcrc.org > > ------------------------------------------------------------------- > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 12.0 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Janet, Last week we enabled findOverlaps(..., type='within') to work on circular chromosomes. It was this restriction that prevented locateVariants() and predictCoding() from handling ChrM. VariantAnnotation 1.9.34 in devel has the most recent changes. The output of locateVariants() includes ChrM because the function is reporting where the range falls wrt the gene (coding, utr, intron, etc.). predictCoding() however only reports coding variants. If the annotation you supply does not have any coding regions for ChrM or if the ranges don't fall in the coding regions then none will be reported. To confirm your annotation has coding regions for ChrM: cds <- cdsBy(txdb) cds[seqnames(cds) %in% "chrM"] ## or whatever the proper name is #1: I've added support for the 'genetic.code' and 'if.fuzzy.codon' args to translate(). To use a different genetic code just pass the named arg ('genetic.code') to predictCoding(). #2: We've tried to handle the issue of ChrM through making findOverlaps() behave appropriatly with circular chromosomes. The 'type' argument has several options (start, end, any, within, equal). This allows quite a bit of flexibility. When the annotation has an ORF that spans the start/end findOverlaps() will still behave appropriately according to 'type'. ## annotation with seqlength 9 genes <- GRanges(seqnames=rep.int("A", 4), IRanges(start=c(2, 4, 6, 8), width=3)) seqinfo(genes) <- Seqinfo(seqnames="A", seqlengths=9, isCircular=TRUE) ## both ranges span the start/end ranges <- GRanges(seqnames=rep.int("A", 2), IRanges(9, width=c(2, 4))) >> findOverlaps(ranges, genes, type="any") > Hits of length 3 > queryLength: 2 > subjectLength: 4 > queryHits subjectHits > <integer> <integer> > 1 1 4 > 2 2 1 > 3 2 4 >> findOverlaps(ranges, genes, type="within") > Hits of length 1 > queryLength: 2 > subjectLength: 4 > queryHits subjectHits > <integer> <integer> > 1 1 4 With the combination of findOverlaps() now working on circular chromosomes for all values of 'type' and Herve adding the new genetic codes to Biostrings there should be no need to ignore ChrM. If you run into trouble or if anything look strange please let us know. Thanks. Valerie On 02/04/2014 04:08 AM, Hervé Pagès wrote: > Hi Janet, > > On 02/03/2014 07:47 PM, Janet Young wrote: >> Hi there, (I think it'll probably be Valerie looking at this question >> - hi Valerie), >> >> I'm just beginning to look at using VariantAnnotation to annotate some >> SNPs I've called on some yeast data (sacCer3). I can see this will be >> a really useful package for me - thanks! >> >> I can see that chrM (mitochiondrial) SNPs are currently not included >> in the output of predictCoding, and then using locateVariants, all of >> chrM SNPs get annotated as intergenic/NA (with a warning, that we >> ignore circular chromosomes). I can understand why that is - circular >> chromosomes, and a different genetic code make it trickier. Fair enough. >> >> I'm wondering what the prospects are regarding chrM SNPs in the future >> - any plans to include those later? >> >> I'm also wondering whether I can use some hacks to get chrM SNPs >> annotated. Two questions/potential issues related to that I wanted to >> ask you guys about: >> >> 1. are alternative codon tables already supported anywhere in >> Bioconductor? Using "?GENETIC_CODE" it looks like this is defined >> in Biostrings, and it looks like only the standard nuclear code is >> defined. Are the various alternative genetic codes defined anywhere? >> For this project, I'm interested in the yeast mitochondrial code, and >> for another I'm interested in the fly mitochondrial code. It'd be >> great if we could have all the codes available (I've got another >> project looking at ciliate nuclear sequences, for example - not >> working with translations yet, but maybe later...) >> >> With a little work, I'll be able to save flat files from NCBI >> (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi), and read >> those in and transform them to a character vector that looks like >> GENETIC_CODE. But I realise it might be something useful to have >> encoded more centrally, so thought I'd ask. > > What a timely question! I'll let Val answer the questions about > support of mitochiondrial DNA in predictCoding() but I can answer > that particular one. Last week I added a bunch of non standard genetic > codes to Biostrings (2.31.12). To get the genetic code for Yeast > Mitochondrial, do: > > > getGeneticCode("SGC2") > TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT > CTC CTA CTG > "F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "W" "W" "T" > "T" "T" "T" > CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG ACT > ACC ACA ACG > "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "M" "M" "T" > "T" "T" "T" > AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG GAT > GAC GAA GAG > "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" "A" "A" "D" > "D" "E" "E" > GGT GGC GGA GGG > "G" "G" "G" "G" > > Its format is the same as for GENETIC_CODE. See ?GENETIC_CODE for > the details. > > I also added the 'genetic.code' arg to translate() so you can supply > an alternate genetic code to use for translation. See ?translate for > the details. > > Please let me know if you find any issues, have questions, or want > to suggest improvements to these new features. > > Thanks, > H. > >> >> 2. What issues should I think about for the circular chromosomes? >> I'm thinking of a slightly hacky solution where I ignore any >> annotated ORFs that wrap around from the end of the chromosome to the >> beginning, and then just treating it as a linear chromosome. >> Actually, in my case (using sacCer3) there are no ORFs spanning the >> break in the circular chromosome, so I don't think I'll miss any >> annotations. Turns out the same is true for human (hg19 knownGene >> annotations), so maybe the circular chromosome issue isn't such a big >> issue after all? >> >> It seems like that should work, but any thoughts from you - you've >> thought about these questions a lot more than I have? >> >> Looking forward to hearing any thoughts you have. I know sometimes >> people just ignore the chrM SNPs, but it'd be nice to take a slightly >> more comprehensive approach if possible. >> >> thanks in advance for any input you have, >> >> Janet >> >> >> ------------------------------------------------------------------- >> >> Dr. Janet Young >> >> Malik lab >> http://research.fhcrc.org/malik/en.html >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Avenue N., A2-025, >> P.O. Box 19024, Seattle, WA 98109-1024, USA. >> >> tel: (206) 667 4512 >> email: jayoung ...at... fhcrc.org >> >> ------------------------------------------------------------------- >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > -- Valerie Obenchain Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B155 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: vobencha at fhcrc.org Phone: (206) 667-3158 Fax: (206) 667-1319

ADD REPLY • link 12.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Another thing I forgot to mention ... If you're getting into variant work you may find ensemblVEP useful (also VariantTools). The ensemblVEP package wraps the Ensembl Variant Effect Predictor and allows detailed annotation of variants. You do need to install this perl script locally: http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html I've found the installation straightforward and the VEP is a great resource. Valerie On 02/04/2014 11:38 AM, Valerie Obenchain wrote: > Hi Janet, > > Last week we enabled findOverlaps(..., type='within') to work on > circular chromosomes. It was this restriction that prevented > locateVariants() and predictCoding() from handling ChrM. > VariantAnnotation 1.9.34 in devel has the most recent changes. > > The output of locateVariants() includes ChrM because the function is > reporting where the range falls wrt the gene (coding, utr, intron, > etc.). predictCoding() however only reports coding variants. If the > annotation you supply does not have any coding regions for ChrM or if > the ranges don't fall in the coding regions then none will be reported. > To confirm your annotation has coding regions for ChrM: > > cds <- cdsBy(txdb) > cds[seqnames(cds) %in% "chrM"] ## or whatever the proper name is > > #1: > I've added support for the 'genetic.code' and 'if.fuzzy.codon' args to > translate(). To use a different genetic code just pass the named arg > ('genetic.code') to predictCoding(). > > #2: > We've tried to handle the issue of ChrM through making findOverlaps() > behave appropriatly with circular chromosomes. The 'type' argument has > several options (start, end, any, within, equal). This allows quite a > bit of flexibility. When the annotation has an ORF that spans the > start/end findOverlaps() will still behave appropriately according to > 'type'. > > ## annotation with seqlength 9 > genes <- GRanges(seqnames=rep.int("A", 4), > IRanges(start=c(2, 4, 6, 8), width=3)) > seqinfo(genes) <- Seqinfo(seqnames="A", seqlengths=9, isCircular=TRUE) > > ## both ranges span the start/end > ranges <- GRanges(seqnames=rep.int("A", 2), > IRanges(9, width=c(2, 4))) > >>> findOverlaps(ranges, genes, type="any") >> Hits of length 3 >> queryLength: 2 >> subjectLength: 4 >> queryHits subjectHits >> <integer> <integer> >> 1 1 4 >> 2 2 1 >> 3 2 4 > >>> findOverlaps(ranges, genes, type="within") >> Hits of length 1 >> queryLength: 2 >> subjectLength: 4 >> queryHits subjectHits >> <integer> <integer> >> 1 1 4 > > With the combination of findOverlaps() now working on circular > chromosomes for all values of 'type' and Herve adding the new genetic > codes to Biostrings there should be no need to ignore ChrM. If you run > into trouble or if anything look strange please let us know. > > Thanks. > Valerie > > > > On 02/04/2014 04:08 AM, Hervé Pagès wrote: >> Hi Janet, >> >> On 02/03/2014 07:47 PM, Janet Young wrote: >>> Hi there, (I think it'll probably be Valerie looking at this question >>> - hi Valerie), >>> >>> I'm just beginning to look at using VariantAnnotation to annotate some >>> SNPs I've called on some yeast data (sacCer3). I can see this will be >>> a really useful package for me - thanks! >>> >>> I can see that chrM (mitochiondrial) SNPs are currently not included >>> in the output of predictCoding, and then using locateVariants, all of >>> chrM SNPs get annotated as intergenic/NA (with a warning, that we >>> ignore circular chromosomes). I can understand why that is - circular >>> chromosomes, and a different genetic code make it trickier. Fair >>> enough. >>> >>> I'm wondering what the prospects are regarding chrM SNPs in the future >>> - any plans to include those later? >>> >>> I'm also wondering whether I can use some hacks to get chrM SNPs >>> annotated. Two questions/potential issues related to that I wanted to >>> ask you guys about: >>> >>> 1. are alternative codon tables already supported anywhere in >>> Bioconductor? Using "?GENETIC_CODE" it looks like this is defined >>> in Biostrings, and it looks like only the standard nuclear code is >>> defined. Are the various alternative genetic codes defined anywhere? >>> For this project, I'm interested in the yeast mitochondrial code, and >>> for another I'm interested in the fly mitochondrial code. It'd be >>> great if we could have all the codes available (I've got another >>> project looking at ciliate nuclear sequences, for example - not >>> working with translations yet, but maybe later...) >>> >>> With a little work, I'll be able to save flat files from NCBI >>> (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi), and read >>> those in and transform them to a character vector that looks like >>> GENETIC_CODE. But I realise it might be something useful to have >>> encoded more centrally, so thought I'd ask. >> >> What a timely question! I'll let Val answer the questions about >> support of mitochiondrial DNA in predictCoding() but I can answer >> that particular one. Last week I added a bunch of non standard genetic >> codes to Biostrings (2.31.12). To get the genetic code for Yeast >> Mitochondrial, do: >> >> > getGeneticCode("SGC2") >> TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT >> CTC CTA CTG >> "F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "W" "W" "T" >> "T" "T" "T" >> CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG ACT >> ACC ACA ACG >> "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "M" "M" "T" >> "T" "T" "T" >> AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG GAT >> GAC GAA GAG >> "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" "A" "A" "D" >> "D" "E" "E" >> GGT GGC GGA GGG >> "G" "G" "G" "G" >> >> Its format is the same as for GENETIC_CODE. See ?GENETIC_CODE for >> the details. >> >> I also added the 'genetic.code' arg to translate() so you can supply >> an alternate genetic code to use for translation. See ?translate for >> the details. >> >> Please let me know if you find any issues, have questions, or want >> to suggest improvements to these new features. >> >> Thanks, >> H. >> >>> >>> 2. What issues should I think about for the circular chromosomes? >>> I'm thinking of a slightly hacky solution where I ignore any >>> annotated ORFs that wrap around from the end of the chromosome to the >>> beginning, and then just treating it as a linear chromosome. >>> Actually, in my case (using sacCer3) there are no ORFs spanning the >>> break in the circular chromosome, so I don't think I'll miss any >>> annotations. Turns out the same is true for human (hg19 knownGene >>> annotations), so maybe the circular chromosome issue isn't such a big >>> issue after all? >>> >>> It seems like that should work, but any thoughts from you - you've >>> thought about these questions a lot more than I have? >>> >>> Looking forward to hearing any thoughts you have. I know sometimes >>> people just ignore the chrM SNPs, but it'd be nice to take a slightly >>> more comprehensive approach if possible. >>> >>> thanks in advance for any input you have, >>> >>> Janet >>> >>> >>> ------------------------------------------------------------------- >>> >>> Dr. Janet Young >>> >>> Malik lab >>> http://research.fhcrc.org/malik/en.html >>> >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Avenue N., A2-025, >>> P.O. Box 19024, Seattle, WA 98109-1024, USA. >>> >>> tel: (206) 667 4512 >>> email: jayoung ...at... fhcrc.org >>> >>> ------------------------------------------------------------------- >>> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> > > -- Valerie Obenchain Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B155 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: vobencha at fhcrc.org Phone: (206) 667-3158 Fax: (206) 667-1319

ADD REPLY • link 12.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie, Thanks for both of these emails. It's very nice for me that you have implemented these changes already! It looks like I'll have to wait until tomorrow for version 1.9.34 of VariantAnnotation - I'll definitely try it then. I'm also taking a look at ensemblVEP now - thanks for the tip. I actually began this whole mission last week by trying to use the VEP script on my vcf files from the command line. The script (and the bioconductor wrapper) works fine for me on test human data. I love all the extra annotations that are available via VEP for human (SIFT, polyphen). I'd previously run into problems with my yeast data, due to the vagaries of differences in chromosome naming at Ensembl and elsewhere (my yeast SNP mappings all refer to chrI, chrII, chrM etc, rather than Ensembl's I, II and Mito). That's one reason I thought it might be easier to use VariantAnnotation. Anyway, I'll bite the bullet and will try hacking my input vcf file to rename the chromosomes as Ensembl wants them. These differences in chromosome naming will drive me batty, I think.... Janet On Feb 4, 2014, at 11:43 AM, Valerie Obenchain wrote: > Another thing I forgot to mention ... > > If you're getting into variant work you may find ensemblVEP useful (also VariantTools). The ensemblVEP package wraps the Ensembl Variant Effect Predictor and allows detailed annotation of variants. You do need to install this perl script locally: > > http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html > > I've found the installation straightforward and the VEP is a great resource. > > Valerie > > On 02/04/2014 11:38 AM, Valerie Obenchain wrote: >> Hi Janet, >> >> Last week we enabled findOverlaps(..., type='within') to work on >> circular chromosomes. It was this restriction that prevented >> locateVariants() and predictCoding() from handling ChrM. >> VariantAnnotation 1.9.34 in devel has the most recent changes. >> >> The output of locateVariants() includes ChrM because the function is >> reporting where the range falls wrt the gene (coding, utr, intron, >> etc.). predictCoding() however only reports coding variants. If the >> annotation you supply does not have any coding regions for ChrM or if >> the ranges don't fall in the coding regions then none will be reported. >> To confirm your annotation has coding regions for ChrM: >> >> cds <- cdsBy(txdb) >> cds[seqnames(cds) %in% "chrM"] ## or whatever the proper name is >> >> #1: >> I've added support for the 'genetic.code' and 'if.fuzzy.codon' args to >> translate(). To use a different genetic code just pass the named arg >> ('genetic.code') to predictCoding(). >> >> #2: >> We've tried to handle the issue of ChrM through making findOverlaps() >> behave appropriatly with circular chromosomes. The 'type' argument has >> several options (start, end, any, within, equal). This allows quite a >> bit of flexibility. When the annotation has an ORF that spans the >> start/end findOverlaps() will still behave appropriately according to >> 'type'. >> >> ## annotation with seqlength 9 >> genes <- GRanges(seqnames=rep.int("A", 4), >> IRanges(start=c(2, 4, 6, 8), width=3)) >> seqinfo(genes) <- Seqinfo(seqnames="A", seqlengths=9, isCircular=TRUE) >> >> ## both ranges span the start/end >> ranges <- GRanges(seqnames=rep.int("A", 2), >> IRanges(9, width=c(2, 4))) >> >>>> findOverlaps(ranges, genes, type="any") >>> Hits of length 3 >>> queryLength: 2 >>> subjectLength: 4 >>> queryHits subjectHits >>> <integer> <integer> >>> 1 1 4 >>> 2 2 1 >>> 3 2 4 >> >>>> findOverlaps(ranges, genes, type="within") >>> Hits of length 1 >>> queryLength: 2 >>> subjectLength: 4 >>> queryHits subjectHits >>> <integer> <integer> >>> 1 1 4 >> >> With the combination of findOverlaps() now working on circular >> chromosomes for all values of 'type' and Herve adding the new genetic >> codes to Biostrings there should be no need to ignore ChrM. If you run >> into trouble or if anything look strange please let us know. >> >> Thanks. >> Valerie >> >> >> >> On 02/04/2014 04:08 AM, Hervé Pagès wrote: >>> Hi Janet, >>> >>> On 02/03/2014 07:47 PM, Janet Young wrote: >>>> Hi there, (I think it'll probably be Valerie looking at this question >>>> - hi Valerie), >>>> >>>> I'm just beginning to look at using VariantAnnotation to annotate some >>>> SNPs I've called on some yeast data (sacCer3). I can see this will be >>>> a really useful package for me - thanks! >>>> >>>> I can see that chrM (mitochiondrial) SNPs are currently not included >>>> in the output of predictCoding, and then using locateVariants, all of >>>> chrM SNPs get annotated as intergenic/NA (with a warning, that we >>>> ignore circular chromosomes). I can understand why that is - circular >>>> chromosomes, and a different genetic code make it trickier. Fair >>>> enough. >>>> >>>> I'm wondering what the prospects are regarding chrM SNPs in the future >>>> - any plans to include those later? >>>> >>>> I'm also wondering whether I can use some hacks to get chrM SNPs >>>> annotated. Two questions/potential issues related to that I wanted to >>>> ask you guys about: >>>> >>>> 1. are alternative codon tables already supported anywhere in >>>> Bioconductor? Using "?GENETIC_CODE" it looks like this is defined >>>> in Biostrings, and it looks like only the standard nuclear code is >>>> defined. Are the various alternative genetic codes defined anywhere? >>>> For this project, I'm interested in the yeast mitochondrial code, and >>>> for another I'm interested in the fly mitochondrial code. It'd be >>>> great if we could have all the codes available (I've got another >>>> project looking at ciliate nuclear sequences, for example - not >>>> working with translations yet, but maybe later...) >>>> >>>> With a little work, I'll be able to save flat files from NCBI >>>> (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi), and read >>>> those in and transform them to a character vector that looks like >>>> GENETIC_CODE. But I realise it might be something useful to have >>>> encoded more centrally, so thought I'd ask. >>> >>> What a timely question! I'll let Val answer the questions about >>> support of mitochiondrial DNA in predictCoding() but I can answer >>> that particular one. Last week I added a bunch of non standard genetic >>> codes to Biostrings (2.31.12). To get the genetic code for Yeast >>> Mitochondrial, do: >>> >>> > getGeneticCode("SGC2") >>> TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT >>> CTC CTA CTG >>> "F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "W" "W" "T" >>> "T" "T" "T" >>> CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG ACT >>> ACC ACA ACG >>> "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "M" "M" "T" >>> "T" "T" "T" >>> AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG GAT >>> GAC GAA GAG >>> "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" "A" "A" "D" >>> "D" "E" "E" >>> GGT GGC GGA GGG >>> "G" "G" "G" "G" >>> >>> Its format is the same as for GENETIC_CODE. See ?GENETIC_CODE for >>> the details. >>> >>> I also added the 'genetic.code' arg to translate() so you can supply >>> an alternate genetic code to use for translation. See ?translate for >>> the details. >>> >>> Please let me know if you find any issues, have questions, or want >>> to suggest improvements to these new features. >>> >>> Thanks, >>> H. >>> >>>> >>>> 2. What issues should I think about for the circular chromosomes? >>>> I'm thinking of a slightly hacky solution where I ignore any >>>> annotated ORFs that wrap around from the end of the chromosome to the >>>> beginning, and then just treating it as a linear chromosome. >>>> Actually, in my case (using sacCer3) there are no ORFs spanning the >>>> break in the circular chromosome, so I don't think I'll miss any >>>> annotations. Turns out the same is true for human (hg19 knownGene >>>> annotations), so maybe the circular chromosome issue isn't such a big >>>> issue after all? >>>> >>>> It seems like that should work, but any thoughts from you - you've >>>> thought about these questions a lot more than I have? >>>> >>>> Looking forward to hearing any thoughts you have. I know sometimes >>>> people just ignore the chrM SNPs, but it'd be nice to take a slightly >>>> more comprehensive approach if possible. >>>> >>>> thanks in advance for any input you have, >>>> >>>> Janet >>>> >>>> >>>> ------------------------------------------------------------------- >>>> >>>> Dr. Janet Young >>>> >>>> Malik lab >>>> http://research.fhcrc.org/malik/en.html >>>> >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Avenue N., A2-025, >>>> P.O. Box 19024, Seattle, WA 98109-1024, USA. >>>> >>>> tel: (206) 667 4512 >>>> email: jayoung ...at... fhcrc.org >>>> >>>> ------------------------------------------------------------------- >>>> >>>> >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >> >> > > > -- > Valerie Obenchain > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B155 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: vobencha at fhcrc.org > Phone: (206) 667-3158 > Fax: (206) 667-1319

ADD REPLY • link 12.0 years ago Janet Young ▴ 740

0

Entering edit mode

Hi, On Tue, Feb 4, 2014 at 6:46 PM, Janet Young <jayoung at="" fhcrc.org=""> wrote: > Hi Valerie, > > Thanks for both of these emails. It's very nice for me that you have implemented these changes already! It looks like I'll have to wait until tomorrow for version 1.9.34 of VariantAnnotation - I'll definitely try it then. > > I'm also taking a look at ensemblVEP now - thanks for the tip. I actually began this whole mission last week by trying to use the VEP script on my vcf files from the command line. The script (and the bioconductor wrapper) works fine for me on test human data. I love all the extra annotations that are available via VEP for human (SIFT, polyphen). Although this isn't a bioconductor thing, I can't help but point out this new paper since VEP came up: A general framework for estimating the relative pathogenicity of human genetic variants http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.2892.html They also provide precomputed "CADD scores" (something like polyphen or sift) for "all 8.6 billion possible single nucleotide variants (SNVs) of the reference genome, as well as all variants from the 1000 Genome and ESP variant releases and enable scoring of short insertions/deletions." Thought it might be useful ... -steve -- Steve Lianoglou Computational Biologist Genentech

ADD REPLY • link 12.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Yes, the chromosome names are a pain. A new member in our group, Sonali, is working on a package called Seqnames that will help with renaming, conversion between styles, etc.. When it's ready it will be announced on the list. Valerie On 02/04/2014 06:46 PM, Janet Young wrote: > Hi Valerie, > > Thanks for both of these emails. It's very nice for me that you have implemented these changes already! It looks like I'll have to wait until tomorrow for version 1.9.34 of VariantAnnotation - I'll definitely try it then. > > I'm also taking a look at ensemblVEP now - thanks for the tip. I actually began this whole mission last week by trying to use the VEP script on my vcf files from the command line. The script (and the bioconductor wrapper) works fine for me on test human data. I love all the extra annotations that are available via VEP for human (SIFT, polyphen). > > I'd previously run into problems with my yeast data, due to the vagaries of differences in chromosome naming at Ensembl and elsewhere (my yeast SNP mappings all refer to chrI, chrII, chrM etc, rather than Ensembl's I, II and Mito). That's one reason I thought it might be easier to use VariantAnnotation. Anyway, I'll bite the bullet and will try hacking my input vcf file to rename the chromosomes as Ensembl wants them. These differences in chromosome naming will drive me batty, I think.... > > Janet > > > > > > On Feb 4, 2014, at 11:43 AM, Valerie Obenchain wrote: > >> Another thing I forgot to mention ... >> >> If you're getting into variant work you may find ensemblVEP useful (also VariantTools). The ensemblVEP package wraps the Ensembl Variant Effect Predictor and allows detailed annotation of variants. You do need to install this perl script locally: >> >> http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html >> >> I've found the installation straightforward and the VEP is a great resource. >> >> Valerie >> >> On 02/04/2014 11:38 AM, Valerie Obenchain wrote: >>> Hi Janet, >>> >>> Last week we enabled findOverlaps(..., type='within') to work on >>> circular chromosomes. It was this restriction that prevented >>> locateVariants() and predictCoding() from handling ChrM. >>> VariantAnnotation 1.9.34 in devel has the most recent changes. >>> >>> The output of locateVariants() includes ChrM because the function is >>> reporting where the range falls wrt the gene (coding, utr, intron, >>> etc.). predictCoding() however only reports coding variants. If the >>> annotation you supply does not have any coding regions for ChrM or if >>> the ranges don't fall in the coding regions then none will be reported. >>> To confirm your annotation has coding regions for ChrM: >>> >>> cds <- cdsBy(txdb) >>> cds[seqnames(cds) %in% "chrM"] ## or whatever the proper name is >>> >>> #1: >>> I've added support for the 'genetic.code' and 'if.fuzzy.codon' args to >>> translate(). To use a different genetic code just pass the named arg >>> ('genetic.code') to predictCoding(). >>> >>> #2: >>> We've tried to handle the issue of ChrM through making findOverlaps() >>> behave appropriatly with circular chromosomes. The 'type' argument has >>> several options (start, end, any, within, equal). This allows quite a >>> bit of flexibility. When the annotation has an ORF that spans the >>> start/end findOverlaps() will still behave appropriately according to >>> 'type'. >>> >>> ## annotation with seqlength 9 >>> genes <- GRanges(seqnames=rep.int("A", 4), >>> IRanges(start=c(2, 4, 6, 8), width=3)) >>> seqinfo(genes) <- Seqinfo(seqnames="A", seqlengths=9, isCircular=TRUE) >>> >>> ## both ranges span the start/end >>> ranges <- GRanges(seqnames=rep.int("A", 2), >>> IRanges(9, width=c(2, 4))) >>> >>>>> findOverlaps(ranges, genes, type="any") >>>> Hits of length 3 >>>> queryLength: 2 >>>> subjectLength: 4 >>>> queryHits subjectHits >>>> <integer> <integer> >>>> 1 1 4 >>>> 2 2 1 >>>> 3 2 4 >>> >>>>> findOverlaps(ranges, genes, type="within") >>>> Hits of length 1 >>>> queryLength: 2 >>>> subjectLength: 4 >>>> queryHits subjectHits >>>> <integer> <integer> >>>> 1 1 4 >>> >>> With the combination of findOverlaps() now working on circular >>> chromosomes for all values of 'type' and Herve adding the new genetic >>> codes to Biostrings there should be no need to ignore ChrM. If you run >>> into trouble or if anything look strange please let us know. >>> >>> Thanks. >>> Valerie >>> >>> >>> >>> On 02/04/2014 04:08 AM, Hervé Pagès wrote: >>>> Hi Janet, >>>> >>>> On 02/03/2014 07:47 PM, Janet Young wrote: >>>>> Hi there, (I think it'll probably be Valerie looking at this question >>>>> - hi Valerie), >>>>> >>>>> I'm just beginning to look at using VariantAnnotation to annotate some >>>>> SNPs I've called on some yeast data (sacCer3). I can see this will be >>>>> a really useful package for me - thanks! >>>>> >>>>> I can see that chrM (mitochiondrial) SNPs are currently not included >>>>> in the output of predictCoding, and then using locateVariants, all of >>>>> chrM SNPs get annotated as intergenic/NA (with a warning, that we >>>>> ignore circular chromosomes). I can understand why that is - circular >>>>> chromosomes, and a different genetic code make it trickier. Fair >>>>> enough. >>>>> >>>>> I'm wondering what the prospects are regarding chrM SNPs in the future >>>>> - any plans to include those later? >>>>> >>>>> I'm also wondering whether I can use some hacks to get chrM SNPs >>>>> annotated. Two questions/potential issues related to that I wanted to >>>>> ask you guys about: >>>>> >>>>> 1. are alternative codon tables already supported anywhere in >>>>> Bioconductor? Using "?GENETIC_CODE" it looks like this is defined >>>>> in Biostrings, and it looks like only the standard nuclear code is >>>>> defined. Are the various alternative genetic codes defined anywhere? >>>>> For this project, I'm interested in the yeast mitochondrial code, and >>>>> for another I'm interested in the fly mitochondrial code. It'd be >>>>> great if we could have all the codes available (I've got another >>>>> project looking at ciliate nuclear sequences, for example - not >>>>> working with translations yet, but maybe later...) >>>>> >>>>> With a little work, I'll be able to save flat files from NCBI >>>>> (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi), and read >>>>> those in and transform them to a character vector that looks like >>>>> GENETIC_CODE. But I realise it might be something useful to have >>>>> encoded more centrally, so thought I'd ask. >>>> >>>> What a timely question! I'll let Val answer the questions about >>>> support of mitochiondrial DNA in predictCoding() but I can answer >>>> that particular one. Last week I added a bunch of non standard genetic >>>> codes to Biostrings (2.31.12). To get the genetic code for Yeast >>>> Mitochondrial, do: >>>> >>>> > getGeneticCode("SGC2") >>>> TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT >>>> CTC CTA CTG >>>> "F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "W" "W" "T" >>>> "T" "T" "T" >>>> CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG ACT >>>> ACC ACA ACG >>>> "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "M" "M" "T" >>>> "T" "T" "T" >>>> AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG GAT >>>> GAC GAA GAG >>>> "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" "A" "A" "D" >>>> "D" "E" "E" >>>> GGT GGC GGA GGG >>>> "G" "G" "G" "G" >>>> >>>> Its format is the same as for GENETIC_CODE. See ?GENETIC_CODE for >>>> the details. >>>> >>>> I also added the 'genetic.code' arg to translate() so you can supply >>>> an alternate genetic code to use for translation. See ?translate for >>>> the details. >>>> >>>> Please let me know if you find any issues, have questions, or want >>>> to suggest improvements to these new features. >>>> >>>> Thanks, >>>> H. >>>> >>>>> >>>>> 2. What issues should I think about for the circular chromosomes? >>>>> I'm thinking of a slightly hacky solution where I ignore any >>>>> annotated ORFs that wrap around from the end of the chromosome to the >>>>> beginning, and then just treating it as a linear chromosome. >>>>> Actually, in my case (using sacCer3) there are no ORFs spanning the >>>>> break in the circular chromosome, so I don't think I'll miss any >>>>> annotations. Turns out the same is true for human (hg19 knownGene >>>>> annotations), so maybe the circular chromosome issue isn't such a big >>>>> issue after all? >>>>> >>>>> It seems like that should work, but any thoughts from you - you've >>>>> thought about these questions a lot more than I have? >>>>> >>>>> Looking forward to hearing any thoughts you have. I know sometimes >>>>> people just ignore the chrM SNPs, but it'd be nice to take a slightly >>>>> more comprehensive approach if possible. >>>>> >>>>> thanks in advance for any input you have, >>>>> >>>>> Janet >>>>> >>>>> >>>>> ------------------------------------------------------------------- >>>>> >>>>> Dr. Janet Young >>>>> >>>>> Malik lab >>>>> http://research.fhcrc.org/malik/en.html >>>>> >>>>> Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Avenue N., A2-025, >>>>> P.O. Box 19024, Seattle, WA 98109-1024, USA. >>>>> >>>>> tel: (206) 667 4512 >>>>> email: jayoung ...at... fhcrc.org >>>>> >>>>> ------------------------------------------------------------------- >>>>> >>>>> >>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>> >>> >> >> >> -- >> Valerie Obenchain >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B155 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: vobencha at fhcrc.org >> Phone: (206) 667-3158 >> Fax: (206) 667-1319 > -- Valerie Obenchain Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B155 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: vobencha at fhcrc.org Phone: (206) 667-3158 Fax: (206) 667-1319

ADD REPLY • link 12.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Valerie, I am excited about these changes and can't wait to use them. For clarification, to use genetic.code, I would pass in an argument like this: predictCoding(..., genetic.code=mycode) where 'mycode' is a named vector of same format as GENETIC_CODE used in translate(). I haven't been using devel in while. Any idea when this might make its way into release? Thanks, Sean > -----Original Message----- > From: Valerie Obenchain [mailto:vobencha at fhcrc.org] > Sent: Tuesday, February 04, 2014 11:38 AM > To: Pages, Herve; Young, Janet; bioconductor at r-project.org; Taylor, Sean D > Subject: Re: [BioC] variantAnnotation: alternative GENETIC_CODE, and > circular chromosomes? > > Hi Janet, > > Last week we enabled findOverlaps(..., type='within') to work on circular > chromosomes. It was this restriction that prevented > locateVariants() and predictCoding() from handling ChrM. > VariantAnnotation 1.9.34 in devel has the most recent changes. > > The output of locateVariants() includes ChrM because the function is > reporting where the range falls wrt the gene (coding, utr, intron, etc.). > predictCoding() however only reports coding variants. If the annotation you > supply does not have any coding regions for ChrM or if the ranges don't fall in > the coding regions then none will be reported. > To confirm your annotation has coding regions for ChrM: > > cds <- cdsBy(txdb) > cds[seqnames(cds) %in% "chrM"] ## or whatever the proper name is > > #1: > I've added support for the 'genetic.code' and 'if.fuzzy.codon' args to > translate(). To use a different genetic code just pass the named arg > ('genetic.code') to predictCoding(). > > #2: > We've tried to handle the issue of ChrM through making findOverlaps() > behave appropriatly with circular chromosomes. The 'type' argument has > several options (start, end, any, within, equal). This allows quite a bit of > flexibility. When the annotation has an ORF that spans the start/end > findOverlaps() will still behave appropriately according to 'type'. > > ## annotation with seqlength 9 > genes <- GRanges(seqnames=rep.int("A", 4), > IRanges(start=c(2, 4, 6, 8), width=3)) > seqinfo(genes) <- Seqinfo(seqnames="A", seqlengths=9, isCircular=TRUE) > > ## both ranges span the start/end > ranges <- GRanges(seqnames=rep.int("A", 2), > IRanges(9, width=c(2, 4))) > > >> findOverlaps(ranges, genes, type="any") > > Hits of length 3 > > queryLength: 2 > > subjectLength: 4 > > queryHits subjectHits > > <integer> <integer> > > 1 1 4 > > 2 2 1 > > 3 2 4 > > >> findOverlaps(ranges, genes, type="within") > > Hits of length 1 > > queryLength: 2 > > subjectLength: 4 > > queryHits subjectHits > > <integer> <integer> > > 1 1 4 > > With the combination of findOverlaps() now working on circular > chromosomes for all values of 'type' and Herve adding the new genetic codes > to Biostrings there should be no need to ignore ChrM. If you run into trouble > or if anything look strange please let us know. > > Thanks. > Valerie > > > > On 02/04/2014 04:08 AM, Hervé Pagès wrote: > > Hi Janet, > > > > On 02/03/2014 07:47 PM, Janet Young wrote: > >> Hi there, (I think it'll probably be Valerie looking at this > >> question > >> - hi Valerie), > >> > >> I'm just beginning to look at using VariantAnnotation to annotate > >> some SNPs I've called on some yeast data (sacCer3). I can see this > >> will be a really useful package for me - thanks! > >> > >> I can see that chrM (mitochiondrial) SNPs are currently not included > >> in the output of predictCoding, and then using locateVariants, all of > >> chrM SNPs get annotated as intergenic/NA (with a warning, that we > >> ignore circular chromosomes). I can understand why that is - > >> circular chromosomes, and a different genetic code make it trickier. Fair > enough. > >> > >> I'm wondering what the prospects are regarding chrM SNPs in the > >> future > >> - any plans to include those later? > >> > >> I'm also wondering whether I can use some hacks to get chrM SNPs > >> annotated. Two questions/potential issues related to that I wanted to > >> ask you guys about: > >> > >> 1. are alternative codon tables already supported anywhere in > >> Bioconductor? Using "?GENETIC_CODE" it looks like this is defined > >> in Biostrings, and it looks like only the standard nuclear code is > >> defined. Are the various alternative genetic codes defined anywhere? > >> For this project, I'm interested in the yeast mitochondrial code, and > >> for another I'm interested in the fly mitochondrial code. It'd be > >> great if we could have all the codes available (I've got another > >> project looking at ciliate nuclear sequences, for example - not > >> working with translations yet, but maybe later...) > >> > >> With a little work, I'll be able to save flat files from NCBI > >> (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi), and read > >> those in and transform them to a character vector that looks like > >> GENETIC_CODE. But I realise it might be something useful to have > >> encoded more centrally, so thought I'd ask. > > > > What a timely question! I'll let Val answer the questions about > > support of mitochiondrial DNA in predictCoding() but I can answer that > > particular one. Last week I added a bunch of non standard genetic > > codes to Biostrings (2.31.12). To get the genetic code for Yeast > > Mitochondrial, do: > > > > > getGeneticCode("SGC2") > > TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG > CTT > > CTC CTA CTG > > "F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "W" "W" "T" > > "T" "T" "T" > > CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG > ACT > > ACC ACA ACG > > "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "M" "M" "T" > > "T" "T" "T" > > AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA > GCG GAT > > GAC GAA GAG > > "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" "A" "A" "D" > > "D" "E" "E" > > GGT GGC GGA GGG > > "G" "G" "G" "G" > > > > Its format is the same as for GENETIC_CODE. See ?GENETIC_CODE for the > > details. > > > > I also added the 'genetic.code' arg to translate() so you can supply > > an alternate genetic code to use for translation. See ?translate for > > the details. > > > > Please let me know if you find any issues, have questions, or want to > > suggest improvements to these new features. > > > > Thanks, > > H. > > > >> > >> 2. What issues should I think about for the circular chromosomes? > >> I'm thinking of a slightly hacky solution where I ignore any > >> annotated ORFs that wrap around from the end of the chromosome to > the > >> beginning, and then just treating it as a linear chromosome. > >> Actually, in my case (using sacCer3) there are no ORFs spanning the > >> break in the circular chromosome, so I don't think I'll miss any > >> annotations. Turns out the same is true for human (hg19 knownGene > >> annotations), so maybe the circular chromosome issue isn't such a big > >> issue after all? > >> > >> It seems like that should work, but any thoughts from you - you've > >> thought about these questions a lot more than I have? > >> > >> Looking forward to hearing any thoughts you have. I know sometimes > >> people just ignore the chrM SNPs, but it'd be nice to take a slightly > >> more comprehensive approach if possible. > >> > >> thanks in advance for any input you have, > >> > >> Janet > >> > >> > >> ------------------------------------------------------------------- > >> > >> Dr. Janet Young > >> > >> Malik lab > >> http://research.fhcrc.org/malik/en.html > >> > >> Fred Hutchinson Cancer Research Center > >> 1100 Fairview Avenue N., A2-025, > >> P.O. Box 19024, Seattle, WA 98109-1024, USA. > >> > >> tel: (206) 667 4512 > >> email: jayoung ...at... fhcrc.org > >> > >> ------------------------------------------------------------------- > >> > >> > >> > >> [[alternative HTML version deleted]] > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > > > > -- > Valerie Obenchain > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B155 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: vobencha at fhcrc.org > Phone: (206) 667-3158 > Fax: (206) 667-1319

ADD REPLY • link 12.0 years ago Taylor, Sean D ▴ 250

0

Entering edit mode

Hi, On 02/08/2014 04:24 PM, Taylor, Sean D wrote: > Valerie, > > I am excited about these changes and can't wait to use them. > > For clarification, to use genetic.code, I would pass in an argument like this: > > predictCoding(..., genetic.code=mycode) > > where 'mycode' is a named vector of same format as GENETIC_CODE used in translate(). Yes, that's correct. You can supply your own or specify one of the new codes Herve has added to the GENETIC_CODE_TABLE object (also in devel). The vertabrate mitochondrial (#2) might be what you were looking for. >> GENETIC_CODE_TABLE$name > [1] "Standard" > [2] "Vertebrate Mitochondrial" > [3] "Yeast Mitochondrial" > [4] "Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate Mitochondrial; Mycoplasma; Spiroplasma" > [5] "Invertebrate Mitochondrial" > [6] "Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear" > [7] "Echinoderm Mitochondrial; Flatworm Mitochondrial" > [8] "Euplotid Nuclear" > [9] "Bacterial, Archaeal and Plant Plastid" > [10] "Alternative Yeast Nuclear" > [11] "Ascidian Mitochondrial" > [12] "Alternative Flatworm Mitochondrial" > [13] "Blepharisma Macronuclear" > [14] "Chlorophycean Mitochondrial" > [15] "Trematode Mitochondrial" > [16] "Scenedesmus obliquus Mitochondrial" > [17] "Thraustochytrium Mitochondrial" > [18] "Pterobranchia Mitochondrial" > > I haven't been using devel in while. Any idea when this might make its way into release? In general we only push changes to release branch if it's a bug fix. Since this is a new feature / enhancement it will only be available in the devel branch until the next formal Bioconductor release in March. If you're interested in using devel there are instructions for installing from svn in section 1.2.1 of the 'R Installation and Administration Manual'. We can also help with any questions you might have. http://www.r-project.org/ Valerie > > Thanks, > Sean > >> -----Original Message----- >> From: Valerie Obenchain [mailto:vobencha at fhcrc.org] >> Sent: Tuesday, February 04, 2014 11:38 AM >> To: Pages, Herve; Young, Janet; bioconductor at r-project.org; Taylor, Sean D >> Subject: Re: [BioC] variantAnnotation: alternative GENETIC_CODE, and >> circular chromosomes? >> >> Hi Janet, >> >> Last week we enabled findOverlaps(..., type='within') to work on circular >> chromosomes. It was this restriction that prevented >> locateVariants() and predictCoding() from handling ChrM. >> VariantAnnotation 1.9.34 in devel has the most recent changes. >> >> The output of locateVariants() includes ChrM because the function is >> reporting where the range falls wrt the gene (coding, utr, intron, etc.). >> predictCoding() however only reports coding variants. If the annotation you >> supply does not have any coding regions for ChrM or if the ranges don't fall in >> the coding regions then none will be reported. >> To confirm your annotation has coding regions for ChrM: >> >> cds <- cdsBy(txdb) >> cds[seqnames(cds) %in% "chrM"] ## or whatever the proper name is >> >> #1: >> I've added support for the 'genetic.code' and 'if.fuzzy.codon' args to >> translate(). To use a different genetic code just pass the named arg >> ('genetic.code') to predictCoding(). >> >> #2: >> We've tried to handle the issue of ChrM through making findOverlaps() >> behave appropriatly with circular chromosomes. The 'type' argument has >> several options (start, end, any, within, equal). This allows quite a bit of >> flexibility. When the annotation has an ORF that spans the start/end >> findOverlaps() will still behave appropriately according to 'type'. >> >> ## annotation with seqlength 9 >> genes <- GRanges(seqnames=rep.int("A", 4), >> IRanges(start=c(2, 4, 6, 8), width=3)) >> seqinfo(genes) <- Seqinfo(seqnames="A", seqlengths=9, isCircular=TRUE) >> >> ## both ranges span the start/end >> ranges <- GRanges(seqnames=rep.int("A", 2), >> IRanges(9, width=c(2, 4))) >> >>>> findOverlaps(ranges, genes, type="any") >>> Hits of length 3 >>> queryLength: 2 >>> subjectLength: 4 >>> queryHits subjectHits >>> <integer> <integer> >>> 1 1 4 >>> 2 2 1 >>> 3 2 4 >> >>>> findOverlaps(ranges, genes, type="within") >>> Hits of length 1 >>> queryLength: 2 >>> subjectLength: 4 >>> queryHits subjectHits >>> <integer> <integer> >>> 1 1 4 >> >> With the combination of findOverlaps() now working on circular >> chromosomes for all values of 'type' and Herve adding the new genetic codes >> to Biostrings there should be no need to ignore ChrM. If you run into trouble >> or if anything look strange please let us know. >> >> Thanks. >> Valerie >> >> >> >> On 02/04/2014 04:08 AM, Hervé Pagès wrote: >>> Hi Janet, >>> >>> On 02/03/2014 07:47 PM, Janet Young wrote: >>>> Hi there, (I think it'll probably be Valerie looking at this >>>> question >>>> - hi Valerie), >>>> >>>> I'm just beginning to look at using VariantAnnotation to annotate >>>> some SNPs I've called on some yeast data (sacCer3). I can see this >>>> will be a really useful package for me - thanks! >>>> >>>> I can see that chrM (mitochiondrial) SNPs are currently not included >>>> in the output of predictCoding, and then using locateVariants, all of >>>> chrM SNPs get annotated as intergenic/NA (with a warning, that we >>>> ignore circular chromosomes). I can understand why that is - >>>> circular chromosomes, and a different genetic code make it trickier. Fair >> enough. >>>> >>>> I'm wondering what the prospects are regarding chrM SNPs in the >>>> future >>>> - any plans to include those later? >>>> >>>> I'm also wondering whether I can use some hacks to get chrM SNPs >>>> annotated. Two questions/potential issues related to that I wanted to >>>> ask you guys about: >>>> >>>> 1. are alternative codon tables already supported anywhere in >>>> Bioconductor? Using "?GENETIC_CODE" it looks like this is defined >>>> in Biostrings, and it looks like only the standard nuclear code is >>>> defined. Are the various alternative genetic codes defined anywhere? >>>> For this project, I'm interested in the yeast mitochondrial code, and >>>> for another I'm interested in the fly mitochondrial code. It'd be >>>> great if we could have all the codes available (I've got another >>>> project looking at ciliate nuclear sequences, for example - not >>>> working with translations yet, but maybe later...) >>>> >>>> With a little work, I'll be able to save flat files from NCBI >>>> (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi), and read >>>> those in and transform them to a character vector that looks like >>>> GENETIC_CODE. But I realise it might be something useful to have >>>> encoded more centrally, so thought I'd ask. >>> >>> What a timely question! I'll let Val answer the questions about >>> support of mitochiondrial DNA in predictCoding() but I can answer that >>> particular one. Last week I added a bunch of non standard genetic >>> codes to Biostrings (2.31.12). To get the genetic code for Yeast >>> Mitochondrial, do: >>> >>> > getGeneticCode("SGC2") >>> TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG >> CTT >>> CTC CTA CTG >>> "F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "W" "W" "T" >>> "T" "T" "T" >>> CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG >> ACT >>> ACC ACA ACG >>> "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "M" "M" "T" >>> "T" "T" "T" >>> AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA >> GCG GAT >>> GAC GAA GAG >>> "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" "A" "A" "D" >>> "D" "E" "E" >>> GGT GGC GGA GGG >>> "G" "G" "G" "G" >>> >>> Its format is the same as for GENETIC_CODE. See ?GENETIC_CODE for the >>> details. >>> >>> I also added the 'genetic.code' arg to translate() so you can supply >>> an alternate genetic code to use for translation. See ?translate for >>> the details. >>> >>> Please let me know if you find any issues, have questions, or want to >>> suggest improvements to these new features. >>> >>> Thanks, >>> H. >>> >>>> >>>> 2. What issues should I think about for the circular chromosomes? >>>> I'm thinking of a slightly hacky solution where I ignore any >>>> annotated ORFs that wrap around from the end of the chromosome to >> the >>>> beginning, and then just treating it as a linear chromosome. >>>> Actually, in my case (using sacCer3) there are no ORFs spanning the >>>> break in the circular chromosome, so I don't think I'll miss any >>>> annotations. Turns out the same is true for human (hg19 knownGene >>>> annotations), so maybe the circular chromosome issue isn't such a big >>>> issue after all? >>>> >>>> It seems like that should work, but any thoughts from you - you've >>>> thought about these questions a lot more than I have? >>>> >>>> Looking forward to hearing any thoughts you have. I know sometimes >>>> people just ignore the chrM SNPs, but it'd be nice to take a slightly >>>> more comprehensive approach if possible. >>>> >>>> thanks in advance for any input you have, >>>> >>>> Janet >>>> >>>> >>>> ------------------------------------------------------------------- >>>> >>>> Dr. Janet Young >>>> >>>> Malik lab >>>> http://research.fhcrc.org/malik/en.html >>>> >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Avenue N., A2-025, >>>> P.O. Box 19024, Seattle, WA 98109-1024, USA. >>>> >>>> tel: (206) 667 4512 >>>> email: jayoung ...at... fhcrc.org >>>> >>>> ------------------------------------------------------------------- >>>> >>>> >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >> >> >> -- >> Valerie Obenchain >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B155 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: vobencha at fhcrc.org >> Phone: (206) 667-3158 >> Fax: (206) 667-1319 -- Valerie Obenchain Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B155 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: vobencha at fhcrc.org Phone: (206) 667-3158 Fax: (206) 667-1319

ADD REPLY • link 12.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Herve, That's great! That'll be useful. Funny timing. Given that predictCoding is using the translate function, hopefully it will be quite easy for you guys to add the genetic.code arg to that translate instance somehow. I guess many SNP collections (CollapsedVCF or whatever) will contain BOTH autosomal and mitochondrial SNPs, so would use a mix of genetic codes. I'm sure in the long run you'd aim to have the function deal with that mix. But perhaps for now I will be able do something like this (pseudocode) 1. split my SNPs into autosomals and mitcohondrial SNPs 2. predictions1 <- predictCoding ( autosomalSNPs , txDB, seqSource=Scerevisiae) 3. GENETIC_CODE <- getGeneticCode("SGC2") 4. fake predictCoding into ignoring the fact that chrM is circular 5. predictions2 <- predictCoding ( mitochondrialSNPs , txDB, seqSource=Scerevisiae) 6. GENETIC_CODE <- getGeneticCode("SGC0") 7. combine predictions1 and predictions2 Does that make sense? Looking forward to Val's input on the circular chromosome issue. I'm not 100% sure how to do step 4 above right now. I thought this might work: isCircular(Scerevisiae) <- rep(FALSE, length( isCircular(Scerevisiae) ) ) but I get an error. Error in `seqinfo<-`(`*tmp*`, value = <s4 object="" of="" class="" "seqinfo"="">) : 'new2old' must be specified when replacing the 'seqinfo' of a BSgenome object I'll think on it some more, too. thanks On Feb 4, 2014, at 4:08 AM, Hervé Pagès wrote: > Hi Janet, > > On 02/03/2014 07:47 PM, Janet Young wrote: >> Hi there, (I think it'll probably be Valerie looking at this question - hi Valerie), >> >> I'm just beginning to look at using VariantAnnotation to annotate some SNPs I've called on some yeast data (sacCer3). I can see this will be a really useful package for me - thanks! >> >> I can see that chrM (mitochiondrial) SNPs are currently not included in the output of predictCoding, and then using locateVariants, all of chrM SNPs get annotated as intergenic/NA (with a warning, that we ignore circular chromosomes). I can understand why that is - circular chromosomes, and a different genetic code make it trickier. Fair enough. >> >> I'm wondering what the prospects are regarding chrM SNPs in the future - any plans to include those later? >> >> I'm also wondering whether I can use some hacks to get chrM SNPs annotated. Two questions/potential issues related to that I wanted to ask you guys about: >> >> 1. are alternative codon tables already supported anywhere in Bioconductor? Using "?GENETIC_CODE" it looks like this is defined in Biostrings, and it looks like only the standard nuclear code is defined. Are the various alternative genetic codes defined anywhere? For this project, I'm interested in the yeast mitochondrial code, and for another I'm interested in the fly mitochondrial code. It'd be great if we could have all the codes available (I've got another project looking at ciliate nuclear sequences, for example - not working with translations yet, but maybe later...) >> >> With a little work, I'll be able to save flat files from NCBI (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi), and read those in and transform them to a character vector that looks like GENETIC_CODE. But I realise it might be something useful to have encoded more centrally, so thought I'd ask. > > What a timely question! I'll let Val answer the questions about > support of mitochiondrial DNA in predictCoding() but I can answer > that particular one. Last week I added a bunch of non standard genetic > codes to Biostrings (2.31.12). To get the genetic code for Yeast > Mitochondrial, do: > > > getGeneticCode("SGC2") > TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT CTC CTA CTG > "F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "W" "W" "T" "T" "T" "T" > CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG ACT ACC ACA ACG > "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "M" "M" "T" "T" "T" "T" > AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG GAT GAC GAA GAG > "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" "A" "A" "D" "D" "E" "E" > GGT GGC GGA GGG > "G" "G" "G" "G" > > Its format is the same as for GENETIC_CODE. See ?GENETIC_CODE for > the details. > > I also added the 'genetic.code' arg to translate() so you can supply > an alternate genetic code to use for translation. See ?translate for > the details. > > Please let me know if you find any issues, have questions, or want > to suggest improvements to these new features. > > Thanks, > H. > >> >> 2. What issues should I think about for the circular chromosomes? I'm thinking of a slightly hacky solution where I ignore any annotated ORFs that wrap around from the end of the chromosome to the beginning, and then just treating it as a linear chromosome. Actually, in my case (using sacCer3) there are no ORFs spanning the break in the circular chromosome, so I don't think I'll miss any annotations. Turns out the same is true for human (hg19 knownGene annotations), so maybe the circular chromosome issue isn't such a big issue after all? >> >> It seems like that should work, but any thoughts from you - you've thought about these questions a lot more than I have? >> >> Looking forward to hearing any thoughts you have. I know sometimes people just ignore the chrM SNPs, but it'd be nice to take a slightly more comprehensive approach if possible. >> >> thanks in advance for any input you have, >> >> Janet >> >> >> ------------------------------------------------------------------- >> >> Dr. Janet Young >> >> Malik lab >> http://research.fhcrc.org/malik/en.html >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Avenue N., A2-025, >> P.O. Box 19024, Seattle, WA 98109-1024, USA. >> >> tel: (206) 667 4512 >> email: jayoung ...at... fhcrc.org >> >> ------------------------------------------------------------------- >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319

ADD REPLY • link 12.0 years ago Janet Young ▴ 740

Login before adding your answer.