problems with strand in predictCoding

0

Entering edit mode

Jeremiah Degenhardt ▴ 50

@jeremiah-degenhardt-4417

Last seen 9.6 years ago

Hello BioC, I am using the predictCoding function in the VariantAnnotation package and have run into a few issues. The first issue that I found is a small one. While I can supply any txdb object that I want to the predictCoding, the function seems to still be looking for?TxDb.Hsapiens.UCSC.hg19.knownGene. Running predict coding without this package installed results in the following error: Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) : ? error in evaluating the argument 'x' in selecting a method for function 'isCircular': Error: object 'TxDb.Hsapiens.UCSC.hg19.knownGene' not found The second set of issues are a little more subtle. It seems that predictCoding is not properly handling the strand of the variants, at least in my opinion. If I provide an unstranded GRanges with variants that overlap a gene on the negative strand, the predictCoding function will not reverse complement the varAllele resulting in an incorrect functional consequence prediction. Here is some code to generate an example library(VariantAnnotation) library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), variant = DNAStringSet("T")) predictCoding(GR, txdb, Hsapiens, values(GR)$variant) GRanges with 4 ranges and 13 elementMetadata cols: seqnames ranges strand | variant varAllele <rle> <iranges> <rle> | <dnastringset> <dnastringset> [1] chr17 [63554271, 63554271] * | T T [2] chr17 [63554271, 63554271] * | T T [3] chr17 [63554271, 63554271] * | T T [4] chr17 [63554271, 63554271] * | T T cdsLoc proteinLoc queryID txID cdsID <iranges> <compressedintegerlist> <integer> <character> <integer> [1] [468, 468] 156 1 65536 193561 [2] [468, 468] 156 1 65537 193561 [3] [468, 468] 156 1 65538 193561 [4] [468, 468] 156 1 65539 193564 geneID consequence refCodon varCodon refAA <character> <factor> <dnastringset> <dnastringset> <aastringset> [1] 8313 synonymous TAC TAT Y [2] 8313 synonymous TAC TAT Y [3] 8313 synonymous TAC TAT Y [4] 8313 synonymous TAC TAT Y varAA <aastringset> [1] Y [2] Y [3] Y [4] Y --- seqlengths: chr17 NA Running this the function predicts this to be TAC>?TAT, a synonymous change. However, as the gene is on the negative strand the actual result of this variant should be TAC>?TAA or a stop mutation. If I instead give predictCoding a stranded GRanges to let the function know that the base reported is with respect to the positive strand I get this result: > GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), strand="+", variant = DNAStringSet("T")) > predictCoding(GR, txdb, Hsapiens, values(GR)$variant) GRanges with 0 ranges and 0 elementMetadata cols: seqnames ranges strand <rle> <iranges> <rle> --- seqlengths: Now instead of reverse complementing the varAllele, and giving me the correct annotation, the function simply misses that the mutation overlaps with any gene and returns an empty GRanges. This issue is not really a bug in predictCoding, but results from what is (again in my opinion) an unfortunate treatment of the negative and positive strand as separate entities in GRanges. From a biological prospective this almost never makes sense and is not how I would expect the functions in GRanges to behave. Just because a gene is annotated on the negative strand does not mean that I don't want to know it overlaps with a gene on the positive strand. Maybe there are cases where this is the desired behavior, so it might make sense to add this a switch you could turn on or off but in most cases this will simply lead to erroneous results such as the one above. I think the most useful information that comes from knowing the strand is knowing when you need to reverse complement a sequence for the underlying position of interest. A third possible option here would be to call the function twice, once with for the positive strand and once for the negative strand with the variants reverse complemented and then merge the results. This however, seems quite inefficient and not like something that I should have to do. >From my perspective the proper behavior of predictCoding with respect to strand would be to treat unstranded GRanges as positive strand as this is the reference strand for things like the genome builds and vcf files. Then the variants should overlap positive and negative strand genes and should be reverse complemented for consequence prediction on the negative strand genes. It should also allow overlap of negative and positive stranded variants with both negative and positive stranded genes, but properly reverse complement the variant in each case to get the proper consequence. thanks in advance, Jeremiah here is my session info incase it is needed: > sessionInfo() R Under development (unstable) (2012-04-11 r58985) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 [2] GenomicFeatures_1.8.1 [3] AnnotationDbi_1.18.0 [4] Biobase_2.16.0 [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17 [6] BSgenome_1.24.0 [7] VariantAnnotation_1.2.5 [8] Rsamtools_1.8.1 [9] Biostrings_2.24.1 [10] GenomicRanges_1.8.3 [11] IRanges_1.14.2 [12] BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] biomaRt_2.12.0 bitops_1.0-4.1 compiler_2.16.0 DBI_0.2-5 [5] grid_2.16.0 lattice_0.20-6 Matrix_1.0-6 RCurl_1.91-1 [9] RSQLite_0.11.1 rtracklayer_1.16.1 snpStats_1.6.0 splines_2.16.0 [13] stats4_2.16.0 survival_2.36-12 tools_2.16.0 XML_3.9-4 [17] zlibbioc_1.2.0 -- Jeremiah Degenhardt, Ph.D. Computational Biologist Bioinformatics and Computational Biology Genentech, Inc. degenhardt.jeremiah at gene.com

VariantAnnotation Annotation BSgenome BSgenome VariantAnnotation VariantAnnotation BSgenome • 2.6k views

ADD COMMENT • link updated 12.0 years ago by bnorthoff@web.de ▴ 20 • written 12.0 years ago by Jeremiah Degenhardt ▴ 50

0

Entering edit mode

Sean Davis 21k

@sean-davis-490

Last seen 3 months ago

United States

On Thu, Apr 19, 2012 at 9:46 PM, Jeremiah Degenhardt <degenhardt.jeremiah at="" gene.com=""> wrote: > Hello BioC, > > I am using the predictCoding function in the VariantAnnotation package > and have run into a few issues. > > The first issue that I found is a small one. While I can supply any > txdb object that I want to the predictCoding, the function seems to > still be looking for?TxDb.Hsapiens.UCSC.hg19.knownGene. Running > predict coding without this package installed results in the following > error: > > Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) : > ? error in evaluating the argument 'x' in selecting a method for > function 'isCircular': Error: object > 'TxDb.Hsapiens.UCSC.hg19.knownGene' not found > > The second set of issues are a little more subtle. It seems that > predictCoding is not properly handling the strand of the variants, at > least in my opinion. > > If I provide an unstranded GRanges with variants that overlap a gene > on the negative strand, the predictCoding function will not reverse > complement the varAllele resulting in an incorrect functional > consequence prediction. Here is some code to generate an example > > library(VariantAnnotation) > library(BSgenome.Hsapiens.UCSC.hg19) > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), > variant = DNAStringSet("T")) > predictCoding(GR, txdb, Hsapiens, values(GR)$variant) > > GRanges with 4 ranges and 13 elementMetadata cols: > ? ? ?seqnames ? ? ? ? ? ? ? ranges strand | ? ? ? ?variant > varAllele > ? ? ? ? <rle> ? ? ? ? ? ?<iranges> ?<rle> | <dnastringset> > <dnastringset> > ?[1] ? ?chr17 [63554271, 63554271] ? ? ?* | ? ? ? ? ? ? ?T > ?T > ?[2] ? ?chr17 [63554271, 63554271] ? ? ?* | ? ? ? ? ? ? ?T > ?T > ?[3] ? ?chr17 [63554271, 63554271] ? ? ?* | ? ? ? ? ? ? ?T > ?T > ?[4] ? ?chr17 [63554271, 63554271] ? ? ?* | ? ? ? ? ? ? ?T > ?T > ? ? ? ? ?cdsLoc ? ? ? ? ? ? ?proteinLoc ? queryID ? ? ? ?txID > cdsID > ? ? ? <iranges> <compressedintegerlist> <integer> <character> > <integer> > ?[1] [468, 468] ? ? ? ? ? ? ? ? ? ? 156 ? ? ? ? 1 ? ? ? 65536 > 193561 > ?[2] [468, 468] ? ? ? ? ? ? ? ? ? ? 156 ? ? ? ? 1 ? ? ? 65537 > 193561 > ?[3] [468, 468] ? ? ? ? ? ? ? ? ? ? 156 ? ? ? ? 1 ? ? ? 65538 > 193561 > ?[4] [468, 468] ? ? ? ? ? ? ? ? ? ? 156 ? ? ? ? 1 ? ? ? 65539 > 193564 > ? ? ? ? ? geneID consequence ? ? ? refCodon ? ? ? varCodon > refAA > ? ? ?<character> ? ?<factor> <dnastringset> <dnastringset> > <aastringset> > ?[1] ? ? ? ?8313 ?synonymous ? ? ? ? ? ?TAC ? ? ? ? ? ?TAT > ?Y > ?[2] ? ? ? ?8313 ?synonymous ? ? ? ? ? ?TAC ? ? ? ? ? ?TAT > ?Y > ?[3] ? ? ? ?8313 ?synonymous ? ? ? ? ? ?TAC ? ? ? ? ? ?TAT > ?Y > ?[4] ? ? ? ?8313 ?synonymous ? ? ? ? ? ?TAC ? ? ? ? ? ?TAT > ?Y > ? ? ? ? ? ? ?varAA > ? ? ?<aastringset> > ?[1] ? ? ? ? ? ? Y > ?[2] ? ? ? ? ? ? Y > ?[3] ? ? ? ? ? ? Y > ?[4] ? ? ? ? ? ? Y > ?--- > ?seqlengths: > ? chr17 > ? ? ?NA > > Running this the function predicts this to be TAC>?TAT, a synonymous > change. However, as the gene is on the negative strand the actual > result of this variant should be TAC>?TAA or a stop mutation. > > If I instead give predictCoding a stranded GRanges to let the function > know that the base reported is with respect to the positive strand I > get this result: > >> GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), strand="+", variant = DNAStringSet("T")) >> predictCoding(GR, txdb, Hsapiens, values(GR)$variant) > GRanges with 0 ranges and 0 elementMetadata cols: > > ? seqnames ? ?ranges strand > ? ? ?<rle> <iranges> ?<rle> > ?--- > ?seqlengths: > > > Now instead of reverse complementing the varAllele, and giving me the > correct annotation, the function simply misses that the mutation > overlaps with any gene and returns an empty GRanges. > > This issue is not really a bug in predictCoding, but results from what > is (again in my opinion) an unfortunate treatment of the negative and > positive strand as separate entities in GRanges. From a biological > prospective this almost never makes sense and is not how I would > expect the functions in GRanges to behave. Just because a gene is > annotated on the negative strand does not mean that I don't want to > know it overlaps with a gene on the positive strand. Maybe there are > cases where this is the desired behavior, so it might make sense to > add this a switch you could turn on or off but in most cases this will > simply lead to erroneous results such as the one above. I think the > most useful information that comes from knowing the strand is knowing > when you need to reverse complement a sequence for the underlying > position of interest. > > A third possible option here would be to call the function twice, once > with for the positive strand and once for the negative strand with the > variants reverse complemented and then merge the results. This > however, seems quite inefficient and not like something that I should > have to do. > > >From my perspective the proper behavior of predictCoding with respect > to strand would be to treat unstranded GRanges as positive strand as > this is the reference strand for things like the genome builds and vcf > files. Then the variants should overlap positive and negative strand > genes and should be reverse complemented for consequence prediction on > the negative strand genes. It should also allow overlap of negative > and positive stranded variants with both negative and positive > stranded genes, but properly reverse complement the variant in each > case to get the proper consequence. I agree with Jeremiah that the treatment of unstranded variants should be to default to treating them as on the positive strand and reverse complement them for negative strand genes. All other variant prediction softwares that I know of make this assumption and the VCF format itself seems to make something of an implicit assumption on this point. Sean > thanks in advance, > > Jeremiah > > here is my session info incase it is needed: > >> sessionInfo() > R Under development (unstable) (2012-04-11 r58985) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > ?[1] LC_CTYPE=en_US.UTF-8 ? ? ? LC_NUMERIC=C > ?[3] LC_TIME=en_US.UTF-8 ? ? ? ?LC_COLLATE=en_US.UTF-8 > ?[5] LC_MONETARY=en_US.UTF-8 ? ?LC_MESSAGES=en_US.UTF-8 > ?[7] LC_PAPER=C ? ? ? ? ? ? ? ? LC_NAME=C > ?[9] LC_ADDRESS=C ? ? ? ? ? ? ? LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base > > other attached packages: > ?[1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 > ?[2] GenomicFeatures_1.8.1 > ?[3] AnnotationDbi_1.18.0 > ?[4] Biobase_2.16.0 > ?[5] BSgenome.Hsapiens.UCSC.hg19_1.3.17 > ?[6] BSgenome_1.24.0 > ?[7] VariantAnnotation_1.2.5 > ?[8] Rsamtools_1.8.1 > ?[9] Biostrings_2.24.1 > [10] GenomicRanges_1.8.3 > [11] IRanges_1.14.2 > [12] BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > ?[1] biomaRt_2.12.0 ? ? bitops_1.0-4.1 ? ? compiler_2.16.0 > DBI_0.2-5 > ?[5] grid_2.16.0 ? ? ? ?lattice_0.20-6 ? ? Matrix_1.0-6 > RCurl_1.91-1 > ?[9] RSQLite_0.11.1 ? ? rtracklayer_1.16.1 snpStats_1.6.0 > splines_2.16.0 > [13] stats4_2.16.0 ? ? ?survival_2.36-12 ? tools_2.16.0 > XML_3.9-4 > [17] zlibbioc_1.2.0 > > -- > Jeremiah Degenhardt, Ph.D. > Computational Biologist > Bioinformatics and Computational Biology > Genentech, Inc. > degenhardt.jeremiah at gene.com > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.0 years ago Sean Davis 21k

0

Entering edit mode

On 20.04.2012 02:59, Sean Davis wrote: > On Thu, Apr 19, 2012 at 9:46 PM, Jeremiah Degenhardt > <degenhardt.jeremiah at="" gene.com=""> wrote: >> >From my perspective the proper behavior of predictCoding with >> respect >> to strand would be to treat unstranded GRanges as positive strand as >> this is the reference strand for things like the genome builds and >> vcf >> files. Then the variants should overlap positive and negative strand >> genes and should be reverse complemented for consequence prediction >> on >> the negative strand genes. It should also allow overlap of negative >> and positive stranded variants with both negative and positive >> stranded genes, but properly reverse complement the variant in each >> case to get the proper consequence. > > I agree with Jeremiah that the treatment of unstranded variants > should > be to default to treating them as on the positive strand and reverse > complement them for negative strand genes. All other variant > prediction softwares that I know of make this assumption and the VCF > format itself seems to make something of an implicit assumption on > this point. > > Sean +1, also agree this would be helpful (and a reasonable assumption to make). -- Alex Gutteridge

ADD REPLY • link 12.0 years ago Alex Gutteridge ▴ 650

0

Entering edit mode

Hi Jeremiah and other coding predictors, On 04/19/2012 06:59 PM, Sean Davis wrote: > On Thu, Apr 19, 2012 at 9:46 PM, Jeremiah Degenhardt > <degenhardt.jeremiah at="" gene.com=""> wrote: >> Hello BioC, >> >> I am using the predictCoding function in the VariantAnnotation package >> and have run into a few issues. >> >> The first issue that I found is a small one. While I can supply any >> txdb object that I want to the predictCoding, the function seems to >> still be looking for TxDb.Hsapiens.UCSC.hg19.knownGene. Running >> predict coding without this package installed results in the following >> error: >> >> Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) : >> error in evaluating the argument 'x' in selecting a method for >> function 'isCircular': Error: object >> 'TxDb.Hsapiens.UCSC.hg19.knownGene' not found >> >> The second set of issues are a little more subtle. It seems that >> predictCoding is not properly handling the strand of the variants, at >> least in my opinion. >> >> If I provide an unstranded GRanges with variants that overlap a gene >> on the negative strand, the predictCoding function will not reverse >> complement the varAllele resulting in an incorrect functional >> consequence prediction. Here is some code to generate an example >> >> library(VariantAnnotation) >> library(BSgenome.Hsapiens.UCSC.hg19) >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene >> GR<- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), >> variant = DNAStringSet("T")) >> predictCoding(GR, txdb, Hsapiens, values(GR)$variant) >> >> GRanges with 4 ranges and 13 elementMetadata cols: >> seqnames ranges strand | variant >> varAllele >> <rle> <iranges> <rle> |<dnastringset> >> <dnastringset> >> [1] chr17 [63554271, 63554271] * | T >> T >> [2] chr17 [63554271, 63554271] * | T >> T >> [3] chr17 [63554271, 63554271] * | T >> T >> [4] chr17 [63554271, 63554271] * | T >> T >> cdsLoc proteinLoc queryID txID >> cdsID >> <iranges> <compressedintegerlist> <integer> <character> >> <integer> >> [1] [468, 468] 156 1 65536 >> 193561 >> [2] [468, 468] 156 1 65537 >> 193561 >> [3] [468, 468] 156 1 65538 >> 193561 >> [4] [468, 468] 156 1 65539 >> 193564 >> geneID consequence refCodon varCodon >> refAA >> <character> <factor> <dnastringset> <dnastringset> >> <aastringset> >> [1] 8313 synonymous TAC TAT >> Y >> [2] 8313 synonymous TAC TAT >> Y >> [3] 8313 synonymous TAC TAT >> Y >> [4] 8313 synonymous TAC TAT >> Y >> varAA >> <aastringset> >> [1] Y >> [2] Y >> [3] Y >> [4] Y >> --- >> seqlengths: >> chr17 >> NA >> >> Running this the function predicts this to be TAC> TAT, a synonymous >> change. However, as the gene is on the negative strand the actual >> result of this variant should be TAC> TAA or a stop mutation. >> >> If I instead give predictCoding a stranded GRanges to let the function >> know that the base reported is with respect to the positive strand I >> get this result: >> >>> GR<- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), strand="+", variant = DNAStringSet("T")) >>> predictCoding(GR, txdb, Hsapiens, values(GR)$variant) >> GRanges with 0 ranges and 0 elementMetadata cols: >> >> seqnames ranges strand >> <rle> <iranges> <rle> >> --- >> seqlengths: >> >> >> Now instead of reverse complementing the varAllele, and giving me the >> correct annotation, the function simply misses that the mutation >> overlaps with any gene and returns an empty GRanges. >> >> This issue is not really a bug in predictCoding, but results from what >> is (again in my opinion) an unfortunate treatment of the negative and >> positive strand as separate entities in GRanges. From a biological >> prospective this almost never makes sense and is not how I would >> expect the functions in GRanges to behave. Just because a gene is >> annotated on the negative strand does not mean that I don't want to >> know it overlaps with a gene on the positive strand. Maybe there are >> cases where this is the desired behavior, so it might make sense to >> add this a switch you could turn on or off but in most cases this will >> simply lead to erroneous results such as the one above. I think the >> most useful information that comes from knowing the strand is knowing >> when you need to reverse complement a sequence for the underlying >> position of interest. >> >> A third possible option here would be to call the function twice, once >> with for the positive strand and once for the negative strand with the >> variants reverse complemented and then merge the results. This >> however, seems quite inefficient and not like something that I should >> have to do. >> >> > From my perspective the proper behavior of predictCoding with respect >> to strand would be to treat unstranded GRanges as positive strand as >> this is the reference strand for things like the genome builds and vcf >> files. Then the variants should overlap positive and negative strand >> genes and should be reverse complemented for consequence prediction on >> the negative strand genes. It should also allow overlap of negative >> and positive stranded variants with both negative and positive >> stranded genes, but properly reverse complement the variant in each >> case to get the proper consequence. > > I agree with Jeremiah that the treatment of unstranded variants should > be to default to treating them as on the positive strand and reverse > complement them for negative strand genes. All other variant > prediction softwares that I know of make this assumption and the VCF > format itself seems to make something of an implicit assumption on > this point. That makes a lot of sense. In addition I think it would be nice if the GRanges object returned by predictCoding() was "stranded" and if the varAllele columns was reporting the allele with respect to the strand. Just to clarify, this would be the strand of the CDS where the translation if happening, not the strand specified in the input GRanges object. For example in Jeremiah's example, we would get something like: > predictCoding(GR, txdb, Hsapiens, DNAStringSet("T")) GRanges with 4 ranges and 12 elementMetadata cols: seqnames ranges strand | varAllele ... <rle> <iranges> <rle> | <dnastringset> ... [1] chr17 [63554271, 63554271] - | A ... [2] chr17 [63554271, 63554271] - | A ... [3] chr17 [63554271, 63554271] - | A ... [4] chr17 [63554271, 63554271] - | A ... So what we see in the varAllele column would be consitent with what we see in the refCodon and varCodon columns. Cheers, H. > > Sean > >> thanks in advance, >> >> Jeremiah >> >> here is my session info incase it is needed: >> >>> sessionInfo() >> R Under development (unstable) (2012-04-11 r58985) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 >> [2] GenomicFeatures_1.8.1 >> [3] AnnotationDbi_1.18.0 >> [4] Biobase_2.16.0 >> [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17 >> [6] BSgenome_1.24.0 >> [7] VariantAnnotation_1.2.5 >> [8] Rsamtools_1.8.1 >> [9] Biostrings_2.24.1 >> [10] GenomicRanges_1.8.3 >> [11] IRanges_1.14.2 >> [12] BiocGenerics_0.2.0 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.12.0 bitops_1.0-4.1 compiler_2.16.0 >> DBI_0.2-5 >> [5] grid_2.16.0 lattice_0.20-6 Matrix_1.0-6 >> RCurl_1.91-1 >> [9] RSQLite_0.11.1 rtracklayer_1.16.1 snpStats_1.6.0 >> splines_2.16.0 >> [13] stats4_2.16.0 survival_2.36-12 tools_2.16.0 >> XML_3.9-4 >> [17] zlibbioc_1.2.0 >> >> -- >> Jeremiah Degenhardt, Ph.D. >> Computational Biologist >> Bioinformatics and Computational Biology >> Genentech, Inc. >> degenhardt.jeremiah at gene.com >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.0 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve, I think this would be a very useful addition as well. It would make the results much more clear. best, J On Apr 20, 2012 1:50 PM, "Hervé Pagès" <hpages@fhcrc.org> wrote: > Hi Jeremiah and other coding predictors, > > On 04/19/2012 06:59 PM, Sean Davis wrote: > >> On Thu, Apr 19, 2012 at 9:46 PM, Jeremiah Degenhardt >> <degenhardt.jeremiah@gene.com> wrote: >> >>> Hello BioC, >>> >>> I am using the predictCoding function in the VariantAnnotation package >>> and have run into a few issues. >>> >>> The first issue that I found is a small one. While I can supply any >>> txdb object that I want to the predictCoding, the function seems to >>> still be looking for TxDb.Hsapiens.UCSC.hg19.**knownGene. Running >>> predict coding without this package installed results in the following >>> error: >>> >>> Error in isCircular(TxDb.Hsapiens.UCSC.**hg19.knownGene) : >>> error in evaluating the argument 'x' in selecting a method for >>> function 'isCircular': Error: object >>> 'TxDb.Hsapiens.UCSC.hg19.**knownGene' not found >>> >>> The second set of issues are a little more subtle. It seems that >>> predictCoding is not properly handling the strand of the variants, at >>> least in my opinion. >>> >>> If I provide an unstranded GRanges with variants that overlap a gene >>> on the negative strand, the predictCoding function will not reverse >>> complement the varAllele resulting in an incorrect functional >>> consequence prediction. Here is some code to generate an example >>> >>> library(VariantAnnotation) >>> library(BSgenome.Hsapiens.**UCSC.hg19) >>> library(TxDb.Hsapiens.UCSC.**hg19.knownGene) >>> txdb<- TxDb.Hsapiens.UCSC.hg19.**knownGene >>> GR<- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), >>> variant = DNAStringSet("T")) >>> predictCoding(GR, txdb, Hsapiens, values(GR)$variant) >>> >>> GRanges with 4 ranges and 13 elementMetadata cols: >>> seqnames ranges strand | variant >>> varAllele >>> <rle> <iranges> <rle> |<dnastringset> >>> <dnastringset> >>> [1] chr17 [63554271, 63554271] * | T >>> T >>> [2] chr17 [63554271, 63554271] * | T >>> T >>> [3] chr17 [63554271, 63554271] * | T >>> T >>> [4] chr17 [63554271, 63554271] * | T >>> T >>> cdsLoc proteinLoc queryID txID >>> cdsID >>> <iranges> <compressedintegerlist> <integer> <character> >>> <integer> >>> [1] [468, 468] 156 1 65536 >>> 193561 >>> [2] [468, 468] 156 1 65537 >>> 193561 >>> [3] [468, 468] 156 1 65538 >>> 193561 >>> [4] [468, 468] 156 1 65539 >>> 193564 >>> geneID consequence refCodon varCodon >>> refAA >>> <character> <factor> <dnastringset> <dnastringset> >>> <aastringset> >>> [1] 8313 synonymous TAC TAT >>> Y >>> [2] 8313 synonymous TAC TAT >>> Y >>> [3] 8313 synonymous TAC TAT >>> Y >>> [4] 8313 synonymous TAC TAT >>> Y >>> varAA >>> <aastringset> >>> [1] Y >>> [2] Y >>> [3] Y >>> [4] Y >>> --- >>> seqlengths: >>> chr17 >>> NA >>> >>> Running this the function predicts this to be TAC> TAT, a synonymous >>> change. However, as the gene is on the negative strand the actual >>> result of this variant should be TAC> TAA or a stop mutation. >>> >>> If I instead give predictCoding a stranded GRanges to let the function >>> know that the base reported is with respect to the positive strand I >>> get this result: >>> >>> GR<- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), >>>> strand="+", variant = DNAStringSet("T")) >>>> predictCoding(GR, txdb, Hsapiens, values(GR)$variant) >>>> >>> GRanges with 0 ranges and 0 elementMetadata cols: >>> >>> seqnames ranges strand >>> <rle> <iranges> <rle> >>> --- >>> seqlengths: >>> >>> >>> Now instead of reverse complementing the varAllele, and giving me the >>> correct annotation, the function simply misses that the mutation >>> overlaps with any gene and returns an empty GRanges. >>> >>> This issue is not really a bug in predictCoding, but results from what >>> is (again in my opinion) an unfortunate treatment of the negative and >>> positive strand as separate entities in GRanges. From a biological >>> prospective this almost never makes sense and is not how I would >>> expect the functions in GRanges to behave. Just because a gene is >>> annotated on the negative strand does not mean that I don't want to >>> know it overlaps with a gene on the positive strand. Maybe there are >>> cases where this is the desired behavior, so it might make sense to >>> add this a switch you could turn on or off but in most cases this will >>> simply lead to erroneous results such as the one above. I think the >>> most useful information that comes from knowing the strand is knowing >>> when you need to reverse complement a sequence for the underlying >>> position of interest. >>> >>> A third possible option here would be to call the function twice, once >>> with for the positive strand and once for the negative strand with the >>> variants reverse complemented and then merge the results. This >>> however, seems quite inefficient and not like something that I should >>> have to do. >>> >>> > From my perspective the proper behavior of predictCoding with respect >>> to strand would be to treat unstranded GRanges as positive strand as >>> this is the reference strand for things like the genome builds and vcf >>> files. Then the variants should overlap positive and negative strand >>> genes and should be reverse complemented for consequence prediction on >>> the negative strand genes. It should also allow overlap of negative >>> and positive stranded variants with both negative and positive >>> stranded genes, but properly reverse complement the variant in each >>> case to get the proper consequence. >>> >> >> I agree with Jeremiah that the treatment of unstranded variants should >> be to default to treating them as on the positive strand and reverse >> complement them for negative strand genes. All other variant >> prediction softwares that I know of make this assumption and the VCF >> format itself seems to make something of an implicit assumption on >> this point. >> > > That makes a lot of sense. > > In addition I think it would be nice if the GRanges object returned > by predictCoding() was "stranded" and if the varAllele columns was > reporting the allele with respect to the strand. Just to clarify, this > would be the strand of the CDS where the translation if happening, not > the strand specified in the input GRanges object. For example in > Jeremiah's example, we would get something like: > > > predictCoding(GR, txdb, Hsapiens, DNAStringSet("T")) > GRanges with 4 ranges and 12 elementMetadata cols: > seqnames ranges strand | varAllele ... > <rle> <iranges> <rle> | <dnastringset> ... > [1] chr17 [63554271, 63554271] - | A ... > [2] chr17 [63554271, 63554271] - | A ... > [3] chr17 [63554271, 63554271] - | A ... > [4] chr17 [63554271, 63554271] - | A ... > > So what we see in the varAllele column would be consitent with what > we see in the refCodon and varCodon columns. > > Cheers, > H. > > >> Sean >> >> thanks in advance, >>> >>> Jeremiah >>> >>> here is my session info incase it is needed: >>> >>> sessionInfo() >>>> >>> R Under development (unstable) (2012-04-11 r58985) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] TxDb.Hsapiens.UCSC.hg19.**knownGene_2.7.1 >>> [2] GenomicFeatures_1.8.1 >>> [3] AnnotationDbi_1.18.0 >>> [4] Biobase_2.16.0 >>> [5] BSgenome.Hsapiens.UCSC.hg19_1.**3.17 >>> [6] BSgenome_1.24.0 >>> [7] VariantAnnotation_1.2.5 >>> [8] Rsamtools_1.8.1 >>> [9] Biostrings_2.24.1 >>> [10] GenomicRanges_1.8.3 >>> [11] IRanges_1.14.2 >>> [12] BiocGenerics_0.2.0 >>> >>> loaded via a namespace (and not attached): >>> [1] biomaRt_2.12.0 bitops_1.0-4.1 compiler_2.16.0 >>> DBI_0.2-5 >>> [5] grid_2.16.0 lattice_0.20-6 Matrix_1.0-6 >>> RCurl_1.91-1 >>> [9] RSQLite_0.11.1 rtracklayer_1.16.1 snpStats_1.6.0 >>> splines_2.16.0 >>> [13] stats4_2.16.0 survival_2.36-12 tools_2.16.0 >>> XML_3.9-4 >>> [17] zlibbioc_1.2.0 >>> >>> -- >>> Jeremiah Degenhardt, Ph.D. >>> Computational Biologist >>> Bioinformatics and Computational Biology >>> Genentech, Inc. >>> degenhardt.jeremiah@gene.com >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: http://news.gmane.org/gmane.** >>> science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor=""> >>> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 12.0 years ago Jeremiah Degenhardt ▴ 50

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

On Thu, Apr 19, 2012 at 6:46 PM, Jeremiah Degenhardt < degenhardt.jeremiah@gene.com> wrote: > Hello BioC, > > I am using the predictCoding function in the VariantAnnotation package > and have run into a few issues. > > The first issue that I found is a small one. While I can supply any > txdb object that I want to the predictCoding, the function seems to > still be looking for TxDb.Hsapiens.UCSC.hg19.knownGene. Running > predict coding without this package installed results in the following > error: > > Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) : > error in evaluating the argument 'x' in selecting a method for > function 'isCircular': Error: object > 'TxDb.Hsapiens.UCSC.hg19.knownGene' not found > > The second set of issues are a little more subtle. It seems that > predictCoding is not properly handling the strand of the variants, at > least in my opinion. > > If I provide an unstranded GRanges with variants that overlap a gene > on the negative strand, the predictCoding function will not reverse > complement the varAllele resulting in an incorrect functional > consequence prediction. Here is some code to generate an example > > library(VariantAnnotation) > library(BSgenome.Hsapiens.UCSC.hg19) > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), > variant = DNAStringSet("T")) > predictCoding(GR, txdb, Hsapiens, values(GR)$variant) > > GRanges with 4 ranges and 13 elementMetadata cols: > seqnames ranges strand | variant > varAllele > <rle> <iranges> <rle> | <dnastringset> > <dnastringset> > [1] chr17 [63554271, 63554271] * | T > T > [2] chr17 [63554271, 63554271] * | T > T > [3] chr17 [63554271, 63554271] * | T > T > [4] chr17 [63554271, 63554271] * | T > T > cdsLoc proteinLoc queryID txID > cdsID > <iranges> <compressedintegerlist> <integer> <character> > <integer> > [1] [468, 468] 156 1 65536 > 193561 > [2] [468, 468] 156 1 65537 > 193561 > [3] [468, 468] 156 1 65538 > 193561 > [4] [468, 468] 156 1 65539 > 193564 > geneID consequence refCodon varCodon > refAA > <character> <factor> <dnastringset> <dnastringset> > <aastringset> > [1] 8313 synonymous TAC TAT > Y > [2] 8313 synonymous TAC TAT > Y > [3] 8313 synonymous TAC TAT > Y > [4] 8313 synonymous TAC TAT > Y > varAA > <aastringset> > [1] Y > [2] Y > [3] Y > [4] Y > --- > seqlengths: > chr17 > NA > > Running this the function predicts this to be TAC> TAT, a synonymous > change. However, as the gene is on the negative strand the actual > result of this variant should be TAC> TAA or a stop mutation. > > If I instead give predictCoding a stranded GRanges to let the function > know that the base reported is with respect to the positive strand I > get this result: > > > GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), > strand="+", variant = DNAStringSet("T")) > > predictCoding(GR, txdb, Hsapiens, values(GR)$variant) > GRanges with 0 ranges and 0 elementMetadata cols: > > seqnames ranges strand > <rle> <iranges> <rle> > --- > seqlengths: > > > Now instead of reverse complementing the varAllele, and giving me the > correct annotation, the function simply misses that the mutation > overlaps with any gene and returns an empty GRanges. > > This issue is not really a bug in predictCoding, but results from what > is (again in my opinion) an unfortunate treatment of the negative and > positive strand as separate entities in GRanges. From a biological > prospective this almost never makes sense and is not how I would > expect the functions in GRanges to behave. Just because a gene is > annotated on the negative strand does not mean that I don't want to > know it overlaps with a gene on the positive strand. Maybe there are > cases where this is the desired behavior, so it might make sense to > add this a switch you could turn on or off but in most cases this will > simply lead to erroneous results such as the one above. There is an "ignore.strand" argument to findOverlaps, so we have a switch. I have always thought that strand should be ignored by default in operations like overlap detection, and only considered as a "direction" rather than as separate in space. It's very useful for resize() and flank() to consider strand, but not so useful for findOverlaps. The ignore.strand=FALSE in those cases default would qualify for the eight circle if there were a Bioc Inferno book. It's only the default that I argue with though, having the capability to consider strand is useful. > I think the > most useful information that comes from knowing the strand is knowing > when you need to reverse complement a sequence for the underlying > position of interest. > > A third possible option here would be to call the function twice, once > with for the positive strand and once for the negative strand with the > variants reverse complemented and then merge the results. This > however, seems quite inefficient and not like something that I should > have to do. > > >From my perspective the proper behavior of predictCoding with respect > to strand would be to treat unstranded GRanges as positive strand as > this is the reference strand for things like the genome builds and vcf > files. Then the variants should overlap positive and negative strand > genes and should be reverse complemented for consequence prediction on > the negative strand genes. It should also allow overlap of negative > and positive stranded variants with both negative and positive > stranded genes, but properly reverse complement the variant in each > case to get the proper consequence. > > thanks in advance, > > Jeremiah > > here is my session info incase it is needed: > > > sessionInfo() > R Under development (unstable) (2012-04-11 r58985) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 > [2] GenomicFeatures_1.8.1 > [3] AnnotationDbi_1.18.0 > [4] Biobase_2.16.0 > [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17 > [6] BSgenome_1.24.0 > [7] VariantAnnotation_1.2.5 > [8] Rsamtools_1.8.1 > [9] Biostrings_2.24.1 > [10] GenomicRanges_1.8.3 > [11] IRanges_1.14.2 > [12] BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.12.0 bitops_1.0-4.1 compiler_2.16.0 > DBI_0.2-5 > [5] grid_2.16.0 lattice_0.20-6 Matrix_1.0-6 > RCurl_1.91-1 > [9] RSQLite_0.11.1 rtracklayer_1.16.1 snpStats_1.6.0 > splines_2.16.0 > [13] stats4_2.16.0 survival_2.36-12 tools_2.16.0 > XML_3.9-4 > [17] zlibbioc_1.2.0 > > -- > Jeremiah Degenhardt, Ph.D. > Computational Biologist > Bioinformatics and Computational Biology > Genentech, Inc. > degenhardt.jeremiah@gene.com > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 12.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

> a Bioc Inferno book This is one of the better ideas to be bandied about lately... On Fri, Apr 20, 2012 at 5:43 AM, Michael Lawrence <lawrence.michael@gene.com> wrote: > On Thu, Apr 19, 2012 at 6:46 PM, Jeremiah Degenhardt < > degenhardt.jeremiah@gene.com> wrote: > > > Hello BioC, > > > > I am using the predictCoding function in the VariantAnnotation package > > and have run into a few issues. > > > > The first issue that I found is a small one. While I can supply any > > txdb object that I want to the predictCoding, the function seems to > > still be looking for TxDb.Hsapiens.UCSC.hg19.knownGene. Running > > predict coding without this package installed results in the following > > error: > > > > Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) : > > error in evaluating the argument 'x' in selecting a method for > > function 'isCircular': Error: object > > 'TxDb.Hsapiens.UCSC.hg19.knownGene' not found > > > > The second set of issues are a little more subtle. It seems that > > predictCoding is not properly handling the strand of the variants, at > > least in my opinion. > > > > If I provide an unstranded GRanges with variants that overlap a gene > > on the negative strand, the predictCoding function will not reverse > > complement the varAllele resulting in an incorrect functional > > consequence prediction. Here is some code to generate an example > > > > library(VariantAnnotation) > > library(BSgenome.Hsapiens.UCSC.hg19) > > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > > GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), > > variant = DNAStringSet("T")) > > predictCoding(GR, txdb, Hsapiens, values(GR)$variant) > > > > GRanges with 4 ranges and 13 elementMetadata cols: > > seqnames ranges strand | variant > > varAllele > > <rle> <iranges> <rle> | <dnastringset> > > <dnastringset> > > [1] chr17 [63554271, 63554271] * | T > > T > > [2] chr17 [63554271, 63554271] * | T > > T > > [3] chr17 [63554271, 63554271] * | T > > T > > [4] chr17 [63554271, 63554271] * | T > > T > > cdsLoc proteinLoc queryID txID > > cdsID > > <iranges> <compressedintegerlist> <integer> <character> > > <integer> > > [1] [468, 468] 156 1 65536 > > 193561 > > [2] [468, 468] 156 1 65537 > > 193561 > > [3] [468, 468] 156 1 65538 > > 193561 > > [4] [468, 468] 156 1 65539 > > 193564 > > geneID consequence refCodon varCodon > > refAA > > <character> <factor> <dnastringset> <dnastringset> > > <aastringset> > > [1] 8313 synonymous TAC TAT > > Y > > [2] 8313 synonymous TAC TAT > > Y > > [3] 8313 synonymous TAC TAT > > Y > > [4] 8313 synonymous TAC TAT > > Y > > varAA > > <aastringset> > > [1] Y > > [2] Y > > [3] Y > > [4] Y > > --- > > seqlengths: > > chr17 > > NA > > > > Running this the function predicts this to be TAC> TAT, a synonymous > > change. However, as the gene is on the negative strand the actual > > result of this variant should be TAC> TAA or a stop mutation. > > > > If I instead give predictCoding a stranded GRanges to let the function > > know that the base reported is with respect to the positive strand I > > get this result: > > > > > GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), > > strand="+", variant = DNAStringSet("T")) > > > predictCoding(GR, txdb, Hsapiens, values(GR)$variant) > > GRanges with 0 ranges and 0 elementMetadata cols: > > > > seqnames ranges strand > > <rle> <iranges> <rle> > > --- > > seqlengths: > > > > > > Now instead of reverse complementing the varAllele, and giving me the > > correct annotation, the function simply misses that the mutation > > overlaps with any gene and returns an empty GRanges. > > > > This issue is not really a bug in predictCoding, but results from what > > is (again in my opinion) an unfortunate treatment of the negative and > > positive strand as separate entities in GRanges. From a biological > > prospective this almost never makes sense and is not how I would > > expect the functions in GRanges to behave. Just because a gene is > > annotated on the negative strand does not mean that I don't want to > > know it overlaps with a gene on the positive strand. Maybe there are > > cases where this is the desired behavior, so it might make sense to > > add this a switch you could turn on or off but in most cases this will > > simply lead to erroneous results such as the one above. > > > There is an "ignore.strand" argument to findOverlaps, so we have a switch. > I have always thought that strand should be ignored by default in > operations like overlap detection, and only considered as a "direction" > rather than as separate in space. It's very useful for resize() and flank() > to consider strand, but not so useful for findOverlaps. The > ignore.strand=FALSE in those cases default would qualify for the eight > circle if there were a Bioc Inferno book. It's only the default that I > argue with though, having the capability to consider strand is useful. > > > > I think the > > most useful information that comes from knowing the strand is knowing > > when you need to reverse complement a sequence for the underlying > > position of interest. > > > > A third possible option here would be to call the function twice, once > > with for the positive strand and once for the negative strand with the > > variants reverse complemented and then merge the results. This > > however, seems quite inefficient and not like something that I should > > have to do. > > > > >From my perspective the proper behavior of predictCoding with respect > > to strand would be to treat unstranded GRanges as positive strand as > > this is the reference strand for things like the genome builds and vcf > > files. Then the variants should overlap positive and negative strand > > genes and should be reverse complemented for consequence prediction on > > the negative strand genes. It should also allow overlap of negative > > and positive stranded variants with both negative and positive > > stranded genes, but properly reverse complement the variant in each > > case to get the proper consequence. > > > > thanks in advance, > > > > Jeremiah > > > > here is my session info incase it is needed: > > > > > sessionInfo() > > R Under development (unstable) (2012-04-11 r58985) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > > [7] LC_PAPER=C LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 > > [2] GenomicFeatures_1.8.1 > > [3] AnnotationDbi_1.18.0 > > [4] Biobase_2.16.0 > > [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17 > > [6] BSgenome_1.24.0 > > [7] VariantAnnotation_1.2.5 > > [8] Rsamtools_1.8.1 > > [9] Biostrings_2.24.1 > > [10] GenomicRanges_1.8.3 > > [11] IRanges_1.14.2 > > [12] BiocGenerics_0.2.0 > > > > loaded via a namespace (and not attached): > > [1] biomaRt_2.12.0 bitops_1.0-4.1 compiler_2.16.0 > > DBI_0.2-5 > > [5] grid_2.16.0 lattice_0.20-6 Matrix_1.0-6 > > RCurl_1.91-1 > > [9] RSQLite_0.11.1 rtracklayer_1.16.1 snpStats_1.6.0 > > splines_2.16.0 > > [13] stats4_2.16.0 survival_2.36-12 tools_2.16.0 > > XML_3.9-4 > > [17] zlibbioc_1.2.0 > > > > -- > > Jeremiah Degenhardt, Ph.D. > > Computational Biologist > > Bioinformatics and Computational Biology > > Genentech, Inc. > > degenhardt.jeremiah@gene.com > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 12.0 years ago Tim Triche ★ 4.2k

0

Entering edit mode

I meant to put this in my initial reply, but I firmly agree with Michael Lawrence. The SummarizedExperiment class is a great thing, and subsetting it by (say) my.SE[ a.GRanges, some.patients ] is unbelievably convenient, especially if it needs to be done repeatedly, or by items in a GRangesList. The strand issue is a nasty gotcha sometimes. just MHO though. On Fri, Apr 20, 2012 at 7:25 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > > a Bioc Inferno book > > This is one of the better ideas to be bandied about lately... > > > > > On Fri, Apr 20, 2012 at 5:43 AM, Michael Lawrence < > lawrence.michael@gene.com> wrote: > >> On Thu, Apr 19, 2012 at 6:46 PM, Jeremiah Degenhardt < >> degenhardt.jeremiah@gene.com> wrote: >> >> > Hello BioC, >> > >> > I am using the predictCoding function in the VariantAnnotation package >> > and have run into a few issues. >> > >> > The first issue that I found is a small one. While I can supply any >> > txdb object that I want to the predictCoding, the function seems to >> > still be looking for TxDb.Hsapiens.UCSC.hg19.knownGene. Running >> > predict coding without this package installed results in the following >> > error: >> > >> > Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) : >> > error in evaluating the argument 'x' in selecting a method for >> > function 'isCircular': Error: object >> > 'TxDb.Hsapiens.UCSC.hg19.knownGene' not found >> > >> > The second set of issues are a little more subtle. It seems that >> > predictCoding is not properly handling the strand of the variants, at >> > least in my opinion. >> > >> > If I provide an unstranded GRanges with variants that overlap a gene >> > on the negative strand, the predictCoding function will not reverse >> > complement the varAllele resulting in an incorrect functional >> > consequence prediction. Here is some code to generate an example >> > >> > library(VariantAnnotation) >> > library(BSgenome.Hsapiens.UCSC.hg19) >> > library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >> > GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), >> > variant = DNAStringSet("T")) >> > predictCoding(GR, txdb, Hsapiens, values(GR)$variant) >> > >> > GRanges with 4 ranges and 13 elementMetadata cols: >> > seqnames ranges strand | variant >> > varAllele >> > <rle> <iranges> <rle> | <dnastringset> >> > <dnastringset> >> > [1] chr17 [63554271, 63554271] * | T >> > T >> > [2] chr17 [63554271, 63554271] * | T >> > T >> > [3] chr17 [63554271, 63554271] * | T >> > T >> > [4] chr17 [63554271, 63554271] * | T >> > T >> > cdsLoc proteinLoc queryID txID >> > cdsID >> > <iranges> <compressedintegerlist> <integer> <character> >> > <integer> >> > [1] [468, 468] 156 1 65536 >> > 193561 >> > [2] [468, 468] 156 1 65537 >> > 193561 >> > [3] [468, 468] 156 1 65538 >> > 193561 >> > [4] [468, 468] 156 1 65539 >> > 193564 >> > geneID consequence refCodon varCodon >> > refAA >> > <character> <factor> <dnastringset> <dnastringset> >> > <aastringset> >> > [1] 8313 synonymous TAC TAT >> > Y >> > [2] 8313 synonymous TAC TAT >> > Y >> > [3] 8313 synonymous TAC TAT >> > Y >> > [4] 8313 synonymous TAC TAT >> > Y >> > varAA >> > <aastringset> >> > [1] Y >> > [2] Y >> > [3] Y >> > [4] Y >> > --- >> > seqlengths: >> > chr17 >> > NA >> > >> > Running this the function predicts this to be TAC> TAT, a synonymous >> > change. However, as the gene is on the negative strand the actual >> > result of this variant should be TAC> TAA or a stop mutation. >> > >> > If I instead give predictCoding a stranded GRanges to let the function >> > know that the base reported is with respect to the positive strand I >> > get this result: >> > >> > > GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), >> > strand="+", variant = DNAStringSet("T")) >> > > predictCoding(GR, txdb, Hsapiens, values(GR)$variant) >> > GRanges with 0 ranges and 0 elementMetadata cols: >> > >> > seqnames ranges strand >> > <rle> <iranges> <rle> >> > --- >> > seqlengths: >> > >> > >> > Now instead of reverse complementing the varAllele, and giving me the >> > correct annotation, the function simply misses that the mutation >> > overlaps with any gene and returns an empty GRanges. >> > >> > This issue is not really a bug in predictCoding, but results from what >> > is (again in my opinion) an unfortunate treatment of the negative and >> > positive strand as separate entities in GRanges. From a biological >> > prospective this almost never makes sense and is not how I would >> > expect the functions in GRanges to behave. Just because a gene is >> > annotated on the negative strand does not mean that I don't want to >> > know it overlaps with a gene on the positive strand. Maybe there are >> > cases where this is the desired behavior, so it might make sense to >> > add this a switch you could turn on or off but in most cases this will >> > simply lead to erroneous results such as the one above. >> >> >> There is an "ignore.strand" argument to findOverlaps, so we have a switch. >> I have always thought that strand should be ignored by default in >> operations like overlap detection, and only considered as a "direction" >> rather than as separate in space. It's very useful for resize() and >> flank() >> to consider strand, but not so useful for findOverlaps. The >> ignore.strand=FALSE in those cases default would qualify for the eight >> circle if there were a Bioc Inferno book. It's only the default that I >> argue with though, having the capability to consider strand is useful. >> >> >> > I think the >> > most useful information that comes from knowing the strand is knowing >> > when you need to reverse complement a sequence for the underlying >> > position of interest. >> > >> > A third possible option here would be to call the function twice, once >> > with for the positive strand and once for the negative strand with the >> > variants reverse complemented and then merge the results. This >> > however, seems quite inefficient and not like something that I should >> > have to do. >> > >> > >From my perspective the proper behavior of predictCoding with respect >> > to strand would be to treat unstranded GRanges as positive strand as >> > this is the reference strand for things like the genome builds and vcf >> > files. Then the variants should overlap positive and negative strand >> > genes and should be reverse complemented for consequence prediction on >> > the negative strand genes. It should also allow overlap of negative >> > and positive stranded variants with both negative and positive >> > stranded genes, but properly reverse complement the variant in each >> > case to get the proper consequence. >> > >> > thanks in advance, >> > >> > Jeremiah >> > >> > here is my session info incase it is needed: >> > >> > > sessionInfo() >> > R Under development (unstable) (2012-04-11 r58985) >> > Platform: x86_64-unknown-linux-gnu (64-bit) >> > >> > locale: >> > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> > [7] LC_PAPER=C LC_NAME=C >> > [9] LC_ADDRESS=C LC_TELEPHONE=C >> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> > >> > attached base packages: >> > [1] stats graphics grDevices utils datasets methods base >> > >> > other attached packages: >> > [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 >> > [2] GenomicFeatures_1.8.1 >> > [3] AnnotationDbi_1.18.0 >> > [4] Biobase_2.16.0 >> > [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17 >> > [6] BSgenome_1.24.0 >> > [7] VariantAnnotation_1.2.5 >> > [8] Rsamtools_1.8.1 >> > [9] Biostrings_2.24.1 >> > [10] GenomicRanges_1.8.3 >> > [11] IRanges_1.14.2 >> > [12] BiocGenerics_0.2.0 >> > >> > loaded via a namespace (and not attached): >> > [1] biomaRt_2.12.0 bitops_1.0-4.1 compiler_2.16.0 >> > DBI_0.2-5 >> > [5] grid_2.16.0 lattice_0.20-6 Matrix_1.0-6 >> > RCurl_1.91-1 >> > [9] RSQLite_0.11.1 rtracklayer_1.16.1 snpStats_1.6.0 >> > splines_2.16.0 >> > [13] stats4_2.16.0 survival_2.36-12 tools_2.16.0 >> > XML_3.9-4 >> > [17] zlibbioc_1.2.0 >> > >> > -- >> > Jeremiah Degenhardt, Ph.D. >> > Computational Biologist >> > Bioinformatics and Computational Biology >> > Genentech, Inc. >> > degenhardt.jeremiah@gene.com >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor@r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> > http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 12.0 years ago Tim Triche ★ 4.2k

0

Entering edit mode

On 04/20/2012 07:28 AM, Tim Triche, Jr. wrote: > I meant to put this in my initial reply, but I firmly agree with Michael > Lawrence. > > The SummarizedExperiment class is a great thing, and subsetting it by (say) > my.SE[ a.GRanges, some.patients ] is unbelievably convenient, especially > if it needs to be done repeatedly, or by items in a GRangesList. The > strand issue is a nasty gotcha sometimes. Hi Tim -- kind of a different thread here, but I wanted to say that yes, we've heard you. A problem is the many-to-many mapping that's possible, for instance for some idx, we really expect nrow(x[idx,]) == length(idx); there are additional complexities introduced by, e.g., a GRangesList and in the end the convenience for some use cases didn't seem to outweigh the ambiguity and complexity for the general case. Martin > > just MHO though. > > > > On Fri, Apr 20, 2012 at 7:25 AM, Tim Triche, Jr.<tim.triche at="" gmail.com="">wrote: > >>> a Bioc Inferno book >> >> This is one of the better ideas to be bandied about lately... >> >> >> >> >> On Fri, Apr 20, 2012 at 5:43 AM, Michael Lawrence< >> lawrence.michael at gene.com> wrote: >> >>> On Thu, Apr 19, 2012 at 6:46 PM, Jeremiah Degenhardt< >>> degenhardt.jeremiah at gene.com> wrote: >>> >>>> Hello BioC, >>>> >>>> I am using the predictCoding function in the VariantAnnotation package >>>> and have run into a few issues. >>>> >>>> The first issue that I found is a small one. While I can supply any >>>> txdb object that I want to the predictCoding, the function seems to >>>> still be looking for TxDb.Hsapiens.UCSC.hg19.knownGene. Running >>>> predict coding without this package installed results in the following >>>> error: >>>> >>>> Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) : >>>> error in evaluating the argument 'x' in selecting a method for >>>> function 'isCircular': Error: object >>>> 'TxDb.Hsapiens.UCSC.hg19.knownGene' not found >>>> >>>> The second set of issues are a little more subtle. It seems that >>>> predictCoding is not properly handling the strand of the variants, at >>>> least in my opinion. >>>> >>>> If I provide an unstranded GRanges with variants that overlap a gene >>>> on the negative strand, the predictCoding function will not reverse >>>> complement the varAllele resulting in an incorrect functional >>>> consequence prediction. Here is some code to generate an example >>>> >>>> library(VariantAnnotation) >>>> library(BSgenome.Hsapiens.UCSC.hg19) >>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>> txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene >>>> GR<- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), >>>> variant = DNAStringSet("T")) >>>> predictCoding(GR, txdb, Hsapiens, values(GR)$variant) >>>> >>>> GRanges with 4 ranges and 13 elementMetadata cols: >>>> seqnames ranges strand | variant >>>> varAllele >>>> <rle> <iranges> <rle> |<dnastringset> >>>> <dnastringset> >>>> [1] chr17 [63554271, 63554271] * | T >>>> T >>>> [2] chr17 [63554271, 63554271] * | T >>>> T >>>> [3] chr17 [63554271, 63554271] * | T >>>> T >>>> [4] chr17 [63554271, 63554271] * | T >>>> T >>>> cdsLoc proteinLoc queryID txID >>>> cdsID >>>> <iranges> <compressedintegerlist> <integer> <character> >>>> <integer> >>>> [1] [468, 468] 156 1 65536 >>>> 193561 >>>> [2] [468, 468] 156 1 65537 >>>> 193561 >>>> [3] [468, 468] 156 1 65538 >>>> 193561 >>>> [4] [468, 468] 156 1 65539 >>>> 193564 >>>> geneID consequence refCodon varCodon >>>> refAA >>>> <character> <factor> <dnastringset> <dnastringset> >>>> <aastringset> >>>> [1] 8313 synonymous TAC TAT >>>> Y >>>> [2] 8313 synonymous TAC TAT >>>> Y >>>> [3] 8313 synonymous TAC TAT >>>> Y >>>> [4] 8313 synonymous TAC TAT >>>> Y >>>> varAA >>>> <aastringset> >>>> [1] Y >>>> [2] Y >>>> [3] Y >>>> [4] Y >>>> --- >>>> seqlengths: >>>> chr17 >>>> NA >>>> >>>> Running this the function predicts this to be TAC> TAT, a synonymous >>>> change. However, as the gene is on the negative strand the actual >>>> result of this variant should be TAC> TAA or a stop mutation. >>>> >>>> If I instead give predictCoding a stranded GRanges to let the function >>>> know that the base reported is with respect to the positive strand I >>>> get this result: >>>> >>>>> GR<- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), >>>> strand="+", variant = DNAStringSet("T")) >>>>> predictCoding(GR, txdb, Hsapiens, values(GR)$variant) >>>> GRanges with 0 ranges and 0 elementMetadata cols: >>>> >>>> seqnames ranges strand >>>> <rle> <iranges> <rle> >>>> --- >>>> seqlengths: >>>> >>>> >>>> Now instead of reverse complementing the varAllele, and giving me the >>>> correct annotation, the function simply misses that the mutation >>>> overlaps with any gene and returns an empty GRanges. >>>> >>>> This issue is not really a bug in predictCoding, but results from what >>>> is (again in my opinion) an unfortunate treatment of the negative and >>>> positive strand as separate entities in GRanges. From a biological >>>> prospective this almost never makes sense and is not how I would >>>> expect the functions in GRanges to behave. Just because a gene is >>>> annotated on the negative strand does not mean that I don't want to >>>> know it overlaps with a gene on the positive strand. Maybe there are >>>> cases where this is the desired behavior, so it might make sense to >>>> add this a switch you could turn on or off but in most cases this will >>>> simply lead to erroneous results such as the one above. >>> >>> >>> There is an "ignore.strand" argument to findOverlaps, so we have a switch. >>> I have always thought that strand should be ignored by default in >>> operations like overlap detection, and only considered as a "direction" >>> rather than as separate in space. It's very useful for resize() and >>> flank() >>> to consider strand, but not so useful for findOverlaps. The >>> ignore.strand=FALSE in those cases default would qualify for the eight >>> circle if there were a Bioc Inferno book. It's only the default that I >>> argue with though, having the capability to consider strand is useful. >>> >>> >>>> I think the >>>> most useful information that comes from knowing the strand is knowing >>>> when you need to reverse complement a sequence for the underlying >>>> position of interest. >>>> >>>> A third possible option here would be to call the function twice, once >>>> with for the positive strand and once for the negative strand with the >>>> variants reverse complemented and then merge the results. This >>>> however, seems quite inefficient and not like something that I should >>>> have to do. >>>> >>>> > From my perspective the proper behavior of predictCoding with respect >>>> to strand would be to treat unstranded GRanges as positive strand as >>>> this is the reference strand for things like the genome builds and vcf >>>> files. Then the variants should overlap positive and negative strand >>>> genes and should be reverse complemented for consequence prediction on >>>> the negative strand genes. It should also allow overlap of negative >>>> and positive stranded variants with both negative and positive >>>> stranded genes, but properly reverse complement the variant in each >>>> case to get the proper consequence. >>>> >>>> thanks in advance, >>>> >>>> Jeremiah >>>> >>>> here is my session info incase it is needed: >>>> >>>>> sessionInfo() >>>> R Under development (unstable) (2012-04-11 r58985) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> other attached packages: >>>> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 >>>> [2] GenomicFeatures_1.8.1 >>>> [3] AnnotationDbi_1.18.0 >>>> [4] Biobase_2.16.0 >>>> [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17 >>>> [6] BSgenome_1.24.0 >>>> [7] VariantAnnotation_1.2.5 >>>> [8] Rsamtools_1.8.1 >>>> [9] Biostrings_2.24.1 >>>> [10] GenomicRanges_1.8.3 >>>> [11] IRanges_1.14.2 >>>> [12] BiocGenerics_0.2.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] biomaRt_2.12.0 bitops_1.0-4.1 compiler_2.16.0 >>>> DBI_0.2-5 >>>> [5] grid_2.16.0 lattice_0.20-6 Matrix_1.0-6 >>>> RCurl_1.91-1 >>>> [9] RSQLite_0.11.1 rtracklayer_1.16.1 snpStats_1.6.0 >>>> splines_2.16.0 >>>> [13] stats4_2.16.0 survival_2.36-12 tools_2.16.0 >>>> XML_3.9-4 >>>> [17] zlibbioc_1.2.0 >>>> >>>> -- >>>> Jeremiah Degenhardt, Ph.D. >>>> Computational Biologist >>>> Bioinformatics and Computational Biology >>>> Genentech, Inc. >>>> degenhardt.jeremiah at gene.com >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> >> -- >> *A model is a lie that helps you see the truth.* >> * >> * >> Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> >> > > -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD REPLY • link 12.0 years ago Martin Morgan 25k

0

Entering edit mode

> > > There is an "ignore.strand" argument to findOverlaps, so we have a switch. I > have always thought that strand should be ignored by default in operations > like overlap detection, and only considered as a "direction" rather than as > separate in space. It's very useful for resize() and flank() to consider > strand, but not so useful for findOverlaps. The ignore.strand=FALSE in those > cases default would qualify for the eight circle if there were a Bioc > Inferno book. It's only the default that I argue with though, having the > capability to consider strand is useful. I had forgotten about the ignore.strand option, thanks for the reminder Michael. So, given that it's there I agree with you fully. It seems he default should be changed to TRUE for the Overlap functions and the precedes and follows as well. Note however, that this would not fully correct the issue in the predictCoding function as the function still needs to correctly reverse complement the varAllele to get the annotation correct. As a further note on how big of an issue this is, if you go to the BioC home page and look at the tutorial on "Using Bioconductor to annotate genetic variants" you will find that the example makes this exact mistake. The variants in the VCF are unstranded and two of the genes in the example are negative strand and one is positive. Following the code you will get incorrect annotations for all variants on the negative strand genes. Jeremiah -- Jeremiah Degenhardt, Ph.D. Computational Biologist Bioinformatics and Computational Biology Genentech, Inc. degenhardt.jeremiah at gene.com

ADD REPLY • link 12.0 years ago Jeremiah Degenhardt ▴ 50

0

Entering edit mode

On Fri, Apr 20, 2012 at 8:58 AM, Jeremiah Degenhardt < degenhardt.jeremiah@gene.com> wrote: > > > > > > There is an "ignore.strand" argument to findOverlaps, so we have a > switch. I > > have always thought that strand should be ignored by default in > operations > > like overlap detection, and only considered as a "direction" rather than > as > > separate in space. It's very useful for resize() and flank() to consider > > strand, but not so useful for findOverlaps. The ignore.strand=FALSE in > those > > cases default would qualify for the eight circle if there were a Bioc > > Inferno book. It's only the default that I argue with though, having the > > capability to consider strand is useful. > > I had forgotten about the ignore.strand option, thanks for the > reminder Michael. So, given that it's there I agree with you fully. It > seems he default should be changed to TRUE for the Overlap functions > and the precedes and follows as well. > > Unfortunately damnation is permanent and it's unlikely we'll be able to modify that default now. > Note however, that this would not fully correct the issue in the > predictCoding function as the function still needs to correctly > reverse complement the varAllele to get the annotation correct. > > As a further note on how big of an issue this is, if you go to the > BioC home page and look at the tutorial on "Using Bioconductor to > annotate genetic variants" you will find that the example makes this > exact mistake. The variants in the VCF are unstranded and two of the > genes in the example are negative strand and one is positive. > Following the code you will get incorrect annotations for all variants > on the negative strand genes. > > Jeremiah > > > > -- > Jeremiah Degenhardt, Ph.D. > Computational Biologist > Bioinformatics and Computational Biology > Genentech, Inc. > degenhardt.jeremiah@gene.com > [[alternative HTML version deleted]]

ADD REPLY • link 12.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Could this be set in options() the way that stringsAsFactors is? R> options()$stringsAsFactors [1] FALSE Then it wouldn't much matter (other than making sure to set it explicitly in packages, vignettes, etc... one reason the latter are nice tests) On Fri, Apr 20, 2012 at 10:13 AM, Michael Lawrence < lawrence.michael@gene.com> wrote: > On Fri, Apr 20, 2012 at 8:58 AM, Jeremiah Degenhardt < > degenhardt.jeremiah@gene.com> wrote: > > > > > > > > > > There is an "ignore.strand" argument to findOverlaps, so we have a > > switch. I > > > have always thought that strand should be ignored by default in > > operations > > > like overlap detection, and only considered as a "direction" rather > than > > as > > > separate in space. It's very useful for resize() and flank() to > consider > > > strand, but not so useful for findOverlaps. The ignore.strand=FALSE in > > those > > > cases default would qualify for the eight circle if there were a Bioc > > > Inferno book. It's only the default that I argue with though, having > the > > > capability to consider strand is useful. > > > > I had forgotten about the ignore.strand option, thanks for the > > reminder Michael. So, given that it's there I agree with you fully. It > > seems he default should be changed to TRUE for the Overlap functions > > and the precedes and follows as well. > > > > > Unfortunately damnation is permanent and it's unlikely we'll be able to > modify that default now. > > > > Note however, that this would not fully correct the issue in the > > predictCoding function as the function still needs to correctly > > reverse complement the varAllele to get the annotation correct. > > > > As a further note on how big of an issue this is, if you go to the > > BioC home page and look at the tutorial on "Using Bioconductor to > > annotate genetic variants" you will find that the example makes this > > exact mistake. The variants in the VCF are unstranded and two of the > > genes in the example are negative strand and one is positive. > > Following the code you will get incorrect annotations for all variants > > on the negative strand genes. > > > > Jeremiah > > > > > > > > -- > > Jeremiah Degenhardt, Ph.D. > > Computational Biologist > > Bioinformatics and Computational Biology > > Genentech, Inc. > > degenhardt.jeremiah@gene.com > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 12.0 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Seems like people are piling up on the "ignore.strand=FALSE is a bad idea" bandwagon .. for what it's worth, I think the default to "honor the strand" in overlap queries is a sensible one, to me. -steve On Fri, Apr 20, 2012 at 11:58 AM, Jeremiah Degenhardt <degenhardt.jeremiah at="" gene.com=""> wrote: >> >> >> There is an "ignore.strand" argument to findOverlaps, so we have a switch. I >> have always thought that strand should be ignored by default in operations >> like overlap detection, and only considered as a "direction" rather than as >> separate in space. It's very useful for resize() and flank() to consider >> strand, but not so useful for findOverlaps. The ignore.strand=FALSE in those >> cases default would qualify for the eight circle if there were a Bioc >> Inferno book. It's only the default that I argue with though, having the >> capability to consider strand is useful. > > I had forgotten about the ignore.strand option, thanks for the > reminder Michael. So, given that it's there I agree with you fully. It > seems he default should be changed to TRUE for the Overlap functions > and the precedes and follows as well. > > Note however, that this would not fully correct the issue in the > predictCoding function as the function still needs to correctly > reverse complement the varAllele to get the annotation correct. > > As a further note on how big of an issue this is, if you go to the > BioC home page and look at the tutorial on "Using Bioconductor to > annotate genetic variants" you will find that the example makes this > exact mistake. The variants in the VCF are unstranded and two of the > genes in the example are negative strand and one is positive. > Following the code you will get incorrect annotations for all variants > on the negative strand genes. > > Jeremiah > > > > -- > Jeremiah Degenhardt, Ph.D. > Computational Biologist > Bioinformatics and Computational Biology > Genentech, Inc. > degenhardt.jeremiah at gene.com > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 12.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve, Out of curiosity, could you provide an example of an instance when you prefer the function to only return hits on the same strand? I have tried hard to come up with an example but can't think of one. It's probably due more to my background though... best, Jeremiah On Fri, Apr 20, 2012 at 10:07 AM, Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> wrote: > Seems like people are piling up on the "ignore.strand=FALSE is a bad > idea" bandwagon .. for what it's worth, I think the default to "honor > the strand" in overlap queries is a sensible one, to me. > > -steve > > On Fri, Apr 20, 2012 at 11:58 AM, Jeremiah Degenhardt > <degenhardt.jeremiah at="" gene.com=""> wrote: >>> >>> >>> There is an "ignore.strand" argument to findOverlaps, so we have a switch. I >>> have always thought that strand should be ignored by default in operations >>> like overlap detection, and only considered as a "direction" rather than as >>> separate in space. It's very useful for resize() and flank() to consider >>> strand, but not so useful for findOverlaps. The ignore.strand=FALSE in those >>> cases default would qualify for the eight circle if there were a Bioc >>> Inferno book. It's only the default that I argue with though, having the >>> capability to consider strand is useful. >> >> I had forgotten about the ignore.strand option, thanks for the >> reminder Michael. So, given that it's there I agree with you fully. It >> seems he default should be changed to TRUE for the Overlap functions >> and the precedes and follows as well. >> >> Note however, that this would not fully correct the issue in the >> predictCoding function as the function still needs to correctly >> reverse complement the varAllele to get the annotation correct. >> >> As a further note on how big of an issue this is, if you go to the >> BioC home page and look at the tutorial on "Using Bioconductor to >> annotate genetic variants" you will find that the example makes this >> exact mistake. The variants in the VCF are unstranded and two of the >> genes in the example are negative strand and one is positive. >> Following the code you will get incorrect annotations for all variants >> on the negative strand genes. >> >> Jeremiah >> >> >> >> -- >> Jeremiah Degenhardt, Ph.D. >> Computational Biologist >> Bioinformatics and Computational Biology >> Genentech, Inc. >> degenhardt.jeremiah at gene.com >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact -- Jeremiah Degenhardt, Ph.D. Computational Biologist Bioinformatics and Computational Biology Genentech, Inc. degenhardt.jeremiah at gene.com

ADD REPLY • link 12.0 years ago Jeremiah Degenhardt ▴ 50

0

Entering edit mode

Hi, On Fri, Apr 20, 2012 at 3:32 PM, Jeremiah Degenhardt <degenhardt.jeremiah at="" gene.com=""> wrote: > Hi Steve, > > Out of curiosity, could you provide an example of an instance when you > prefer the function to only return hits on the same strand? I have > tried hard to come up with an example but can't think of one. It's > probably due more to my background though... Sure ... the stars have aligned in such a way that I spend most of my time using different types of strand-specific rna sequencing data. In this case, when finding overlaps between reads and (annotated) genes, it's handy that it defaults to keep the strand info. -steve > > best, > > Jeremiah > > On Fri, Apr 20, 2012 at 10:07 AM, Steve Lianoglou > <mailinglist.honeypot at="" gmail.com=""> wrote: >> Seems like people are piling up on the "ignore.strand=FALSE is a bad >> idea" bandwagon .. for what it's worth, I think the default to "honor >> the strand" in overlap queries is a sensible one, to me. >> >> -steve >> >> On Fri, Apr 20, 2012 at 11:58 AM, Jeremiah Degenhardt >> <degenhardt.jeremiah at="" gene.com=""> wrote: >>>> >>>> >>>> There is an "ignore.strand" argument to findOverlaps, so we have a switch. I >>>> have always thought that strand should be ignored by default in operations >>>> like overlap detection, and only considered as a "direction" rather than as >>>> separate in space. It's very useful for resize() and flank() to consider >>>> strand, but not so useful for findOverlaps. The ignore.strand=FALSE in those >>>> cases default would qualify for the eight circle if there were a Bioc >>>> Inferno book. It's only the default that I argue with though, having the >>>> capability to consider strand is useful. >>> >>> I had forgotten about the ignore.strand option, thanks for the >>> reminder Michael. So, given that it's there I agree with you fully. It >>> seems he default should be changed to TRUE for the Overlap functions >>> and the precedes and follows as well. >>> >>> Note however, that this would not fully correct the issue in the >>> predictCoding function as the function still needs to correctly >>> reverse complement the varAllele to get the annotation correct. >>> >>> As a further note on how big of an issue this is, if you go to the >>> BioC home page and look at the tutorial on "Using Bioconductor to >>> annotate genetic variants" you will find that the example makes this >>> exact mistake. The variants in the VCF are unstranded and two of the >>> genes in the example are negative strand and one is positive. >>> Following the code you will get incorrect annotations for all variants >>> on the negative strand genes. >>> >>> Jeremiah >>> >>> >>> >>> -- >>> Jeremiah Degenhardt, Ph.D. >>> Computational Biologist >>> Bioinformatics and Computational Biology >>> Genentech, Inc. >>> degenhardt.jeremiah at gene.com >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> ?| Memorial Sloan-Kettering Cancer Center >> ?| Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact > > > > -- > Jeremiah Degenhardt, Ph.D. > Computational Biologist > Bioinformatics and Computational Biology > Genentech, Inc. > degenhardt.jeremiah at gene.com -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 12.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

bnorthoff@web.de ▴ 20

@bnorthoffwebde-5251

Last seen 9.6 years ago

Hi there, Jeremiah Degenhardt <degenhardt.jeremiah at="" ...=""> writes: > > Hello BioC, > > I am using the predictCoding function in the VariantAnnotation package > and have run into a few issues. > > The first issue that I found is a small one. While I can supply any > txdb object that I want to the predictCoding, the function seems to > still be looking for?TxDb.Hsapiens.UCSC.hg19.knownGene. Running > predict coding without this package installed results in the following > error: > > Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) : > ? error in evaluating the argument 'x' in selecting a method for > function 'isCircular': Error: object > 'TxDb.Hsapiens.UCSC.hg19.knownGene' not found Since I am using the library TxDb.Mmusculus.UCSC.mm9.knownGene this is a problem for my analysis of snps in mice. I get the same error by using this code: library(BSgenome.Mmusculus.UCSC.mm9) library(TxDb.Mmusculus.UCSC.mm9.knownGene) txdb = TxDb.Mmusculus.UCSC.mm9.knownGene snp1 = GRanges(seqnames="chr2", ranges=IRanges(start=28920573, width=1), strand="+", varAllele=DNAStringSet("A")) predictCoding(query=snp1, subject=txdb, seqSource=Mmusculus, varAllele=values(snp1)$varAllele) Is there any solution? Tanks a lot!

ADD COMMENT • link 12.0 years ago bnorthoff@web.de ▴ 20

0

Entering edit mode

On 04/25/2012 12:28 AM, Bernd wrote: > Hi there, > > Jeremiah Degenhardt<degenhardt.jeremiah at="" ...=""> writes: > >> Hello BioC, >> >> I am using the predictCoding function in the VariantAnnotation package >> and have run into a few issues. >> >> The first issue that I found is a small one. While I can supply any >> txdb object that I want to the predictCoding, the function seems to >> still be looking for TxDb.Hsapiens.UCSC.hg19.knownGene. Running >> predict coding without this package installed results in the following >> error: >> >> Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) : >> error in evaluating the argument 'x' in selecting a method for >> function 'isCircular': Error: object >> '' > Since I am using the library TxDb.Mmusculus.UCSC.mm9.knownGene this is a problem > for my analysis of snps in mice. I get the same error by using this code: > > library(BSgenome.Mmusculus.UCSC.mm9) > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > > txdb = TxDb.Mmusculus.UCSC.mm9.knownGene > > snp1 = GRanges(seqnames="chr2", ranges=IRanges(start=28920573, width=1), > strand="+", varAllele=DNAStringSet("A")) > > predictCoding(query=snp1, subject=txdb, seqSource=Mmusculus, > varAllele=values(snp1)$varAllele) > > Is there any solution? > Tanks a lot! > The txdb error has been fixed in VariantAnnotation 1.2.6 in release and 1.3.8 in devel. I'm currently working on the predictCoding strand issue and will post back here when it's fixed. Valerie > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi all, Thanks for the feedback and discussion on this topic. A patch for predictCoding() has been applied to 1.2.7 release/1.3.9 devel. In the devel branch I've created a subfunction of predictCoding() for more convenient testing. It accepts a GRangesList (expected to be the cdsbytx). The behavior is as follows, library(BSgenome.Hsapiens.UCSC.hg19) fun <- VariantAnnotation:::.predictCodingGRangesList cdsbytx <- GRangesList(tx1=GRanges(seqnames="chr1", IRanges(c(10001, 10010), width=5), strand="+", cds_rank=c(1, 2)), tx2=GRanges(seqnames="chr1", IRanges(c(10100, 10001), width=5), strand="-", cds_rank=c(1,2))) variant=DNAStringSet(c("G", "G", "C", "T", "G")) query <- GRanges(seqnames="chr1", ranges=IRanges(c(rep(10003, 3), 10011, 10101), width=1), strand=c("+", "-", "*", "*", "*"), variant=variant) names(query) <- LETTERS[1:5] coding <- fun(query, cdsbytx, Hsapiens, variant) > coding GRanges with 6 ranges and 13 elementMetadata cols: seqnames ranges strand | variant varAllele CDSLOC <rle> <iranges> <rle> | <dnastringset> <dnastringset> <iranges> A chr1 [10003, 10003] + | G G [3, 3] B chr1 [10003, 10003] - | G C [8, 8] C chr1 [10003, 10003] + | C C [3, 3] C chr1 [10003, 10003] - | C G [8, 8] D chr1 [10011, 10011] + | T T [7, 7] E chr1 [10101, 10101] - | G C [4, 4] PROTEINLOC QUERYID TXID CDSID GENEID <compressedintegerlist> <integer> <character> <character> <character> A 1 1 tx1 <na> <na> B 3 2 tx2 <na> <na> C 1 3 tx1 <na> <na> C 3 3 tx2 <na> <na> D 3 4 tx1 <na> <na> E 2 5 tx2 <na> <na> CONSEQUENCE REFCODON VARCODON REFAA VARAA <factor> <dnastringset> <dnastringset> <aastringset> <aastringset> A synonymous TAA TAG * * B nonsynonymous GTT GCT V A C nonsynonymous TAA TAC * Y C nonsynonymous GTT GGT V G D nonsynonymous CCT TCT P S E nonsynonymous GGG CGG G R Note the strand of the output reflects the strand of the subject that was hit. query "C" is unstranded and hits a subject on both the "+" and "-" strand. A few more changes you should be aware of in the devel branch, - Column names of the output have changed to be consistent with select() in AnnotationDbi. - If the variant is nonsynonymous and the VARCODON has a stop codon not present in the REFCODON the CONSEQUENCE is now 'nonsense' instead of 'nonsynonymous'. I'm working on annotating the loss of start codons as well. Valerie On 04/25/12 08:28, Valerie Obenchain wrote: > On 04/25/2012 12:28 AM, Bernd wrote: >> Hi there, >> >> Jeremiah Degenhardt<degenhardt.jeremiah at="" ...=""> writes: >> >>> Hello BioC, >>> >>> I am using the predictCoding function in the VariantAnnotation package >>> and have run into a few issues. >>> >>> The first issue that I found is a small one. While I can supply any >>> txdb object that I want to the predictCoding, the function seems to >>> still be looking for TxDb.Hsapiens.UCSC.hg19.knownGene. Running >>> predict coding without this package installed results in the following >>> error: >>> >>> Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) : >>> error in evaluating the argument 'x' in selecting a method for >>> function 'isCircular': Error: object >>> '' >> Since I am using the library TxDb.Mmusculus.UCSC.mm9.knownGene this >> is a problem >> for my analysis of snps in mice. I get the same error by using this >> code: >> >> library(BSgenome.Mmusculus.UCSC.mm9) >> library(TxDb.Mmusculus.UCSC.mm9.knownGene) >> >> txdb = TxDb.Mmusculus.UCSC.mm9.knownGene >> >> snp1 = GRanges(seqnames="chr2", ranges=IRanges(start=28920573, width=1), >> strand="+", varAllele=DNAStringSet("A")) >> >> predictCoding(query=snp1, subject=txdb, seqSource=Mmusculus, >> varAllele=values(snp1)$varAllele) >> >> Is there any solution? >> Tanks a lot! >> > > The txdb error has been fixed in VariantAnnotation 1.2.6 in release > and 1.3.8 in devel. I'm currently working on the predictCoding strand > issue and will post back here when it's fixed. > > Valerie >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.0 years ago Valerie Obenchain ★ 6.8k

Login before adding your answer.