DNAStringSet_translate error in predictCoding()
Joerg Linde ▴ 20
@joerg-linde-4182
Last seen 9.6 years ago
Dear Bioconductor team,

I have a problem with predictCoding() from the VariantAnnotation package. It throws the same error as described here:
https://stat.ethz.ch/pipermail/bioconductor/2012-November/048940.html

However, after reading my VCF, it clearly has a DNAStringSetList in its ALT variable. The problem remains when I use vcftools to remove indels from the VCF. As far as I can see, there are some ALTs with two possibilities. Is there anything else that could cause the problem?

I am also aware of this thread
https://stat.ethz.ch/pipermail/bioconductor/2012-October/048370.html
but I can't figure out how to remove the lines causing the problem.

Thank you very much,
Jörg

    vcf=readVcf("file.vcf","hg")
    coding <- predictCoding(vcf, txdb, seqSource=fa)
    Error in .Call2("DNAStringSet_translate", x, DNA_BASE_CODES, lkup, skipcode, :
      in 'x[[6655]]': not a base at pos 3

    > alt(vcf)
    DNAStringSetList of length 142721
    [[1]] C
    [[2]] T
    [[3]] G
    [[4]] G
    [[5]] G
    [[6]] C
    [[7]] C
    [[8]] A
    [[9]] G
    [[10]] C
    ..
    <142711 more elements>

    > sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
VariantAnnotation • 1.3k views
@valerie-obenchain-4275
Last seen 2.2 years ago
United States
Hi Jörg,

It looks like your sessionInfo() output was cut off, so I can't tell which version of VariantAnnotation you have.

Versions >= 1.10.0 detect structural variants and create either a CharacterList or a DNAStringSetList. Since you have a DNAStringSetList, all values should be valid bases.

Does this return TRUE?

    hasOnlyBaseLetters(unlist(alt(vcf)))

Are there any non-base characters in the matrix?

    alphabetFrequency(unlist(alt(vcf)))

To help further I'll need the version of VariantAnnotation and a reproducible example.

Valerie

--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
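For anyone following along without the original file, here is a minimal sketch of the two checks above on a toy object. It assumes a current Biostrings in which the DNAStringSetList() constructor is available; the toy ALT values and the names alt_toy, flat and non_base are made up for illustration and are not taken from the poster's VCF.

```r
library(Biostrings)

# Toy ALT column (hypothetical values): one multi-allelic site, two ambiguous alleles
alt_toy <- DNAStringSetList("C", "T", c("G", "GN"), "WA")

# Flatten to a DNAStringSet, as unlist(alt(vcf)) does in the thread
flat <- unlist(alt_toy)

# Single TRUE/FALSE: are all letters plain A, C, G or T?
only_bases <- hasOnlyBaseLetters(flat)

# One row per allele, one column per letter of the DNA alphabet
af <- alphabetFrequency(flat)

# Flag alleles with a count in any column other than A, C, G, T
non_base_cols <- setdiff(colnames(af), c("A", "C", "G", "T"))
non_base <- rowSums(af[, non_base_cols, drop = FALSE]) > 0

# The offending alleles
flat[non_base]
```

Selecting the columns by name rather than by position avoids depending on the column order of the alphabetFrequency() matrix.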
@valerie-obenchain-4275
Last seen 2.2 years ago
United States
Hi,

Please remember to hit 'reply all' when responding so we keep communication on the list.

If you're interested in mapping the ambiguity codes in alt to their base-pair equivalents, see ?IUPAC_CODE_MAP in the Biostrings package.

To identify rows with ambiguous codes you can use the 'other' column of the output from alphabetFrequery(), called with baseOnly=TRUE:

    alt <- DNAStringSetList("WA", "G", "NA", c("AG", "M"), "A")
    af <- alphabetFrequency(unlist(alt), baseOnly=TRUE)
    ambiguous <- any(relist(af[,"other"] > 0L, alt))

    > ambiguous
    [1]  TRUE FALSE  TRUE  TRUE FALSE

A VCF can be subset by rows (variants) or columns (samples) using '['. Remove ambiguous rows and keep all samples:

    vcf[!ambiguous, ]

Valerie

On 06/19/2014 03:28 AM, "Dr. Jörg Linde" wrote:
> Dear Valerie,
> thank you sooo much. Helped a lot. Version is VariantAnnotation_1.8.13.
>
> >hasOnlyBaseLetters(unlist(alt(vcf)))
> FALSE
>
> > unlist(alt(vcf))[rowSums(alphabetFrequency(unlist(alt(vcf)))[,5:17])>0]
>   A DNAStringSet instance of length 11
>       width seq
>   [1]     2 GN
>   [2]     2 WA
>   [3]    12 GTATGTGTNTAT
>   [4]     2 NA
>   [5]     2 YC
>   ...   ... ...
>   [7]     6 AGANGA
>   [8]     2 GN
>   [9]     6 MCAATA
>  [10]    11 GTAGTANTAGT
>  [11]     2 TN
>
> I am just looking for an elegant way to remove these lines from my vcf
>
> best
> Jörg
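On the IUPAC_CODE_MAP pointer: a rough sketch of how an ambiguous allele such as "WA" could be expanded into the plain-base alleles it stands for. IUPAC_CODE_MAP is the named character vector shipped with Biostrings; the helper expand_iupac() is hypothetical, written here only for illustration, and it assumes the input contains nothing but IUPAC nucleotide letters (no gaps or other symbols).

```r
library(Biostrings)

# Look up a few ambiguity codes, e.g. W stands for A or T
IUPAC_CODE_MAP[c("W", "M", "N")]

# Hypothetical helper: expand one allele string into every plain-base
# sequence it can represent. Assumes only IUPAC nucleotide letters.
expand_iupac <- function(s) {
  codes <- strsplit(s, "")[[1]]
  # For each position, the vector of bases that the code allows
  choices <- strsplit(unname(IUPAC_CODE_MAP[codes]), "")
  # All combinations across positions, pasted back into strings
  combos <- expand.grid(choices, stringsAsFactors = FALSE)
  apply(combos, 1, paste0, collapse = "")
}

expand_iupac("WA")
```

Note that the number of expansions grows quickly with the number of N positions, so for calls like predictCoding() it may be simpler to drop the ambiguous rows as shown above than to expand them.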