Hello,
I keep getting an error in predictCoding() which is giving me a bit of a headache.
I read somewhere that it may be due to illegal ALT character but it does not appear to be the case.
> ref <- FaFile("sequences.fa")
> coding <- predictCoding(vcf, txdb, seqSource=ref)
Error in .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], :
in 'x[[220612]]': not a base at pos 1
> alt(vcf)[[220612]]
A DNAStringSet instance of length 1
width seq
[1] 1 T
> ref(vcf)[[220612]]
1-letter "DNAString" instance
seq: A
> hasOnlyBaseLetters(unlist(alt(vcf)))
[1] TRUE
> hasOnlyBaseLetters(unlist(ref(vcf)))
[1] TRUE
> colSums(alphabetFrequency(unlist(alt(vcf))))
A C G T M R W S Y K
1097599 861657 863029 1099514 0 0 0 0 0 0
V H D B N - + .
0 0 0 0 0 0 0 0
I went ahead and looked up the corresponding line in the .vcf (v4.0) file, but nothing strange seems
to be going on there
C1 29273358 . A T 271 PASS BRF=0.0;FR=0.1031;HP=3;HapScore=3;MGOF=62;MMLQ=35;MQ=41.74;NF=1;NR=8;PP=271;QD= 33.9745791719;SC=AAAACTTCAAATTTAAGGCTT;SbPval=0.65;Source=Platypus;TC=315;TCF=125;TCR=190;TR=9;WE=29273366;WS=29273343 GT:GL:GOF:GQ:NR:NV 0 /0:0.0,-16.23,-207.1:1:99:57:0 1/1:-33.0,-2.23,0.0:31:22:9:9 0/0:0.0,-18.89,-244.1:31:99:66:0 0/0:0.0,-18.27,-233.9:62:99:64:0 0 /0:0.0,-18.27,-232.6:1:99:63:0 0/0:0.0,-9.84,-126.5:1:98:34:0 0/0:0.0,-0.45,-11.0:0:6:3:0 0/0:0.0,-0.4,-11.6:43:5:4:0 0/0:0.0,-0.92,- 19.5:3:10:5:0 0/0:0.0,-2.11,-35.4:5:21:10:0
By the way there must be an easier way to recover SNP coordinates for a given allele (I converted Granges to data.frame)
df <- data.frame(seqnames=seqnames(gr),
starts=start(gr),
ends=end(gr),
names=c(rep(".", length(gr))),
scores=c(rep(".", length(gr))),
strands=strand(gr))
df[220612,]
seqnames starts ends names scores strands
220612 C1 29273358 29273358 . . *
Any idea what else could be causing problems?
Agnieszka
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] VariantAnnotation_1.12.9 Rsamtools_1.18.2 Biostrings_2.34.1
[4] XVector_0.6.0 GenomicFeatures_1.18.3 AnnotationDbi_1.28.1
[7] Biobase_2.22.0 GenomicRanges_1.18.4 GenomeInfoDb_1.2.4
[10] IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.9
[4] BiocParallel_1.0.3 biomaRt_2.18.0 bitops_1.0-6
[7] brew_1.0-6 BSgenome_1.30.0 checkmate_1.5.1
[10] codetools_0.2-9 DBI_0.3.1 digest_0.6.4
[13] fail_1.2 foreach_1.4.2 GenomicAlignments_1.2.1
[16] iterators_1.0.7 RCurl_1.95-4.3 RSQLite_1.0.0
[19] rtracklayer_1.26.2 sendmailR_1.2-1 stringr_0.6.2
[22] tools_3.1.2 XML_3.98-1.1 zlibbioc_1.8.0
Also checked alt(vcf) is a DNAStringSetList
> alt(vcf)
DNAStringSetList of length 5514356
I think that problem may be due to filtering out of non-SNP variants from the original file which contains indels and MNPs. I used a pretty simple custom script which retains only lines which have one nucleotide in ref and one (or multiple single comma separated nucleotides) in ALT. Is the a VariantAnnotation friendly way of doing this (filtering out after reading in vcf file?). I am using Platypus (http://www.well.ox.ac.uk/platypus) for variant calling and as far as I can tell there is no way to make it output SNPs only.
I think that problem may be due to filtering out of non-SNP variants from the original file which contains indels and MNPs. I used a pretty simple custom script which retains only lines which have one nucleotide in ref and one (or multiple single comma separated nucleotides) in ALT. Is the a VariantAnnotation friendly way of doing this (filtering out after reading in vcf file?). I am using Platypus (http://www.well.ox.ac.uk/platypus) for variant calling and as far as I can tell there is no way to make it output SNPs only.