Question: predictCoding error DNAStringSet_translate
0
gravatar for Miss Agnieszka Aleksandra Golicz
4.0 years ago by
Australia

Hello,

I keep getting an error in predictCoding() which is giving me a bit of a headache.

I read somewhere that it may be due to illegal ALT character but it does not appear to be the case.

> ref <- FaFile("sequences.fa")
> coding <- predictCoding(vcf, txdb, seqSource=ref)
Error in .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet],  :
  in 'x[[220612]]': not a base at pos 1
> alt(vcf)[[220612]]
  A DNAStringSet instance of length 1
    width seq
[1]     1 T
> ref(vcf)[[220612]]
  1-letter "DNAString" instance
seq: A
> hasOnlyBaseLetters(unlist(alt(vcf)))
[1] TRUE
> hasOnlyBaseLetters(unlist(ref(vcf)))
[1] TRUE

> colSums(alphabetFrequency(unlist(alt(vcf))))
      A       C       G       T       M       R       W       S       Y       K
1097599  861657  863029 1099514       0       0       0       0       0       0
      V       H       D       B       N       -       +       .
      0       0       0       0       0       0       0       0

I went ahead and looked up the corresponding line in the .vcf (v4.0) file, but nothing strange seems
to be going on there

C1      29273358        .       A       T       271     PASS    BRF=0.0;FR=0.1031;HP=3;HapScore=3;MGOF=62;MMLQ=35;MQ=41.74;NF=1;NR=8;PP=271;QD=         33.9745791719;SC=AAAACTTCAAATTTAAGGCTT;SbPval=0.65;Source=Platypus;TC=315;TCF=125;TCR=190;TR=9;WE=29273366;WS=29273343  GT:GL:GOF:GQ:NR:NV    0         /0:0.0,-16.23,-207.1:1:99:57:0  1/1:-33.0,-2.23,0.0:31:22:9:9   0/0:0.0,-18.89,-244.1:31:99:66:0        0/0:0.0,-18.27,-233.9:62:99:64:0      0         /0:0.0,-18.27,-232.6:1:99:63:0  0/0:0.0,-9.84,-126.5:1:98:34:0  0/0:0.0,-0.45,-11.0:0:6:3:0     0/0:0.0,-0.4,-11.6:43:5:4:0     0/0:0.0,-0.92,-         19.5:3:10:5:0   0/0:0.0,-2.11,-35.4:5:21:10:0 

By the way there must be an easier way to recover SNP coordinates for a given allele (I converted Granges to data.frame)

df <- data.frame(seqnames=seqnames(gr),
  starts=start(gr),
  ends=end(gr),
  names=c(rep(".", length(gr))),
  scores=c(rep(".", length(gr))),
  strands=strand(gr))

 df[220612,]
       seqnames   starts     ends names scores strands
220612       C1 29273358 29273358     .      .       *

Any idea what else could be causing problems?

Agnieszka

R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] VariantAnnotation_1.12.9 Rsamtools_1.18.2         Biostrings_2.34.1
 [4] XVector_0.6.0            GenomicFeatures_1.18.3   AnnotationDbi_1.28.1
 [7] Biobase_2.22.0           GenomicRanges_1.18.4     GenomeInfoDb_1.2.4
[10] IRanges_2.0.1            S4Vectors_0.4.0          BiocGenerics_0.12.1

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.9
 [4] BiocParallel_1.0.3      biomaRt_2.18.0          bitops_1.0-6
 [7] brew_1.0-6              BSgenome_1.30.0         checkmate_1.5.1
[10] codetools_0.2-9         DBI_0.3.1               digest_0.6.4
[13] fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.1
[16] iterators_1.0.7         RCurl_1.95-4.3          RSQLite_1.0.0
[19] rtracklayer_1.26.2      sendmailR_1.2-1         stringr_0.6.2
[22] tools_3.1.2             XML_3.98-1.1            zlibbioc_1.8.0

ADD COMMENTlink modified 4.0 years ago by Valerie Obenchain6.7k • written 4.0 years ago by Miss Agnieszka Aleksandra Golicz40

Also checked alt(vcf) is a DNAStringSetList

> alt(vcf)
DNAStringSetList of length 5514356

 

ADD REPLYlink written 4.0 years ago by Miss Agnieszka Aleksandra Golicz40

I think that problem may be due to filtering out of non-SNP variants from the original file which contains indels and MNPs. I used a pretty simple custom script which retains only lines which have one nucleotide in ref and one (or multiple single comma separated nucleotides) in ALT. Is the a VariantAnnotation friendly way of doing this (filtering out after reading in vcf file?). I am using Platypus (http://www.well.ox.ac.uk/platypus) for variant calling and as far as I can tell there is no way to make it output SNPs only.

ADD REPLYlink written 4.0 years ago by Miss Agnieszka Aleksandra Golicz40

I think that problem may be due to filtering out of non-SNP variants from the original file which contains indels and MNPs. I used a pretty simple custom script which retains only lines which have one nucleotide in ref and one (or multiple single comma separated nucleotides) in ALT. Is the a VariantAnnotation friendly way of doing this (filtering out after reading in vcf file?). I am using Platypus (http://www.well.ox.ac.uk/platypus) for variant calling and as far as I can tell there is no way to make it output SNPs only.

ADD REPLYlink written 4.0 years ago by Miss Agnieszka Aleksandra Golicz40
Answer: predictCoding error DNAStringSet_translate
0
gravatar for Valerie Obenchain
4.0 years ago by
United States
Valerie Obenchain6.7k wrote:

Hi Agnieszka,

predictCoding() should work on all variants, not just SNPS. Can you send the file off-line, email or dropbox so I can take a look?

As to the filtering question, there is a family of functions that filters a VCF on variant type. See the ?isSNV man page.

You also asked about recovering coordinates for a given allele. You can subset the VCF class by a numeric or logical vector. If you want row(s) 'i',

sub <- vcf[i, ]

rowData(sub)

or pull the rowData out and work with that. The rowData() accessor has become rowRanges() in devel.

rowRanges(vcf)[i,]

Isolate SNPs, extract those with allele 'A':

snps <- vcf[isSNV(vcf),]

snps[unlist(alt(snps) %in% "A"), ]

Valerie

ADD COMMENTlink written 4.0 years ago by Valerie Obenchain6.7k
1

Thanks for sending the files. The problem was that one of the reference codons had an 'N' and could not be translated. The 'N' wasn't present in the original reference allele but it was one of the 3 nucleotides pulled down when the codon was retrieved from the fasta. The code checked for 'N' values in the alternate alleles but not the reference codons. This has been fixed in 1.13.46.

Valerie

ADD REPLYlink written 4.0 years ago by Valerie Obenchain6.7k

Thank you very much.

ADD REPLYlink written 4.0 years ago by Miss Agnieszka Aleksandra Golicz40

For the future could you tell me how to remove problematic lines.

I tried:

invalid <- c(123)

vcf2 <- vcf[!invalid], but this did not seem to work.

Agnieszka

ADD REPLYlink written 4.0 years ago by Miss Agnieszka Aleksandra Golicz40

You shouldn't have to do anything on your end. Lines with 'N' in ref or alt will be ignored, not translated. If they are in the data you'll see a warning (not error):

Warning message:
varAllele values containing 'N' were not translated 

or

Warning message:
reference codons containing 'N' were not translated

If you want to remove/keep rows in a VCF object the example you have with c(1,2,3) will work. It's identifying which rows are bad that's difficult. The internal code does some additional subsetting/excluding so the index printed in the error message may not correspond to the same row in the VCF object. Herve just enhanced the error message in Biostrings 2.35.12 so instead of this

'x[[220612]]': not a base at pos 1

the error will print out the exact character that is causing the problems. That will give you a heads up as to what to look for in the VCF.

Valerie

ADD REPLYlink written 4.0 years ago by Valerie Obenchain6.7k

Thanks again. I understand that the fixes are in the devel version of VariantAnnotation. I will have to wait for my admin to install R-devel and will try it out.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Miss Agnieszka Aleksandra Golicz40

I can confirm that the code works on the full data set. Thanks for the fix!

Agnieszka

ADD REPLYlink written 4.0 years ago by Miss Agnieszka Aleksandra Golicz40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 471 users visited in the last hour