predictCoding error DNAStringSet_translate
1
0
Entering edit mode
@miss-agnieszka-aleksandra-golicz-6531
Last seen 3.6 years ago
Australia

Hello,

I keep getting an error in predictCoding() which is giving me a bit of a headache.

I read somewhere that it may be due to illegal ALT character but it does not appear to be the case.

> ref <- FaFile("sequences.fa")
> coding <- predictCoding(vcf, txdb, seqSource=ref)
Error in .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet],  :
  in 'x[[220612]]': not a base at pos 1
> alt(vcf)[[220612]]
  A DNAStringSet instance of length 1
    width seq
[1]     1 T
> ref(vcf)[[220612]]
  1-letter "DNAString" instance
seq: A
> hasOnlyBaseLetters(unlist(alt(vcf)))
[1] TRUE
> hasOnlyBaseLetters(unlist(ref(vcf)))
[1] TRUE

> colSums(alphabetFrequency(unlist(alt(vcf))))
      A       C       G       T       M       R       W       S       Y       K
1097599  861657  863029 1099514       0       0       0       0       0       0
      V       H       D       B       N       -       +       .
      0       0       0       0       0       0       0       0

I went ahead and looked up the corresponding line in the .vcf (v4.0) file, but nothing strange seems
to be going on there

C1      29273358        .       A       T       271     PASS    BRF=0.0;FR=0.1031;HP=3;HapScore=3;MGOF=62;MMLQ=35;MQ=41.74;NF=1;NR=8;PP=271;QD=         33.9745791719;SC=AAAACTTCAAATTTAAGGCTT;SbPval=0.65;Source=Platypus;TC=315;TCF=125;TCR=190;TR=9;WE=29273366;WS=29273343  GT:GL:GOF:GQ:NR:NV    0         /0:0.0,-16.23,-207.1:1:99:57:0  1/1:-33.0,-2.23,0.0:31:22:9:9   0/0:0.0,-18.89,-244.1:31:99:66:0        0/0:0.0,-18.27,-233.9:62:99:64:0      0         /0:0.0,-18.27,-232.6:1:99:63:0  0/0:0.0,-9.84,-126.5:1:98:34:0  0/0:0.0,-0.45,-11.0:0:6:3:0     0/0:0.0,-0.4,-11.6:43:5:4:0     0/0:0.0,-0.92,-         19.5:3:10:5:0   0/0:0.0,-2.11,-35.4:5:21:10:0 

By the way there must be an easier way to recover SNP coordinates for a given allele (I converted Granges to data.frame)

df <- data.frame(seqnames=seqnames(gr),
  starts=start(gr),
  ends=end(gr),
  names=c(rep(".", length(gr))),
  scores=c(rep(".", length(gr))),
  strands=strand(gr))

 df[220612,]
       seqnames   starts     ends names scores strands
220612       C1 29273358 29273358     .      .       *

Any idea what else could be causing problems?

Agnieszka

R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] VariantAnnotation_1.12.9 Rsamtools_1.18.2         Biostrings_2.34.1
 [4] XVector_0.6.0            GenomicFeatures_1.18.3   AnnotationDbi_1.28.1
 [7] Biobase_2.22.0           GenomicRanges_1.18.4     GenomeInfoDb_1.2.4
[10] IRanges_2.0.1            S4Vectors_0.4.0          BiocGenerics_0.12.1

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.9
 [4] BiocParallel_1.0.3      biomaRt_2.18.0          bitops_1.0-6
 [7] brew_1.0-6              BSgenome_1.30.0         checkmate_1.5.1
[10] codetools_0.2-9         DBI_0.3.1               digest_0.6.4
[13] fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.1
[16] iterators_1.0.7         RCurl_1.95-4.3          RSQLite_1.0.0
[19] rtracklayer_1.26.2      sendmailR_1.2-1         stringr_0.6.2
[22] tools_3.1.2             XML_3.98-1.1            zlibbioc_1.8.0

variantannotation predict coding • 1.8k views
ADD COMMENT
0
Entering edit mode

Also checked alt(vcf) is a DNAStringSetList

> alt(vcf)
DNAStringSetList of length 5514356

 

0
Entering edit mode

I think that problem may be due to filtering out of non-SNP variants from the original file which contains indels and MNPs. I used a pretty simple custom script which retains only lines which have one nucleotide in ref and one (or multiple single comma separated nucleotides) in ALT. Is the a VariantAnnotation friendly way of doing this (filtering out after reading in vcf file?). I am using Platypus (http://www.well.ox.ac.uk/platypus) for variant calling and as far as I can tell there is no way to make it output SNPs only.

0
Entering edit mode

I think that problem may be due to filtering out of non-SNP variants from the original file which contains indels and MNPs. I used a pretty simple custom script which retains only lines which have one nucleotide in ref and one (or multiple single comma separated nucleotides) in ALT. Is the a VariantAnnotation friendly way of doing this (filtering out after reading in vcf file?). I am using Platypus (http://www.well.ox.ac.uk/platypus) for variant calling and as far as I can tell there is no way to make it output SNPs only.

0
Entering edit mode
@valerie-obenchain-4275
Last seen 11 months ago
United States

Hi Agnieszka,

predictCoding() should work on all variants, not just SNPS. Can you send the file off-line, email or dropbox so I can take a look?

As to the filtering question, there is a family of functions that filters a VCF on variant type. See the ?isSNV man page.

You also asked about recovering coordinates for a given allele. You can subset the VCF class by a numeric or logical vector. If you want row(s) 'i',

sub <- vcf[i, ]

rowData(sub)

or pull the rowData out and work with that. The rowData() accessor has become rowRanges() in devel.

rowRanges(vcf)[i,]

Isolate SNPs, extract those with allele 'A':

snps <- vcf[isSNV(vcf),]

snps[unlist(alt(snps) %in% "A"), ]

Valerie

ADD COMMENT
1
Entering edit mode

Thanks for sending the files. The problem was that one of the reference codons had an 'N' and could not be translated. The 'N' wasn't present in the original reference allele but it was one of the 3 nucleotides pulled down when the codon was retrieved from the fasta. The code checked for 'N' values in the alternate alleles but not the reference codons. This has been fixed in 1.13.46.

Valerie

ADD REPLY
0
Entering edit mode

Thank you very much.

0
Entering edit mode

For the future could you tell me how to remove problematic lines.

I tried:

invalid <- c(123)

vcf2 <- vcf[!invalid], but this did not seem to work.

Agnieszka

0
Entering edit mode

You shouldn't have to do anything on your end. Lines with 'N' in ref or alt will be ignored, not translated. If they are in the data you'll see a warning (not error):

Warning message:
varAllele values containing 'N' were not translated 

or

Warning message:
reference codons containing 'N' were not translated

If you want to remove/keep rows in a VCF object the example you have with c(1,2,3) will work. It's identifying which rows are bad that's difficult. The internal code does some additional subsetting/excluding so the index printed in the error message may not correspond to the same row in the VCF object. Herve just enhanced the error message in Biostrings 2.35.12 so instead of this

'x[[220612]]': not a base at pos 1

the error will print out the exact character that is causing the problems. That will give you a heads up as to what to look for in the VCF.

Valerie

ADD REPLY
0
Entering edit mode

Thanks again. I understand that the fixes are in the devel version of VariantAnnotation. I will have to wait for my admin to install R-devel and will try it out.

0
Entering edit mode

I can confirm that the code works on the full data set. Thanks for the fix!

Agnieszka

Login before adding your answer.

Traffic: 175 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6