Question

Error message from protein coding prediction with VariantAnnotation

0

Entering edit mode

rwan.work • 0

@rwanwork-13888

Last seen 22 months ago

United Kingdom

Hi all,

So I previously posted a question, which wasn't a minimal working example. I stripped it down now and hope I'm posting correctly now.

In short, I'm trying to get VariantAnnotation working with a non-reference genome where I provide my own GFF and FASTA. I'm getting an error with particular lines in my VCF file (i.e., some records in the VCF work and some do not). I've reduced my example to just two SNPs, neither of which work. (I'm not showing any samples that do work.)

The variable vcf has been saved at this link as an RDS file.

With that set, then a record of my execution is as follows, with the error message that is stumping me at the very end:

> library (VariantAnnotation)
> library (GenomicFeatures)
> txdb <- makeTxDbFromGFF (file = "http://sgd-archive.yeastgenome.org/sequence/strains/BY4742/BY4742_Toronto_2012/BY4742_Toronto_2012.gff.gz", dataSource="BY4742", organism="Saccharomyces cerevisiae")
Import genomic features from the file as a GRanges object ... trying URL 'http://sgd-archive.yeastgenome.org/sequence/strains/BY4742/BY4742_Toronto_2012/BY4742_Toronto_2012.gff.gz'
Content type 'binary/octet-stream' length 4540589 bytes (4.3 MB)
==================================================
downloaded 4.3 MB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could
  not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns):
  YBR084C-A_mRNA
> coding <- predictCoding (vcf, txdb, seqSource="http://sgd-archive.yeastgenome.org/sequence/strains/BY4742/BY4742_Toronto_2012/BY4742_Toronto_2012.fsa.gz")
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'seqlevels': unable to find an inherited method for function ‘seqinfo’ for signature ‘"character"’

The warning message from makeTxDbFromGFF is "ok". I've checked the transcript in question and two exons are on different strands; there's probably something wrong with the description of its features.

Neither SNP has gone passed the end of the chromosome in question (808695 bp). Beyond that, I can't see anything obviously wrong with my input data.

My session information is:

> sessionInfo ()
R version 4.1.3 (2022-03-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS/LAPACK: /home/rwan/.conda/envs/vcf2gene/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] GenomicFeatures_1.46.1      AnnotationDbi_1.56.1       
 [3] VariantAnnotation_1.40.0    Rsamtools_2.10.0           
 [5] Biostrings_2.62.0           XVector_0.34.0             
 [7] SummarizedExperiment_1.24.0 Biobase_2.54.0             
 [9] GenomicRanges_1.46.1        GenomeInfoDb_1.30.0        
[11] IRanges_2.28.0              S4Vectors_0.32.3           
[13] MatrixGenerics_1.6.0        matrixStats_0.62.0         
[15] BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9               lattice_0.20-45          prettyunits_1.1.1       
 [4] png_0.1-7                assertthat_0.2.1         digest_0.6.29           
 [7] utf8_1.2.2               BiocFileCache_2.2.0      R6_2.5.1                
[10] RSQLite_2.2.8            httr_1.4.4               pillar_1.8.1            
[13] zlibbioc_1.40.0          rlang_1.0.5              progress_1.2.2          
[16] curl_4.3.2               blob_1.2.3               Matrix_1.4-1            
[19] BiocParallel_1.28.3      stringr_1.4.1            RCurl_1.98-1.8          
[22] bit_4.0.4                biomaRt_2.50.0           DelayedArray_0.20.0     
[25] rtracklayer_1.54.0       compiler_4.1.3           pkgconfig_2.0.3         
[28] tidyselect_1.1.2         KEGGREST_1.34.0          tibble_3.1.8            
[31] GenomeInfoDbData_1.2.7   XML_3.99-0.10            fansi_1.0.3             
[34] crayon_1.5.1             dplyr_1.0.10             dbplyr_2.2.1            
[37] GenomicAlignments_1.30.0 bitops_1.0-7             rappdirs_0.3.3          
[40] grid_4.1.3               lifecycle_1.0.2          DBI_1.1.3               
[43] magrittr_2.0.3           cli_3.4.0                stringi_1.7.8           
[46] cachem_1.0.6             xml2_1.3.3               ellipsis_0.3.2          
[49] filelock_1.0.2           vctrs_0.4.1              generics_0.1.3          
[52] rjson_0.2.21             restfulr_0.0.15          tools_4.1.3             
[55] bit64_4.0.5              BSgenome_1.62.0          glue_1.6.2              
[58] purrr_0.3.4              hms_1.1.2                yaml_2.3.5              
[61] parallel_4.1.3           fastmap_1.1.0            memoise_2.0.1           
[64] BiocIO_1.4.0

Any help would be appreciated! Thank you!

Ray

R VariantAnnotation • 1.1k views

ADD COMMENT • link 22 months ago rwan.work • 0

0

Entering edit mode

That's not an example that anybody can use to test, as dput isn't meant to be used on such a complicated object. For example, something like this

listData = list(REF = new("DNAStringSet", pool = new("SharedRaw_Pool", 
        xp_list = list(<pointer: (nil)>), .link_to_cached_object_list = list(
            <environment>))

Cannot be simply copy/pasted into R, because <environment> is actually an environment on your machine, and obviously passing that in as a character string won't work. If you want to provide an example, you should output that object as an RDS and put somewhere that people can download.

ADD REPLY • link 22 months ago James W. MacDonald 66k

0

Entering edit mode

Hi James,

Thanks a lot for the explanation! I've never used dput before, but found some information on how to create a "minimal working example" for R and the person who posted that message suggested dput. Clearly, s/he was incorrect -- thank you for pointing that out!

I've edited my post with an RDS now. I hope it is easier for someone to understand my problem.

Thank you!

Ray

ADD REPLY • link 22 months ago rwan.work • 0

score 2 · Accepted Answer · 2022-09-15

From ?predictCoding

Arguments:

   query: A VCF, IntegerRanges, GRanges or 'VRanges' instance
          containing the variants to be annotated. If 'query' is a
          IntegerRanges or 'VRanges' it is coerced to a GRanges.  If a
          VCF is provided the 'GRanges' returned by the 'rowRanges()'
          accessor will be used.  All metadata columns are ignored.

          When 'query' is not a 'VCF' object a 'varAllele' must be
          provided. The 'varAllele' must be a 'DNAStringSet' the same
          length as the 'query'. If there are multiple alternate
          alleles per variant the 'query' must be expanded to one row
          per each alternate allele.  See examples.

          NOTE: Variants are expected to conform to the VCF specs as
          described on the 1000 Genomes page (see references). Indels
          must include the reference allele; zero-width ranges are not
          valid and return no matches.

 subject: A TxDb object that serves as the annotation. GFF files can be
          converted to TxDb objects with 'makeTxDbFromGFF()' in the
          'GenomicFeatures' package.

seqSource: A 'BSgenome' instance or an FaFile to be used for sequence
          extraction.

And note the choices for 'seqSource', neither of which is a URI pointing to a file.

> vcf <- readRDS("20220915.rds")
> library(VariantAnnotation)
> library(GenomicFeatures)
> txdb <- makeTxDbFromGFF (file = "http://sgd-archive.yeastgenome.org/sequence/strains/BY4742/BY4742_Toronto_2012/BY4742_Toronto_2012.gff.gz", dataSource="BY4742", organism="Saccharomyces cerevisiae")

## get the FASTA file
> download.file("http://sgd-archive.yeastgenome.org/sequence/strains/BY4742/BY4742_Toronto_2012/BY4742_Toronto_2012.fsa.gz", "TO.fa.gz")
## ungzip b/c it's not bgzipped
> system("gzip -d TO.fa.gz")
## index
> indexFa("TO.fa")
##FaFile object
> fafile <- FaFile("TO.fa")
## now let's go
> coding <- predictCoding (vcf, txdb, seqSource=fafile)
> coding
GRanges object with 3 ranges and 17 metadata columns:
                   seqnames    ranges strand | paramRangeID            REF
                      <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
   chr02:2629_CT/C    chr02 2629-2630      - |           NA             CT
   chr02:2629_CT/C    chr02 2629-2630      - |           NA             CT
  chr02:577385_G/C    chr02    577385      - |           NA              G
                                  ALT      QUAL      FILTER      varAllele
                   <DNAStringSetList> <numeric> <character> <DNAStringSet>
   chr02:2629_CT/C                  C     34.73        PASS              G
   chr02:2629_CT/C                  C     34.73        PASS              G
  chr02:577385_G/C                  C   4949.77        PASS              G
                      CDSLOC    PROTEINLOC   QUERYID        TXID         CDSID
                   <IRanges> <IntegerList> <integer> <character> <IntegerList>
   chr02:2629_CT/C     29-30            10         1        4037     3986,3987
   chr02:2629_CT/C   270-271         90,91         1        4038     3986,3987
  chr02:577385_G/C       299           100         2        4211          4166
                        GENEID CONSEQUENCE       REFCODON       VARCODON
                   <character>    <factor> <DNAStringSet> <DNAStringSet>
   chr02:2629_CT/C     YBL113C  frameshift            AAG             AG
   chr02:2629_CT/C     YBL112C  frameshift         AAAGGG          AAGGG
  chr02:577385_G/C     YBR173C  nonsense              TCA            TGA
                           REFAA         VARAA
                   <AAStringSet> <AAStringSet>
   chr02:2629_CT/C                            
   chr02:2629_CT/C                            
  chr02:577385_G/C             S             *
  -------
  seqinfo: 18 sequences (1 circular) from 2 genomes (NA, BY4742)