Thomas Sandmann
Hi Valerie,
thanks a lot for supporting legacy versions of the Ensembl database and
the variant_effect_predictor.pl script.
> I assume you're still using version 67 and have the data cached.
Yes, that's right. We use Ensembl release 67 together with the
corresponding variant_effect_predictor.pl script, version 2.5.
> How are you calling the script right now?
As a temporary fix, I am using the ensemblVEP method from ensemblVEP
version 1.1.3 (BioC svn revision r76970). I think this is the last
version that worked with Ensembl release 67 for me.
I modified the default parameters in the VEPParam object by creating a
temporary "gVEPParam" class for use with our in-house Ensembl release 67.
This object is passed to the ensemblVEP method with the default
parameters listed below. (Please note that our installation of
variant_effect_predictor.pl connects to our in-house database by
default.)
Formal class 'gVEPParam' with 6 slots
..@ basic :List of 5
.. ..$ verbose : logi FALSE
.. ..$ quiet : logi FALSE
.. ..$ no_progress: logi TRUE
.. ..$ config : chr(0)
.. ..$ everything : logi FALSE
..@ input :List of 4
.. ..$ species : chr "homo_sapiens"
.. ..$ format : chr(0)
.. ..$ output_file : chr(0)
.. ..$ force_overwrite: logi FALSE
..@ output :List of 24
.. ..$ terms : chr "so"
.. ..$ sift : chr "b"
.. ..$ polyphen : chr "b"
.. ..$ regulatory : logi FALSE
.. ..$ cell_type : chr(0)
.. ..$ hgvs : logi TRUE
.. ..$ hgnc : logi TRUE
.. ..$ gene : logi TRUE
.. ..$ protein : logi TRUE
.. ..$ ccds : logi TRUE
.. ..$ canonical : logi TRUE
.. ..$ xref_refseq: logi FALSE
.. ..$ numbers : logi TRUE
.. ..$ domains : logi TRUE
.. ..$ most_severe: logi FALSE
.. ..$ summary : logi FALSE
.. ..$ per_gene : logi FALSE
.. ..$ convert : chr(0)
.. ..$ fields : chr(0)
.. ..$ vcf : logi FALSE
.. ..$ gvf : logi FALSE
.. ..$ original : logi FALSE
.. ..$ custom : chr(0)
.. ..$ plugin : chr "GNECondel,/Plugins/config/Condel/config"
..@ filterqc:List of 17
.. ..$ check_ref : logi FALSE
.. ..$ coding_only : logi FALSE
.. ..$ check_existing : logi TRUE
.. ..$ check_alleles : logi FALSE
.. ..$ check_svs : logi FALSE
.. ..$ individual : chr(0)
.. ..$ chr : chr(0)
.. ..$ no_intergenic : logi FALSE
.. ..$ filter_common : logi FALSE
.. ..$ check_frequency : logi FALSE
.. ..$ freq_pop : chr(0)
.. ..$ freq_freq : logi FALSE
.. ..$ freq_gt_lt : chr(0)
.. ..$ freq_filter : chr(0)
.. ..$ filter : chr(0)
.. ..$ failed : logi FALSE
.. ..$ allow_non_variant: logi FALSE
..@ database:List of 9
.. ..$ database : logi FALSE
.. ..$ host : chr "useastdb.ensembl.org"
.. ..$ user : chr(0)
.. ..$ password : chr(0)
.. ..$ port : num(0)
.. ..$ genomes : logi FALSE
.. ..$ refseq : logi FALSE
.. ..$ db_version: num(0)
.. ..$ registry : chr(0)
..@ advanced:List of 4
.. ..$ no_whole_genome: logi FALSE
.. ..$ buffer_size : num 5000
.. ..$ compress : chr(0)
.. ..$ skip_db_check : logi FALSE
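For readers wondering how settings like those above might end up on the variant_effect_predictor.pl command line, here is a minimal Python sketch. It assumes, without having checked the ensemblVEP source, that TRUE entries become bare switches, FALSE or empty entries are dropped, and anything else is passed as `--name value`. The function name `to_flags` and the dictionary are illustrative only and are not part of ensemblVEP.

```python
def to_flags(params):
    """Turn an {option: value} mapping into a list of command-line arguments.

    Assumed convention (hypothetical, not taken from ensemblVEP itself):
      True          -> bare switch, e.g. --hgvs
      False / empty -> option omitted
      other value   -> --option value
    """
    args = []
    for name, value in params.items():
        if value is True:
            args.append(f"--{name}")
        elif value is False or value is None or value == "" or value == []:
            # FALSE, chr(0) and num(0) entries carry no information
            continue
        else:
            args.extend([f"--{name}", str(value)])
    return args


# A few entries copied from the "output" slot in the listing above;
# an empty list stands in for R's chr(0).
output_opts = {
    "terms": "so",
    "sift": "b",
    "polyphen": "b",
    "regulatory": False,
    "cell_type": [],
    "hgvs": True,
    "plugin": "GNECondel,/Plugins/config/Condel/config",
}

print(" ".join(to_flags(output_opts)))
```

Printing the joined list gives a string that can be compared against the options in the script's help text before anything is actually run.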
> Do you use the --cache flag or --offline flag?
I am not using the --cache flag right now, because version 2.5 of the
variant_effect_predictor.pl script does not allow me to specify the
Plugin directory and the cache directory separately. (This was only
introduced in a later version of the Perl script.)
The --offline flag does not seem to be available in
variant_effect_predictor.pl version 2.5; at least I cannot find it in
the listed arguments (provided below for reference).
version 2.5
Options
=======
--help                  Display this message and quit
--verbose               Display verbose output as the script runs
                        [default: off]
--quiet                 Suppress status and warning messages [default: off]
--no_progress           Suppress progress bars [default: off]
--config                Load configuration from file. Any command line options
                        specified overwrite those in the file [default: off]
--everything            Shortcut switch to turn on commonly used options.
                        See web documentation for details [default: off]
-i | --input_file       Input file - if not specified, reads from STDIN.
                        Files may be gzip compressed.
--format                Specify input file format - one of "ensembl",
                        "pileup", "vcf", "hgvs", "id" or "guess" to try and
                        work out format.
-o | --output_file      Output file. Write to STDOUT by specifying -o STDOUT
                        - this will force --quiet
                        [default: "variant_effect_output.txt"]
--force_overwrite       Force overwriting of output file
                        [default: quit if file exists]
--original              Writes output as it was in input - must be used with
                        --filter since no consequence data is added
                        [default: off]
--vcf                   Write output as VCF [default: off]
--gvf                   Write output as GVF [default: off]
--fields [field list]   Define a custom output format by specifying a
                        comma-separated list of field names. Field names
                        normally present in the "Extra" field may also be
                        specified, including those added by plugin modules.
                        Can also be used to configure VCF output columns
                        [default: off]
--species [species]     Species to use [default: "human"]
-t | --terms            Type of consequence terms to output - one of
                        "ensembl", "SO", "NCBI" [default: ensembl]
--sift=[p|s|b]          Add SIFT [p]rediction, [s]core or [b]oth
                        [default: off]
--polyphen=[p|s|b]      Add PolyPhen [p]rediction, [s]core or [b]oth
                        [default: off]
--regulatory            Look for overlaps with regulatory regions. The script
                        can also call if a variant falls in a high
                        information position within a transcription factor
                        binding site. Output lines have a Feature type of
                        RegulatoryFeature or MotifFeature [default: off]
--cell_type [types]     Report only regulatory regions that are found in the
                        given cell type(s). Can be a single cell type or a
                        comma-separated list. The functional type in each
                        cell type is reported under CELL_TYPE in the output.
                        To retrieve a list of cell types, use
                        "--cell_type list" [default: off]
--custom [file list]    Add custom annotations from tabix-indexed files. See
                        documentation for full details [default: off]
--plugin [plugin_name]  Use named plugin module [default: off]
--hgnc                  Add HGNC gene identifiers to output [default: off]
--hgvs                  Output HGVS identifiers (coding and protein).
                        Requires database connection [default: off]
--ccds                  Output CCDS transcript identifiers [default: off]
--xref_refseq           Output aligned RefSeq mRNA identifier for transcript.
                        NB: the RefSeq and Ensembl transcripts aligned in
                        this way MAY NOT, AND FREQUENTLY WILL NOT, match
                        exactly in sequence, exon structure and protein
                        product [default: off]
--protein               Output Ensembl protein identifer [default: off]
--gene                  Force output of Ensembl gene identifer - disabled by
                        default unless using --cache or --no_whole_genome
                        [default: off]
--canonical             Indicate if the transcript for this consequence is
                        the canonical transcript for this gene
                        [default: off]
--domains               Include details of any overlapping protein domains
                        [default: off]
--numbers               Include exon & intron numbers [default: off]
--no_intergenic         Excludes intergenic consequences from the output
                        [default: off]
--coding_only           Only return consequences that fall in the coding
                        region of transcripts [default: off]
--most_severe           Ouptut only the most severe consequence per
                        variation. Transcript-specific columns will be left
                        blank. [default: off]
--summary               Output only a comma-separated list of all
                        consequences per variation. Transcript-specific
                        columns will be left blank. [default: off]
--per_gene              Output only the most severe consequence per gene.
                        Where more than one transcript has the same
                        consequence, the transcript chosen is arbitrary.
                        [default: off]
--check_ref             If specified, checks supplied reference allele
                        against stored entry in Ensembl Core database
                        [default: off]
--check_existing        If specified, checks for existing co-located
                        variations in the Ensembl Variation database
                        [default: off]
--failed [0|1]          Include (1) or exclude (0) variants that have been
                        flagged as failed by Ensembl when checking for
                        existing variants. [default: exclude]
--check_alleles         If specified, the alleles of existing co-located
                        variations are compared to the input; an existing
                        variation will only be reported if no novel allele
                        is in the input (strand is accounted for)
                        [default: off]
--check_svs             Report overlapping structural variants
                        [default: off]
--filter [filters]      Filter output by consequence type. Use this to
                        output only variants that have at least one
                        consequence type matching the filter. Multiple
                        filters can be used separated by ",". By combining
                        this with --original it is possible to run the VEP
                        iteratively to progressively filter a set of
                        variants. See documentation for full details
                        [default: off]
--check_frequency       Turns on frequency filtering. Use this to include or
                        exclude variants based on the frequency of
                        co-located existing variants in the Ensembl
                        Variation database. You must also specify all of the
                        following --freq flags [default: off]
--freq_pop [pop]        Name of the population to use e.g. hapmap_ceu for
                        CEU HapMap, 1kg_yri for YRI 1000 genomes. See
                        documentation for more details
--freq_freq [freq]      Frequency to use in filter. Must be a number between
                        0 and 0.5
--freq_gt_lt [gt|lt]    Specify whether the frequency should be greater than
                        (gt) or less than (lt) --freq_freq
--freq_filter           Specify whether variants that pass the above should
  [exclude|include]     be included or excluded from analysis
--individual [id]       Consider only alternate alleles present in the
                        genotypes of the specified individual(s). May be a
                        single individual, a comma-separated list or "all"
                        to assess all individuals separately. Each
                        individual and variant combination is given on a
                        separate line of output. Only works with VCF files
                        containing individual genotype data; individual IDs
                        are taken from column headers.
--allow_non_variant     Prints out non-variant lines when using VCF input
--chr [list]            Select a subset of chromosomes to analyse from your
                        file. Any data not on this chromosome in the input
                        will be skipped. The list can be comma separated,
                        with "-" characters representing a range e.g.
                        1-5,8,15,X [default: off]
--gp                    If specified, tries to read GRCh37 position from GP
                        field in the INFO column of a VCF file. Only applies
                        when VCF is the input format and human is the
                        species [default: off]
--convert               Convert the input file to the output format
  [ensembl|vcf|pileup]  specified. Converted output is written to the file
                        specified in --output_file. No consequence
                        calculation is carried out when doing file
                        conversion. [default: off]
--refseq                Use the otherfeatures database to retrieve
                        transcripts - this database contains RefSeq
                        transcripts (as well as CCDS and Ensembl EST
                        alignments) [default: off]
--host                  Manually define database host
                        [default: "ensembldb.ensembl.org"]
-u | --user             Database username [default: "anonymous"]
--port                  Database port [default: 5306]
--password              Database password [default: no password]
--genomes               Sets DB connection params for Ensembl Genomes
                        [default: off]
--registry              Registry file to use defines DB connections
                        [default: off]
                        Defining a registry file overrides above connection
                        settings.
--db_version=[number]   Force script to load DBs from a specific Ensembl
                        version. Not advised due to likely incompatibilities
                        between API and DB
--no_whole_genome       Run in old-style, non-whole genome mode
                        [default: off]
--buffer_size           Sets the number of variants sent in each batch
                        [default: 5000]
                        Increasing buffer size can retrieve results more
                        quickly but requires more memory. Only applies to
                        whole genome mode.
--cache                 Enables read-only use of cache [default: off]
--dir [directory]       Specify the base cache directory to use
                        [default: "$HOME/.vep/"]
--write_cache           Enable writing to cache [default: off]
--build [all|list]      Build a complete cache for the selected species.
                        Build for all chromosomes with --build all, or a
                        list of chromosomes (see --chr). DO NOT USE WHEN
                        CONNECTED TO PUBLIC DB SERVERS AS THIS VIOLATES OUR
                        FAIR USAGE POLICY [default: off]
--compress              Specify utility to decompress cache files - may be
                        "gzcat" or "gzip -dc" Only use if default does not
                        work [default: zcat]
--skip_db_check         ADVANCED! Force the script to use a cache built from
                        a different database than specified with --host.
                        Only use this if you are sure the hosts are
                        compatible (e.g. ensembldb.ensembl.org and
                        useastdb.ensembl.org) [default: off]
--cache_region_size     ADVANCED! The size in base-pairs of the region
                        covered by one file in the cache. [default: 1MB]
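Scanning a listing this long by eye for a flag such as --offline is easy to get wrong, so a small script can do the check instead. The Python sketch below collects every long option that starts a line of help text and tests for membership. The function name `listed_options` and the `sample` string are hypothetical; in practice the text would be the script's own --help output.

```python
import re


def listed_options(help_text):
    """Return the long options (e.g. '--cache') that begin a line of help text.

    An optional short alias such as '-i | ' in front of the long option is
    skipped. Options that are only mentioned inside a description are
    ignored, because the pattern is anchored at the start of each line.
    """
    pattern = re.compile(r"^\s*(?:-\w\s*\|\s*)?(--[A-Za-z_]+)", re.MULTILINE)
    return set(pattern.findall(help_text))


# A short excerpt in the same layout as the listing above.
sample = """\
-i | --input_file Input file - if not specified, reads from STDIN.
--cache Enables read-only use of cache [default: off]
--dir [directory] Specify the base cache directory to use
--skip_db_check ADVANCED! Force the script to use a cache built from
a different database than specified with --host.
"""

opts = listed_options(sample)
print("--cache" in opts)    # True
print("--offline" in opts)  # False
```

If the help text has been re-wrapped by a mail client, an option mentioned in a description can land at the start of a line and be counted, so a hit is worth confirming by hand; a miss is the more reliable result.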
> Also, please remind me of (point me to) the plug-in you're using so
> I can test that.
We are using a single plugin that returns the Condel scores. The
*Condel plugin* can be found on GitHub here:
https://github.com/ensembl-variation/VEP_plugins
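For anyone reproducing the setup, the plugin entry in the parameter listing earlier in this message ("GNECondel,/Plugins/config/Condel/config") packs the plugin name and its argument into one comma-separated string. The Python sketch below splits such a string, assuming the usual VEP convention that the first element is the plugin name and the rest are passed to it as arguments. The helper name `split_plugin_spec` is hypothetical.

```python
def split_plugin_spec(spec):
    """Split a --plugin value into (plugin_name, [arguments]).

    Assumes the first comma-separated element is the plugin module name
    and any remaining elements are arguments handed to that plugin.
    """
    name, *args = spec.split(",")
    return name, args


name, args = split_plugin_spec("GNECondel,/Plugins/config/Condel/config")
print(name)  # GNECondel
print(args)  # ['/Plugins/config/Condel/config']
```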
Again, thanks a lot for your support. Please let me know if there is
anything I can do to help, e.g. with testing the package.
Best,
Thomas