Hi,
as I am currently working on creating a new database for the variants called in our cohort, I stumbled upon the expand function in VariantAnnotation for the vcf object
expand(x, ..., row.names = FALSE): Expand (unlist) the ALT column of a CollapsedVCF object to one row per ALT value. Variables with Number='A' have one value per alternate allele and are expanded accordingly. The 'AD' genotype field is expanded into REF/ALT pairs. For all other fields, the rows are replicated to match the elementNROWS of ALT.
this is exactly what I need.
Unfortunatly I dont think it works as intended
here is what i do
>vcf <- readVcf("/archive/sample_data/sy244/sy244pa.recalibrated_variants_vep.vcf.gz", "hg19")
> rowRanges(vcf)[1942]
GRanges object with 1 range and 5 metadata columns:
seqnames ranges strand | paramRangeID
<Rle> <IRanges> <Rle> | <factor>
chr1:46084614_C/CTTT chr1 [46084614, 46084614] * | <NA>
REF ALT QUAL FILTER
<DNAStringSet> <DNAStringSetList> <numeric> <character>
chr1:46084614_C/CTTT C CTTT,CTTTT 2365.77 PASS
-------
seqinfo: 85 sequences from hg19 genome
> geno(vcf)$GT[1942]
[1] "1/2"
> geno(vcf)$AD[1942]
[[1]]
[1] 0 25 32
So you see this entry has two alts which is reflected in all the fields
then the vcf gets expanded
> evcf <- expand(vcf)
> rowRanges(evcf)[1942]
GRanges object with 1 range and 5 metadata columns:
seqnames ranges strand | paramRangeID REF
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
[1] chr1 [46084614, 46084614] * | <NA> C
ALT QUAL FILTER
<DNAStringSet> <numeric> <character>
[1] CTTT 2365.77 PASS
-------
seqinfo: 85 sequences from hg19 genome
So far so good, however
> geno(evcf)$AD[1942]
[1] 0
This should clearly be 0 25
So the second part of the depth is missing. This is also true for all the other splitted variants.
AM I doing something wrong, or is this a bug?
TY for your help
also:
> sessionInfo() R version 3.3.1 (2016-06-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] stringr_1.1.0 RPostgreSQL_0.4-1 [3] DBI_0.5-1 optparse_1.3.2 [5] data.table_1.10.0 ensemblVEP_1.12.0 [7] VariantAnnotation_1.18.7 Rsamtools_1.24.0 [9] Biostrings_2.40.2 XVector_0.12.1 [11] SummarizedExperiment_1.2.3 Biobase_2.32.0 [13] GenomicRanges_1.24.3 GenomeInfoDb_1.8.7 [15] IRanges_2.6.1 S4Vectors_0.10.3 [17] BiocGenerics_0.18.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.8 AnnotationDbi_1.34.4 magrittr_1.5 [4] GenomicAlignments_1.8.4 zlibbioc_1.18.0 getopt_1.20.0 [7] BiocParallel_1.6.6 BSgenome_1.40.1 tools_3.3.1 [10] digest_0.6.11 rtracklayer_1.32.2 bitops_1.0-6 [13] RCurl_1.95-4.8 biomaRt_2.28.0 memoise_1.0.0 [16] RSQLite_1.1 stringi_1.1.2 GenomicFeatures_1.24.5 [19] XML_3.98-1.5

We'll need to see the VCF header and ideally the line that causes the problem.
> vcf class: CollapsedVCF dim: 61872 1 rowRanges(vcf): GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER info(vcf): DataFrame with 23 columns: AC, AF, AN, BaseQRankSum, ClippingRankSum, DP, ... info(header(vcf)): Number Type Description AC A Integer Allele count in genotypes, for each AL... AF A Float Allele Frequency, for each ALT allele,... AN 1 Integer Total number of alleles in called geno... BaseQRankSum 1 Float Z-score from Wilcoxon rank sum test of... ClippingRankSum 1 Float Z-score From Wilcoxon rank sum test of... DP 1 Integer Approximate read depth; some reads may... DS 0 Flag Were any of the samples downsampled? END 1 Integer Stop position of the interval FS 1 Float Phred-scaled p-value using Fisher's ex... HaplotypeScore 1 Float Consistency of the site with at most t... InbreedingCoeff 1 Float Inbreeding coefficient as estimated fr... MLEAC A Integer Maximum likelihood expectation (MLE) f... MLEAF A Float Maximum likelihood expectation (MLE) f... MQ 1 Float RMS Mapping Quality MQRankSum 1 Float Z-score From Wilcoxon rank sum test of... NEGATIVE_TRAIN_SITE 0 Flag This variant was used to build the neg... POSITIVE_TRAIN_SITE 0 Flag This variant was used to build the pos... QD 1 Float Variant Confidence/Quality by Depth ReadPosRankSum 1 Float Z-score from Wilcoxon rank sum test of... SOR 1 Float Symmetric Odds Ratio of 2x2 contingenc... VQSLOD 1 Float Log odds of being a true variant versu... culprit 1 String The annotation which was the worst per... CSQ . String Consequence annotations from Ensembl V... geno(vcf): SimpleList of length 10: GT, AD, DP, GQ, MIN_DP, PGT, PID, PL, RGQ, SB geno(header(vcf)): Number Type Description GT 1 String Genotype AD . Integer Allelic depths for the ref and alt alleles in the o... DP 1 Integer Approximate read depth (reads with MQ=255 or with b... GQ 1 Integer Genotype Quality MIN_DP 1 Integer Minimum DP observed within the GVCF block PGT 1 String Physical phasing haplotype information, describing ... PID 1 String Physical phasing ID information, where each unique ... PL G Integer Normalized, Phred-scaled likelihoods for genotypes ... RGQ 1 Integer Unconditional reference genotype confidence, encode... SB 4 Integer Per-sample component statistics which comprise the ... > header(vcf) class: VCFHeader samples(1): sy244pa meta(2): META contig fixed(2): FILTER ALT info(23): AC AF ... culprit CSQ geno(10): GT AD ... RGQ SBDo you also need the header in the file, as it is quite long?
and here the line (shorted=<...>)
The actual file header would be useful (in a pastebin or someting) so that I can construct a file to reproduce this.
Here the header plus one of the lines, that show this behaviour
http://pastebin.com/f0hVctUG