I'm having some trouble importing VCFs into seqCAT. I used Sarek to produce VCFs with haplotypecaller and mutect2, and annotated with snpEff and VEP, but importing either fails:
> vcf <-("../VCFs/haplotypecaller/haplotypecaller_T0_CTL_CTL_rep1_snpEff_VEP.ann.vcf.gz") > create_profile(vcf, 'T0_CTL_CTL_rep1', 'T0_CTL_CTL_rep1.txt') Reading VCF file ... Creating SNV profile ... Error in `[<-.data.frame`(`*tmp*`, !grepl("^rs[0-9]+", data$rsID), "rsID", : replacement has 1 row, data has 0 In addition: Warning messages: 1: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames 2: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames 3: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
I've tried the same with tumor-normal VCFs generated with mutect2, but get a different error:
> vcfm<-('../VCFs/mutect/mutect2_T2_CTL_CTL_rep1_vs_T0_CTL_CTL_rep1_snpEff_VEP.ann.vcf.gz') > create_profile(vcfm, 'T2_CTL_CTL_rep1', 'T2_CTL_CTL_rep1.txt') Reading VCF file ... Creating SNV profile ... Error in `[<-.data.frame`(`*tmp*`, !grepl("^rs[0-9]+", data$rsID), "rsID", : replacement has 1 row, data has 0
I've checked the sample names correspond using VariantAnnotation::scanVcfHeader, but no luck. Any other suggestions?
The two VCFs are here: https://drive.google.com/open?id=1wj68DQkneLzTP65j8u1Y_tsWy7Sn94BM
> sessionInfo() R version 3.5.1 (2018-07-02) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS High Sierra 10.13.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] seqCAT_1.4.0 VariantAnnotation_1.28.3 Rsamtools_1.34.0 [4] Biostrings_2.50.1 XVector_0.22.0 SummarizedExperiment_1.12.0 [7] DelayedArray_0.8.0 BiocParallel_1.16.2 matrixStats_0.54.0 [10] Biobase_2.42.0 GenomicRanges_1.34.0 GenomeInfoDb_1.18.1 [13] IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0 loaded via a namespace (and not attached): [1] progress_1.2.0 tidyselect_0.2.5 purrr_0.2.5 [4] lattice_0.20-35 rtracklayer_1.42.1 GenomicFeatures_1.34.1 [7] blob_1.1.1 XML_3.98-1.16 rlang_0.3.0.1 [10] pillar_1.3.0 glue_1.3.0 DBI_1.0.0 [13] bit64_0.9-7 bindrcpp_0.2.2 GenomeInfoDbData_1.2.0 [16] bindr_0.1.1 stringr_1.3.1 zlibbioc_1.28.0 [19] memoise_1.1.0 biomaRt_2.38.0 AnnotationDbi_1.44.0 [22] Rcpp_1.0.0 BSgenome_1.50.0 BiocManager_1.30.4 [25] bit_1.1-14 hms_0.4.2 digest_0.6.18 [28] stringi_1.2.4 dplyr_0.7.8 grid_3.5.1 [31] tools_3.5.1 bitops_1.0-6 magrittr_1.5 [34] RCurl_1.95-4.11 RSQLite_2.1.1 tibble_1.4.2 [37] tidyr_0.8.2 crayon_1.3.4 pkgconfig_2.0.2 [40] Matrix_1.2-14 prettyunits_1.0.2 assertthat_0.2.0 [43] httr_1.3.1 rstudioapi_0.8 R6_2.3.0 [46] GenomicAlignments_1.18.0 compiler_3.5.1
It looks like the depth information does exist in the mutect, but it's in the INFO column, rather than being defined in the FORMAT column, and then being presented in the individual sample columns, which doesn't make any sense to me, as the depth information is sample-specific rather than allele-specific.
Thanks Erik and James- i really appreciate how quickly you're able to help! With your explanations and a bit of googling, i found it's a bug in Mutect2 thats since been fixed (https://github.com/broadinstitute/gatk/pull/5185 ). I'll filter the haplotypecaller vcfs and try again. I'll also let the Sarek devs know that a fix is available. Thanks so much!
Happy to help! And thank you too, James, for your answers! Regarding your HaplotypeCaller VCF, I did some thinking. I'm guessing the reason that your `FILTER` column was empty was that you have not run the data through any filtering (whether using hard thresholds or GATK'S VQSR), which I would recommend from a biological standpoint: it's important to remove variants that are more likely to be false positives. If the data is genomic you should be able to use VQSR (check out GATK's Best Practices), but it has not yet been evaluated on RNA-seq (at least at the time of writing).
I've added the relevant code, which you can find in version `1.5.1` at GitHub right now, or at Bioconductor-devel after it has been built in their overnight processes. Good luck with your analyses!