Question: Importing VCFs in seqCAT
0
6 months ago by
omeally0
USA
omeally0 wrote:

I'm having some trouble importing VCFs into seqCAT. I used Sarek to produce VCFs with haplotypecaller and mutect2, and annotated with snpEff and VEP, but importing either fails:

> vcf <-("../VCFs/haplotypecaller/haplotypecaller_T0_CTL_CTL_rep1_snpEff_VEP.ann.vcf.gz")
> create_profile(vcf, 'T0_CTL_CTL_rep1', 'T0_CTL_CTL_rep1.txt')

Creating SNV profile ...
Error in [<-.data.frame(*tmp*, !grepl("^rs[0-9]+", data$rsID), "rsID", : replacement has 1 row, data has 0 In addition: Warning messages: 1: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames 2: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames 3: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames I've tried the same with tumor-normal VCFs generated with mutect2, but get a different error: > vcfm<-('../VCFs/mutect/mutect2_T2_CTL_CTL_rep1_vs_T0_CTL_CTL_rep1_snpEff_VEP.ann.vcf.gz') > create_profile(vcfm, 'T2_CTL_CTL_rep1', 'T2_CTL_CTL_rep1.txt') Reading VCF file ... Creating SNV profile ... Error in [<-.data.frame(*tmp*, !grepl("^rs[0-9]+", data$rsID), "rsID",  :
replacement has 1 row, data has 0

I've checked the sample names correspond using VariantAnnotation::scanVcfHeader, but no luck. Any other suggestions?

The two VCFs are here: https://drive.google.com/open?id=1wj68DQkneLzTP65j8u1Y_tsWy7Sn94BM

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] seqCAT_1.4.0                VariantAnnotation_1.28.3    Rsamtools_1.34.0
[4] Biostrings_2.50.1           XVector_0.22.0              SummarizedExperiment_1.12.0
[7] DelayedArray_0.8.0          BiocParallel_1.16.2         matrixStats_0.54.0
[10] Biobase_2.42.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.1
[13] IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0

loaded via a namespace (and not attached):
[1] progress_1.2.0           tidyselect_0.2.5         purrr_0.2.5
[4] lattice_0.20-35          rtracklayer_1.42.1       GenomicFeatures_1.34.1
[7] blob_1.1.1               XML_3.98-1.16            rlang_0.3.0.1
[10] pillar_1.3.0             glue_1.3.0               DBI_1.0.0
[13] bit64_0.9-7              bindrcpp_0.2.2           GenomeInfoDbData_1.2.0
[16] bindr_0.1.1              stringr_1.3.1            zlibbioc_1.28.0
[19] memoise_1.1.0            biomaRt_2.38.0           AnnotationDbi_1.44.0
[22] Rcpp_1.0.0               BSgenome_1.50.0          BiocManager_1.30.4
[25] bit_1.1-14               hms_0.4.2                digest_0.6.18
[28] stringi_1.2.4            dplyr_0.7.8              grid_3.5.1
[31] tools_3.5.1              bitops_1.0-6             magrittr_1.5
[34] RCurl_1.95-4.11          RSQLite_2.1.1            tibble_1.4.2
[37] tidyr_0.8.2              crayon_1.3.4             pkgconfig_2.0.2
[40] Matrix_1.2-14            prettyunits_1.0.2        assertthat_0.2.0
[43] httr_1.3.1               rstudioapi_0.8           R6_2.3.0
[46] GenomicAlignments_1.18.0 compiler_3.5.1
seqcat • 256 views
modified 6 months ago by erikfas30 • written 6 months ago by omeally0
3
6 months ago by
erikfas30
erikfas30 wrote:

The first problem is that your FILTER column from the variant calling is empty (.), as James already pointed out. I have already added a parameter to optionally skip filtering on the FILTER column, and I'll likely be adding some kind of check to see if the FILTER column is empty, as in your case; these updates will be available in the next devel version, likely within the next week. I'll post here again as soon as I've finished the coding, so you can get back to analysing your data.

The second problem (your comment to James' answer) is also related to your VCF and how the depth is stored in it. By manually inspecting the first VCF (from HaplotypeCaller) you can see that it contains the depth (DP) information in the FORMAT column: it looks like this: GT:AD:DP:GQ:PL. Your second VCF (from mutect) does not follow this procedure, however, and the DP information is missing: GT:AD:AF:F1R2:F2R1:[...]. As seqCAT works using the data stored in the per-sample column based on the FORMAT column and the VariantAnnotation package, the result is that you depth for all variants becomes NA.

I have not used mutect myself, so I do not know why it doesn't store the depth information where it is usually stored with other variant callers. I shall look into this and see if I can do something about this. In the meantime, I'll do my best to incorporate the optional filtering code to the devel-version on GitHub (https://github.com/fasterius/seqCAT) as soon as possible, and you should be able to use your original HaplotypeCaller-VCFs.

It looks like the depth information does exist in the mutect, but it's in the INFO column, rather than being defined in the FORMAT column, and then being presented in the individual sample columns, which doesn't make any sense to me, as the depth information is sample-specific rather than allele-specific.

Thanks Erik and James- i really appreciate how quickly you're able to help! With your explanations and a bit of googling, i found it's a bug in Mutect2 thats since been fixed (https://github.com/broadinstitute/gatk/pull/5185 ). I'll filter the haplotypecaller vcfs and try again. I'll also let the Sarek devs know that a fix is available. Thanks so much!

Happy to help! And thank you too, James, for your answers! Regarding your HaplotypeCaller VCF, I did some thinking. I'm guessing the reason that your FILTER column was empty was that you have not run the data through any filtering (whether using hard thresholds or GATK'S VQSR), which I would recommend from a biological standpoint: it's important to remove variants that are more likely to be false positives. If the data is genomic you should be able to use VQSR (check out GATK's Best Practices), but it has not yet been evaluated on RNA-seq (at least at the time of writing).

I've added the relevant code, which you can find in version 1.5.1 at GitHub right now, or at Bioconductor-devel after it has been built in their overnight processes. Good luck with your analyses!

0
6 months ago by
United States
James W. MacDonald50k wrote:

The FILTER column in (at least the VCF I looked at) are all missing, and create_profile filters on those that are 'PASS', so after that filtering step you have no remaining data. You probably need to figure out how to get Sarek to put something useful in the FILTER column of your VCF.

Thanks for the tip, but its still not working, although the error from the mutect VCF is now similar to haplotypecaller:

gatk FilterMutectCalls -V mutect2_T2_CTL_CTL_rep1_vs_T0_CTL_CTL_rep1_snpEff_VEP.ann.vcf.gz -O mutect2_T2_CTL_CTL_rep1_vs_T0_CTL_CTL_rep1_snpEff_VEP.ann.filtered.vcf.gz

> vcfm<-('../VCFs/mutect/mutect2_T2_CTL_CTL_rep1_vs_T0_CTL_CTL_rep1_snpEff_VEP.ann.filtered.vcf.gz')
> create_profile(vcfm, 'T2_CTL_CTL_rep1', 'T2_CTL_CTL_rep1.txt')
Creating SNV profile ...
Error in [<-.data.frame(*tmp*, !grepl("^rs[0-9]+", data$rsID), "rsID", : replacement has 1 row, data has 0 In addition: Warning messages: 1: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames 2: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames 3: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames  Also scanVcfHeader gives the error: > scanVcfHeader(vcfm) class: VCFHeader samples(2): T0_CTL_CTL_rep1 T2_CTL_CTL_rep1 meta(8): SnpEffCmd SnpEffVersion ... GATKCommandLine contig fixed(1): FILTER info(16): ANN CSQ ... STR TLOD geno(16): GT AD ... SA_MAP_AF SA_POST_PROB Warning message: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames  ADD REPLYlink modified 6 months ago • written 6 months ago by omeally0 You should probably be a bit more careful with your terminology and/or understanding of what R is telling you. The error for every failure you have had has been identical - when create_profile is filtering your data, you end up with no data remaining. This error is reflected in the error message: Error in [<-.data.frame(*tmp*, !grepl("^rs[0-9]+", data$rsID), "rsID",  :
replacement has 1 row, data has 0

This comes from create_profile_R, in this portion of the code:

    gr <- gr[gr$FILTER == "PASS", ] gr$FILTER <- NULL
gr <- gr[gr$DP >= filter_depth & !is.na(gr$DP), ]
data <- GenomicRanges::as.data.frame(gr)



So you have to have 'PASS' in your FILTER column, and you also have to have DP values >= 10 (if you use the defaults for filter_depth). If you don't have both PASS and DP >= 10, then you end up with zero rows (which is what is happening).

If you get something like

In addition: Warning messages:
duplicate keys in header will be forced to unique rownames
duplicate keys in header will be forced to unique rownames
duplicate keys in header will be forced to unique rownames

Those are warnings, which are NOT errors. The difference being that a warning says 'hey, this thing happened that you might not have expected', but R kept going, and an error means that R stopped running the function because something went wrong.

As for why you keep getting errors, you now know why, and it's up to you to check your input data to see what's up.

Thanks for your help James, I really appreciate it! I take your point re warnings/errors (still learning R/bioconductor), but I'm confused because after running FilterMutectCalls, the filter column is populated and many variants have DP>=10, but no profile is written to disk. One eg variant:

1 1581926 . G T . PASS ANN=T|intron_variant|MODIFIER|CDK11B|ENSG00000248333|transcript|ENST00000407249|protein_coding|2/20|c.228-517C&gt;A||||||;CSQ=T|intron_variant|MODIFIER|CDK11B|ENSG00000248333|Transcript|ENST00000407249|protein_coding||2/20||||||||rs61774917||-1||SNV|HGNC|1729|YES||||ENSP00000464036||Q6P5Y5&amp;Q5QPQ9&amp;J3QR44&amp;A4VCI5|UPI0000D61E1A|||||||||||||||||||||||||||||||||;DP=122;ECNT=1;NLOD=16.09;N_ART_LOD=-0.5349;POP_AF=1e-05;P_CONTAM=0.00;P_GERMLINE=-2.392e+01;TLOD=5.9 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/0:63,1:0.04:31,0:32,1:40,26:227,228:27:20 0/1:47,4:0.111:23,4:24,0:39,36:215,209:30:17:0.071,0.061,0.078:0.016,0.00978,0.974


I'm still searching for what might cause the warnings - seems its common to the VariantAnnotation package. Presumably the header is odd somehow, or do indels need to be filtered? I can read the VCFs using read_vcfs_as_granges from the MutationalPatterns package, but it issues warnings that indels and/or multiple alternative alleles are excluded.

According to VariantAnnotation that's not correct.

> z <- readVcf("mutect2_T2_CTL_CTL_rep1_vs_T0_CTL_CTL_rep1_snpEff_VEP.ann.filtered.vcf")
Warning messages:
duplicate keys in header will be forced to unique rownames
duplicate keys in header will be forced to unique rownames
> table(geno(z)$DP) < table of extent 0 > > head(geno(z)$DP)
T0_CTL_CTL_rep1 T2_CTL_CTL_rep1
1:899778_C/T                NA              NA
1:990924_C/T                NA              NA
1:1138884_C/A               NA              NA
1:1327869_AT/A              NA              NA
1:1581926_G/T               NA              NA
1:1920434_TA/T              NA              NA
Hmmm, I'm at a loss then. In the above snippt, doesn't DP=122` for that variant?