Question

Importing VCFs in seqCAT

0

Entering edit mode

omeally • 0

@omeally-18776

Last seen 3.9 years ago

USA

I'm having some trouble importing VCFs into seqCAT. I used Sarek to produce VCFs with haplotypecaller and mutect2, and annotated with snpEff and VEP, but importing either fails:

> vcf <-("../VCFs/haplotypecaller/haplotypecaller_T0_CTL_CTL_rep1_snpEff_VEP.ann.vcf.gz")
> create_profile(vcf, 'T0_CTL_CTL_rep1', 'T0_CTL_CTL_rep1.txt')

Reading VCF file ...
Creating SNV profile ...
Error in `[<-.data.frame`(`*tmp*`, !grepl("^rs[0-9]+", data$rsID), "rsID",  :
  replacement has 1 row, data has 0
In addition: Warning messages:
1: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
2: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
3: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames

I've tried the same with tumor-normal VCFs generated with mutect2, but get a different error:

> vcfm<-('../VCFs/mutect/mutect2_T2_CTL_CTL_rep1_vs_T0_CTL_CTL_rep1_snpEff_VEP.ann.vcf.gz')
> create_profile(vcfm, 'T2_CTL_CTL_rep1', 'T2_CTL_CTL_rep1.txt')
Reading VCF file ...
Creating SNV profile ...
Error in `[<-.data.frame`(`*tmp*`, !grepl("^rs[0-9]+", data$rsID), "rsID",  :
  replacement has 1 row, data has 0

I've checked the sample names correspond using VariantAnnotation::scanVcfHeader, but no luck. Any other suggestions?

The two VCFs are here: https://drive.google.com/open?id=1wj68DQkneLzTP65j8u1Y_tsWy7Sn94BM

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] seqCAT_1.4.0                VariantAnnotation_1.28.3    Rsamtools_1.34.0           
 [4] Biostrings_2.50.1           XVector_0.22.0              SummarizedExperiment_1.12.0
 [7] DelayedArray_0.8.0          BiocParallel_1.16.2         matrixStats_0.54.0         
[10] Biobase_2.42.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
[13] IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] progress_1.2.0           tidyselect_0.2.5         purrr_0.2.5             
 [4] lattice_0.20-35          rtracklayer_1.42.1       GenomicFeatures_1.34.1  
 [7] blob_1.1.1               XML_3.98-1.16            rlang_0.3.0.1           
[10] pillar_1.3.0             glue_1.3.0               DBI_1.0.0               
[13] bit64_0.9-7              bindrcpp_0.2.2           GenomeInfoDbData_1.2.0  
[16] bindr_0.1.1              stringr_1.3.1            zlibbioc_1.28.0         
[19] memoise_1.1.0            biomaRt_2.38.0           AnnotationDbi_1.44.0    
[22] Rcpp_1.0.0               BSgenome_1.50.0          BiocManager_1.30.4      
[25] bit_1.1-14               hms_0.4.2                digest_0.6.18           
[28] stringi_1.2.4            dplyr_0.7.8              grid_3.5.1              
[31] tools_3.5.1              bitops_1.0-6             magrittr_1.5            
[34] RCurl_1.95-4.11          RSQLite_2.1.1            tibble_1.4.2            
[37] tidyr_0.8.2              crayon_1.3.4             pkgconfig_2.0.2         
[40] Matrix_1.2-14            prettyunits_1.0.2        assertthat_0.2.0        
[43] httr_1.3.1               rstudioapi_0.8           R6_2.3.0                
[46] GenomicAlignments_1.18.0 compiler_3.5.1

seqcat • 1.5k views

ADD COMMENT • link updated 5.4 years ago by erikfas ▴ 30 • written 5.4 years ago by omeally • 0

score 3 · Answer 1 · 2018-12-12

3

Entering edit mode

erikfas ▴ 30

@erikfas-13696

Last seen 5.0 years ago

The first problem is that your `FILTER` column from the variant calling is empty (`.`), as James already pointed out. I have already added a parameter to optionally skip filtering on the `FILTER` column, and I'll likely be adding some kind of check to see if the `FILTER` column is empty, as in your case; these updates will be available in the next devel version, likely within the next week. I'll post here again as soon as I've finished the coding, so you can get back to analysing your data.

The second problem (your comment to James' answer) is also related to your VCF and how the depth is stored in it. By manually inspecting the first VCF (from HaplotypeCaller) you can see that it contains the depth (`DP`) information in the `FORMAT` column: it looks like this: `GT:AD:DP:GQ:PL`. Your second VCF (from mutect) does not follow this procedure, however, and the `DP` information is missing: `GT:AD:AF:F1R2:F2R1:[...]`. As seqCAT works using the data stored in the per-sample column based on the `FORMAT` column and the VariantAnnotation package, the result is that you depth for all variants becomes `NA`.

I have not used mutect myself, so I do not know why it doesn't store the depth information where it is usually stored with other variant callers. I shall look into this and see if I can do something about this. In the meantime, I'll do my best to incorporate the optional filtering code to the devel-version on GitHub (https://github.com/fasterius/seqCAT) as soon as possible, and you should be able to use your original HaplotypeCaller-VCFs.

ADD COMMENT • link 5.4 years ago erikfas ▴ 30

0

Entering edit mode

It looks like the depth information does exist in the mutect, but it's in the INFO column, rather than being defined in the FORMAT column, and then being presented in the individual sample columns, which doesn't make any sense to me, as the depth information is sample-specific rather than allele-specific.

ADD REPLY • link 5.4 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks Erik and James- i really appreciate how quickly you're able to help! With your explanations and a bit of googling, i found it's a bug in Mutect2 thats since been fixed (https://github.com/broadinstitute/gatk/pull/5185 ). I'll filter the haplotypecaller vcfs and try again. I'll also let the Sarek devs know that a fix is available. Thanks so much!

ADD REPLY • link 5.4 years ago omeally • 0

0

Entering edit mode

Happy to help! And thank you too, James, for your answers! Regarding your HaplotypeCaller VCF, I did some thinking. I'm guessing the reason that your `FILTER` column was empty was that you have not run the data through any filtering (whether using hard thresholds or GATK'S VQSR), which I would recommend from a biological standpoint: it's important to remove variants that are more likely to be false positives. If the data is genomic you should be able to use VQSR (check out GATK's Best Practices), but it has not yet been evaluated on RNA-seq (at least at the time of writing).

ADD REPLY • link 5.4 years ago erikfas ▴ 30

0

Entering edit mode

I've added the relevant code, which you can find in version `1.5.1` at GitHub right now, or at Bioconductor-devel after it has been built in their overnight processes. Good luck with your analyses!

ADD REPLY • link 5.4 years ago erikfas ▴ 30

score 0 · Answer 2 · 2018-12-11

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 10 hours ago

United States

The FILTER column in (at least the VCF I looked at) are all missing, and create_profile filters on those that are 'PASS', so after that filtering step you have no remaining data. You probably need to figure out how to get Sarek to put something useful in the FILTER column of your VCF.

ADD COMMENT • link 5.4 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks for the tip, but its still not working, although the error from the mutect VCF is now similar to haplotypecaller:

gatk FilterMutectCalls -V mutect2_T2_CTL_CTL_rep1_vs_T0_CTL_CTL_rep1_snpEff_VEP.ann.vcf.gz -O mutect2_T2_CTL_CTL_rep1_vs_T0_CTL_CTL_rep1_snpEff_VEP.ann.filtered.vcf.gz

> vcfm<-('../VCFs/mutect/mutect2_T2_CTL_CTL_rep1_vs_T0_CTL_CTL_rep1_snpEff_VEP.ann.filtered.vcf.gz')
> create_profile(vcfm, 'T2_CTL_CTL_rep1', 'T2_CTL_CTL_rep1.txt')
Reading VCF file ...
Creating SNV profile ...
Error in `[<-.data.frame`(`*tmp*`, !grepl("^rs[0-9]+", data$rsID), "rsID",  : 
  replacement has 1 row, data has 0
In addition: Warning messages:
1: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
2: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
3: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames

Also scanVcfHeader gives the error:

> scanVcfHeader(vcfm)
class: VCFHeader 
samples(2): T0_CTL_CTL_rep1 T2_CTL_CTL_rep1
meta(8): SnpEffCmd SnpEffVersion ... GATKCommandLine contig
fixed(1): FILTER
info(16): ANN CSQ ... STR TLOD
geno(16): GT AD ... SA_MAP_AF SA_POST_PROB
Warning message:
In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames`

ADD REPLY • link 5.4 years ago omeally • 0

0

Entering edit mode

You should probably be a bit more careful with your terminology and/or understanding of what R is telling you. The error for every failure you have had has been identical - when create_profile is filtering your data, you end up with no data remaining. This error is reflected in the error message:

Error in `[<-.data.frame`(`*tmp*`, !grepl("^rs[0-9]+", data$rsID), "rsID",  : 
  replacement has 1 row, data has 0

This comes from create_profile_R, in this portion of the code:

    gr <- gr[gr$FILTER == "PASS", ]
    gr$FILTER <- NULL
    gr <- gr[gr$DP >= filter_depth & !is.na(gr$DP), ]
    data <- GenomicRanges::as.data.frame(gr)

So you have to have 'PASS' in your FILTER column, and you also have to have DP values >= 10 (if you use the defaults for filter_depth). If you don't have both PASS and DP >= 10, then you end up with zero rows (which is what is happening).

If you get something like

In addition: Warning messages:
1: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
2: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
3: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames

Those are warnings, which are NOT errors. The difference being that a warning says 'hey, this thing happened that you might not have expected', but R kept going, and an error means that R stopped running the function because something went wrong.

As for why you keep getting errors, you now know why, and it's up to you to check your input data to see what's up.

ADD REPLY • link 5.4 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks for your help James, I really appreciate it! I take your point re warnings/errors (still learning R/bioconductor), but I'm confused because after running FilterMutectCalls, the filter column is populated and many variants have DP>=10, but no profile is written to disk. One eg variant:

1 1581926 . G T . PASS ANN=T|intron_variant|MODIFIER|CDK11B|ENSG00000248333|transcript|ENST00000407249|protein_coding|2/20|c.228-517C&gt;A||||||;CSQ=T|intron_variant|MODIFIER|CDK11B|ENSG00000248333|Transcript|ENST00000407249|protein_coding||2/20||||||||rs61774917||-1||SNV|HGNC|1729|YES||||ENSP00000464036||Q6P5Y5&amp;Q5QPQ9&amp;J3QR44&amp;A4VCI5|UPI0000D61E1A|||||||||||||||||||||||||||||||||;DP=122;ECNT=1;NLOD=16.09;N_ART_LOD=-0.5349;POP_AF=1e-05;P_CONTAM=0.00;P_GERMLINE=-2.392e+01;TLOD=5.9 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/0:63,1:0.04:31,0:32,1:40,26:227,228:27:20 0/1:47,4:0.111:23,4:24,0:39,36:215,209:30:17:0.071,0.061,0.078:0.016,0.00978,0.974

I'm still searching for what might cause the warnings - seems its common to the VariantAnnotation package. Presumably the header is odd somehow, or do indels need to be filtered? I can read the VCFs using read_vcfs_as_granges from the MutationalPatterns package, but it issues warnings that indels and/or multiple alternative alleles are excluded.

The filtered VCF is shared in the google drive link above

ADD REPLY • link 5.4 years ago omeally • 0

0

Entering edit mode

According to VariantAnnotation that's not correct.

> z <- readVcf("mutect2_T2_CTL_CTL_rep1_vs_T0_CTL_CTL_rep1_snpEff_VEP.ann.filtered.vcf")
Warning messages:
1: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
2: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
> table(geno(z)$DP)
< table of extent 0 >
> head(geno(z)$DP)
               T0_CTL_CTL_rep1 T2_CTL_CTL_rep1
1:899778_C/T                NA              NA
1:990924_C/T                NA              NA
1:1138884_C/A               NA              NA
1:1327869_AT/A              NA              NA
1:1581926_G/T               NA              NA
1:1920434_TA/T              NA              NA

ADD REPLY • link 5.4 years ago James W. MacDonald 65k

0

Entering edit mode

Hmmm, I'm at a loss then. In the above snippt, doesn't DP=122 for that variant?

ADD REPLY • link 5.4 years ago omeally • 0