I'm getting the same error message as found in this post when running the create_profile
function from seqCAT
:
> create_profile(vcf, "SRR7789342", "SRR7789342_profile.txt")
Reading VCF file ...
Creating SNV profile ...
Error in `[<-.data.frame`(`*tmp*`, !grepl("^rs[0-9]+", data$rsID), "rsID", :
replacement has 1 row, data has 0
In addition: Warning messages:
1: In .bcfHeaderAsSimpleList(header) :
duplicate keys in header will be forced to unique rownames
2: In .bcfHeaderAsSimpleList(header) :
duplicate keys in header will be forced to unique rownames
3: In .bcfHeaderAsSimpleList(header) :
duplicate keys in header will be forced to unique rownames
4: In .Seqinfo.mergexy(x, y) :
Each of the 2 combined objects has sequence levels not in the other:
- in 'x': chrM, chr15_KI270727v1_random, chrY_KI270740v1_random, chrUn_KI270304v1, chrUn_KI270305v1, chrUn_KI270322v1, chrUn_KI270320v1, chrUn_KI270312v1, chrUn_KI270412v1, chrUn_KI270419v1, chrUn_KI270420v1, chrUn_KI270424v1, chrUn_KI270417v1, chrUn_KI270422v1, chrUn_KI270425v1, chrUn_KI270528v1, chrUn_KI270539v1, chrUn_KI270548v1, chrUn_KI270334v1, chrUn_KI270335v1, chrUn_KI270338v1, chrUn_KI270340v1, chrUn_KI270336v1, chrUn_KI270364v1, chrUn_KI270366v1, chrUn_KI270379v1, chrUn_KI270389v1, chrUn_KI270395v1, chrUn_KI270388v1, chrUn_KI270386v1, chrUn_KI270376v1, chrUn_KI270374v1, chrUn_KI270372v1, chrUn_KI270373v1, chrUn_KI270375v1, chrUn_KI270371v1, chrUn_KI270752v1, chrUn_KI270755v1, chr1_KI270762v1_alt, chr1_KI270766v1_alt, chr1_KI270760v1_alt, chr1_GL383518v1_alt, chr1_KI270761v1_alt, chr2_KI270770v1_alt, chr2_KI270773v1_alt, chr2_KI270775v1_alt, chr2_KI270771v1_alt, chr2_KI270776v1_alt, chr2_KI270767v1_alt, chr [... truncated]
It looks like all of my variants are being filtered out? I double-checked the source code for the create_profile
function, and I understand that the following filters exist:
1. Remove variants without "PASS" in the FILTER
column.
2. Remove variants with depth below default of 10.
3. Remove non-variants, i.e. any row with greater than one char in the REF
or ALT
columns.
I went back to my .vcf to inspect for these filters, and I found a row that passes all those criteria:
chr1 187485 . G A 855.6 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-0.656;DP=37;ExcessHet=3.0103;FS=2.005;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=23.12;ReadPosRankSum=2.135;SOR=0.283;ANN=A|synonymous_variant|LOW|FO538757.1|ENSG00000279457|transcript|ENST00000623083.3|protein_coding|7/11|c.771C>T|p.Asn257Asn|771/1687|771/1395|257/464||,A|synonymous_variant|LOW|FO538757.1|ENSG00000279457|transcript|ENST00000623834.3|protein_coding|8/10|c.768C>T|p.Asn256Asn|768/1374|768/1080|256/359||,A|synonymous_variant|LOW|FO538757.1|ENSG00000279457|transcript|ENST00000624735.1|protein_coding|9/15|c.804C>T|p.Asn268Asn|1040/1604|804/1368|268/455||,A|downstream_gene_variant|MODIFIER|FO538757.2|ENSG00000279928|transcript|ENST00000624431.1|protein_coding||c.*3327G>A|||||3327|,A|downstream_gene_variant|MODIFIER|MIR6859-2|ENSG00000273874|transcript|ENST00000612080.1|miRNA||n.*406C>T|||||406| GT:AD:DP:GQ:PL 0/1:5,32:37:62:863,0,62
So, why would this row be filtered out? I know that create_profile
calls filter_annotations
to filter for the highest impact variants. Is this row getting filtered out because its impact in marked as LOW
?
For reference, I used HaplotypeCaller
, VariantFiltration
and snpEff
to get the final .vcf here: https://drive.google.com/open?id=1vfih6-QWPuczqHYDRhx6ygzzD8EjeRDm.
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)
Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
[3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
[5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
[7] LC_PAPER=en_US.iso885915 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] seqCAT_1.4.1 VariantAnnotation_1.27.9
[3] Rsamtools_1.33.7 Biostrings_2.49.2
[5] XVector_0.21.4 SummarizedExperiment_1.11.6
[7] DelayedArray_0.7.48 BiocParallel_1.15.15
[9] matrixStats_0.54.0 Biobase_2.41.2
[11] GenomicRanges_1.33.14 GenomeInfoDb_1.17.4
[13] IRanges_2.15.18 S4Vectors_0.19.22
[15] BiocGenerics_0.27.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 lattice_0.20-35 prettyunits_1.0.2
[4] ps_1.2.0 assertthat_0.2.0 rprojroot_1.3-2
[7] digest_0.6.18 R6_2.3.0 backports_1.1.2
[10] RSQLite_2.1.1 httr_1.3.1 zlibbioc_1.27.0
[13] rlang_0.3.0.1 GenomicFeatures_1.33.4 progress_1.2.0
[16] curl_3.2 callr_3.0.0 blob_1.1.1
[19] Matrix_1.2-14 desc_1.2.0 devtools_2.0.1
[22] stringr_1.3.1 RCurl_1.95-4.11 bit_1.1-14
[25] biomaRt_2.37.9 munsell_0.5.0 compiler_3.5.1
[28] rtracklayer_1.41.8 pkgconfig_2.0.2 base64enc_0.1-3
[31] pkgbuild_1.0.2 GenomeInfoDbData_1.2.0 XML_3.98-1.16
[34] crayon_1.3.4 withr_2.1.2 GenomicAlignments_1.17.3
[37] bitops_1.0-6 grid_3.5.1 gtable_0.2.0
[40] DBI_1.0.0 magrittr_1.5 cli_1.0.1
[43] stringi_1.2.4 fs_1.2.6 remotes_2.0.1
[46] testthat_2.0.1 tools_3.5.1 bit64_0.9-7
[49] glue_1.3.0 BSgenome_1.49.5 hms_0.4.2
[52] processx_3.2.0 pkgload_1.0.1 AnnotationDbi_1.44.0
[55] colorspace_1.3-2 BiocManager_1.30.3 sessioninfo_1.1.0
[58] memoise_1.1.0 usethis_1.4.0
Hi, abe!
There are a number of things you can check. First, are you using the correct sample names when creating the profile? You can check the sample names with
VariantAnnotation::scanVcfHeader(vcf)
Secondly, what type of variant calling have you performed? If you have done joint genotyping that might explain the issue; in that case, the latest development version should fix it (
1.5.1
and above). Bioconductor also has its upcoming release at April 30th, so that change will be included in the next release as well.Having only
LOW
variant impacts does not mean that they will be filtered out;seqCAT
only filters out those variants that are below the highest impact, i.e. anyMODIFIER
variant impacts would be filtered out here, as they have less impact thanLOW
. Also,seqCAT
does not remove "non-variants", but rather non-SNVs. So any variant that is longer than 1 bp, as you correctly stated, i.e. indels.I don't have permission to access the Google Drive folder, but if you provide me with access (I have requested it) I can try to look in your file to see if I can find the error as well.
I believe that this is a bug fixed in Rhtslib 1.15.6, but I am not exactly sure because your packages seem a mix of release and 'devel' versions. Be sure that
BiocManager::valid()
returns TRUE.