Question

VCF File too large for tabix. Best option to make it usable with R?

0

Entering edit mode

naive • 0

@e3519b66

Last seen 3.3 years ago

Hello everyone,

I have two large SNP data sets stored as vcf.gz files. So far, I found thet the tabix index derived from htslib is a good way to get access to genomic data that are too large for my RAM. However, it seems that both vcf.gz files are even too large to create a tabix index for them. Therefore, htslib recommends to create a CSI index. My question is: how can I access my CSI indexed data so that I can manipulate them with R to conduct for example GWAS?

I am thankful for your answers!

samtools tabix VariantAnnotation VCFArray • 2.4k views

ADD COMMENT • link updated 3.3 years ago by Vincent J. Carey, Jr. 6.7k • written 3.4 years ago by naive • 0

0

Entering edit mode

We don't expose the CSI functionality of htslib at this time. You mention RAM issues. If I understand correctly we are dealing with a chromosome with more than 2^29 positions and so tabix TBI indexing cannot work. But I would like to see the error message and version of tabix. We will look into what is required to support CSI at our end but it will take some time.

ADD REPLY • link 3.3 years ago Vincent J. Carey, Jr. 6.7k

score 0 · Answer 1 · 2021-04-03

I don't have a solution for the tabix-based behavior that you may be looking for. However, for ingestion of CSI-indexed vcf, with opportunities for RAM-limited region extraction, the following may help. This uses https://github.com/brentp/cyvcf2.

> library(reticulate)
1/5 packages newly attached/loaded, see sessionInfo() for details.

> # after pip3 install cyvcf2
> 
> cy = import("cyvcf2", convert=TRUE)

> py_help(cy$VCF)  # informative

> names(cy$VCF)
 [1] "add_filter_to_header" "add_format_to_header" "add_info_to_header"  
 [4] "add_to_header"        "close"                "contains"            
 [7] "gen_variants"         "get_header_type"      "header_iter"         
[10] "HET"                  "het_check"            "HOM_ALT"             
[13] "HOM_REF"              "ibd"                  "plot_relatedness"    
[16] "raw_header"           "relatedness"          "samples"             
[19] "seqlens"              "seqnames"             "set_index"           
[22] "set_samples"          "set_threads"          "site_relatedness"    
[25] "UNKNOWN"  

> # I added a record to the vcf file that 
> # you supplied, with position approximate length of chrom 1H
> # 622e6 

> v1 = cy$VCF("big.vcf.gz")  # throws warning  [W::bcf_hdr_check_sanity] PL should be declared as Number=G
> names(v1)
 [1] "add_filter_to_header" "add_format_to_header" "add_info_to_header"  
 [4] "add_to_header"        "close"                "contains"            
 [7] "gen_variants"         "get_header_type"      "header_iter"         
[10] "HET"                  "het_check"            "HOM_ALT"             
[13] "HOM_REF"              "ibd"                  "plot_relatedness"    
[16] "raw_header"           "relatedness"          "samples"             
[19] "seqlens"              "seqnames"             "set_index"           
[22] "set_samples"          "set_threads"          "site_relatedness"    
[25] "UNKNOWN"              "update"              

> allsamps = v1$samples

> head(allsamps)
[1] "genotype_1" "genotype_2" "genotype_3" "genotype_4" "genotype_5"
[6] "genotype_6"

> calls1 = iter_next(v1)

> class(calls1)
[1] "cyvcf2.cyvcf2.Variant" "python.builtin.object"

> names(calls1)
 [1] "aaf"                "ALT"                "call_rate"         
 [4] "CHROM"              "end"                "FILTER"            
 [7] "format"             "FORMAT"             "genotype"          
[10] "genotypes"          "gt_alt_depths"      "gt_alt_freqs"      
[13] "gt_bases"           "gt_depths"          "gt_phases"         
[16] "gt_phred_ll_het"    "gt_phred_ll_homalt" "gt_phred_ll_homref"
[19] "gt_quals"           "gt_ref_depths"      "gt_types"          
[22] "ID"                 "INFO"               "is_deletion"       
[25] "is_indel"           "is_snp"             "is_sv"             
[28] "is_transition"      "nucl_diversity"     "num_called"        
[31] "num_het"            "num_hom_alt"        "num_hom_ref"       
[34] "num_unknown"        "ploidy"             "POS"               
[37] "QUAL"               "REF"                "relatedness"       
[40] "set_format"         "set_pos"            "start"             
[43] "var_subtype"        "var_type"          

> print(calls1$REF)
[1] "G"

> print(calls1$ALT)
[1] "C"

> length(calls1$genotypes)
[1] 315
> length(allsamps)
[1] 315


> print(calls1$genotypes[1])
[[1]]
[[1]][[1]]
[1] 1

[[1]][[2]]
[1] 1

[[1]][[3]]
[1] FALSE

With a bit more programming you could specify the region to ingest and convert the results to a VRanges or other container that you find useful.