VCF File too large for tabix. Best option to make it usable with R?
1
0
Entering edit mode
naive • 0
@e3519b66
Last seen 3.8 years ago

Hello everyone,

I have two large SNP data sets stored as vcf.gz files. So far, I found thet the tabix index derived from htslib is a good way to get access to genomic data that are too large for my RAM. However, it seems that both vcf.gz files are even too large to create a tabix index for them. Therefore, htslib recommends to create a CSI index. My question is: how can I access my CSI indexed data so that I can manipulate them with R to conduct for example GWAS?

I am thankful for your answers!

samtools tabix VariantAnnotation VCFArray • 2.8k views
ADD COMMENT
0
Entering edit mode

We don't expose the CSI functionality of htslib at this time. You mention RAM issues. If I understand correctly we are dealing with a chromosome with more than 2^29 positions and so tabix TBI indexing cannot work. But I would like to see the error message and version of tabix. We will look into what is required to support CSI at our end but it will take some time.

ADD REPLY
0
Entering edit mode
@vincent-j-carey-jr-4
Last seen 3 days ago
United States

I don't have a solution for the tabix-based behavior that you may be looking for. However, for ingestion of CSI-indexed vcf, with opportunities for RAM-limited region extraction, the following may help. This uses https://github.com/brentp/cyvcf2.

> library(reticulate)
1/5 packages newly attached/loaded, see sessionInfo() for details.

> # after pip3 install cyvcf2
> 
> cy = import("cyvcf2", convert=TRUE)

> py_help(cy$VCF)  # informative

> names(cy$VCF)
 [1] "add_filter_to_header" "add_format_to_header" "add_info_to_header"  
 [4] "add_to_header"        "close"                "contains"            
 [7] "gen_variants"         "get_header_type"      "header_iter"         
[10] "HET"                  "het_check"            "HOM_ALT"             
[13] "HOM_REF"              "ibd"                  "plot_relatedness"    
[16] "raw_header"           "relatedness"          "samples"             
[19] "seqlens"              "seqnames"             "set_index"           
[22] "set_samples"          "set_threads"          "site_relatedness"    
[25] "UNKNOWN"  

> # I added a record to the vcf file that 
> # you supplied, with position approximate length of chrom 1H
> # 622e6 

> v1 = cy$VCF("big.vcf.gz")  # throws warning  [W::bcf_hdr_check_sanity] PL should be declared as Number=G
> names(v1)
 [1] "add_filter_to_header" "add_format_to_header" "add_info_to_header"  
 [4] "add_to_header"        "close"                "contains"            
 [7] "gen_variants"         "get_header_type"      "header_iter"         
[10] "HET"                  "het_check"            "HOM_ALT"             
[13] "HOM_REF"              "ibd"                  "plot_relatedness"    
[16] "raw_header"           "relatedness"          "samples"             
[19] "seqlens"              "seqnames"             "set_index"           
[22] "set_samples"          "set_threads"          "site_relatedness"    
[25] "UNKNOWN"              "update"              

> allsamps = v1$samples

> head(allsamps)
[1] "genotype_1" "genotype_2" "genotype_3" "genotype_4" "genotype_5"
[6] "genotype_6"

> calls1 = iter_next(v1)

> class(calls1)
[1] "cyvcf2.cyvcf2.Variant" "python.builtin.object"

> names(calls1)
 [1] "aaf"                "ALT"                "call_rate"         
 [4] "CHROM"              "end"                "FILTER"            
 [7] "format"             "FORMAT"             "genotype"          
[10] "genotypes"          "gt_alt_depths"      "gt_alt_freqs"      
[13] "gt_bases"           "gt_depths"          "gt_phases"         
[16] "gt_phred_ll_het"    "gt_phred_ll_homalt" "gt_phred_ll_homref"
[19] "gt_quals"           "gt_ref_depths"      "gt_types"          
[22] "ID"                 "INFO"               "is_deletion"       
[25] "is_indel"           "is_snp"             "is_sv"             
[28] "is_transition"      "nucl_diversity"     "num_called"        
[31] "num_het"            "num_hom_alt"        "num_hom_ref"       
[34] "num_unknown"        "ploidy"             "POS"               
[37] "QUAL"               "REF"                "relatedness"       
[40] "set_format"         "set_pos"            "start"             
[43] "var_subtype"        "var_type"          

> print(calls1$REF)
[1] "G"

> print(calls1$ALT)
[1] "C"

> length(calls1$genotypes)
[1] 315
> length(allsamps)
[1] 315


> print(calls1$genotypes[1])
[[1]]
[[1]][[1]]
[1] 1

[[1]][[2]]
[1] 1

[[1]][[3]]
[1] FALSE

With a bit more programming you could specify the region to ingest and convert the results to a VRanges or other container that you find useful.

Login before adding your answer.

Traffic: 890 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6