I have two large SNP data sets stored as vcf.gz files. So far, I have found that the tabix index from htslib is a good way to access genomic data that are too large for my RAM. However, it seems that both vcf.gz files are too large even to create a tabix index for them. For this case, htslib recommends creating a CSI index instead.
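For reference, this is roughly how the CSI index gets created (the file name `big.vcf.gz` is a placeholder; TBI indexing fails when any contig is longer than 2^29-1 bp):

```shell
# Standard TBI indexing fails on contigs longer than 2^29-1 bp:
tabix -p vcf big.vcf.gz          # errors out on very long chromosomes

# Create a CSI index instead; either htslib tool works:
tabix -C -p vcf big.vcf.gz       # writes big.vcf.gz.csi
# or equivalently:
bcftools index -c big.vcf.gz
```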
My question is: how can I access my CSI-indexed data so that I can manipulate them with R to conduct, for example, a GWAS?
We don't expose the CSI functionality of htslib at this time. You mention RAM issues. If I understand correctly, we are dealing with a chromosome with more than 2^29 positions, so tabix TBI indexing cannot work. I would still like to see the error message and the version of tabix you used. We will look into what is required to support CSI at our end, but it will take some time.
I don't have a solution for the tabix-based behavior that you may be looking for. However, for ingestion of a CSI-indexed VCF, with opportunities for RAM-limited region extraction, the following may help. It uses https://github.com/brentp/cyvcf2.
> library(reticulate)
1/5 packages newly attached/loaded, see sessionInfo() for details.
> # after pip3 install cyvcf2
> cy = import("cyvcf2", convert=TRUE)
> py_help(cy$VCF)  # informative
> names(cy$VCF)
 [1] "add_filter_to_header" "add_format_to_header" "add_info_to_header"
 [4] "add_to_header"        "close"                "contains"
 [7] "gen_variants"         "get_header_type"      "header_iter"
[10] "HET"                  "het_check"            "HOM_ALT"
[13] "HOM_REF"              "ibd"                  "plot_relatedness"
[16] "raw_header"           "relatedness"          "samples"
[19] "seqlens"              "seqnames"             "set_index"
[22] "set_samples"          "set_threads"          "site_relatedness"
[25] "UNKNOWN"
> # I added a record to the vcf file that
> # you supplied, with position approximate length of chrom 1H
> # 622e6
> v1 = cy$VCF("big.vcf.gz")
# throws warning [W::bcf_hdr_check_sanity] PL should be declared as Number=G
> names(v1)
 [1] "add_filter_to_header" "add_format_to_header" "add_info_to_header"
 [4] "add_to_header"        "close"                "contains"
 [7] "gen_variants"         "get_header_type"      "header_iter"
[10] "HET"                  "het_check"            "HOM_ALT"
[13] "HOM_REF"              "ibd"                  "plot_relatedness"
[16] "raw_header"           "relatedness"          "samples"
[19] "seqlens"              "seqnames"             "set_index"
[22] "set_samples"          "set_threads"          "site_relatedness"
[25] "UNKNOWN"              "update"
> allsamps = v1$samples
> head(allsamps)
[1] "genotype_1" "genotype_2" "genotype_3" "genotype_4" "genotype_5"
[6] "genotype_6"
> calls1 = iter_next(v1)
> class(calls1)
[1] "cyvcf2.cyvcf2.Variant" "python.builtin.object"
> names(calls1)
 [1] "aaf"                "ALT"                "call_rate"
 [4] "CHROM"              "end"                "FILTER"
 [7] "format"             "FORMAT"             "genotype"
[10] "genotypes"          "gt_alt_depths"      "gt_alt_freqs"
[13] "gt_bases"           "gt_depths"          "gt_phases"
[16] "gt_phred_ll_het"    "gt_phred_ll_homalt" "gt_phred_ll_homref"
[19] "gt_quals"           "gt_ref_depths"      "gt_types"
[22] "ID"                 "INFO"               "is_deletion"
[25] "is_indel"           "is_snp"             "is_sv"
[28] "is_transition"      "nucl_diversity"     "num_called"
[31] "num_het"            "num_hom_alt"        "num_hom_ref"
[34] "num_unknown"        "ploidy"             "POS"
[37] "QUAL"               "REF"                "relatedness"
[40] "set_format"         "set_pos"            "start"
[43] "var_subtype"        "var_type"
> print(calls1$REF)
[1] "G"
> print(calls1$ALT)
[1] "C"
> length(calls1$genotypes)
[1] 315
> length(allsamps)
[1] 315
> print(calls1$genotypes[1])
[[1]]
[[1]][[1]]
[1] 1

[[1]][[2]]
[1] 1

[[1]][[3]]
[1] FALSE
With a bit more programming you could specify the region to ingest and convert the results to a VRanges or other container that you find useful.
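As a rough sketch of that "bit more programming": cyvcf2's `VCF` object is callable with a region string, which resolves the query through the CSI index. The chromosome name, coordinates, and file name below are illustrative, not from your data:

```r
library(reticulate)

cy <- import("cyvcf2", convert = TRUE)
v1 <- cy$VCF("big.vcf.gz")   # the .csi index alongside the file is used automatically

# Calling the VCF object with a region string returns an iterator
# restricted to that region, so only those records enter RAM:
it <- v1("chr1H:600000000-622000000")

pos <- integer(0); ref <- character(0); alt <- list()
while (!is.null(variant <- iter_next(it))) {   # iter_next() returns NULL when exhausted
  pos <- c(pos, variant$POS)
  ref <- c(ref, variant$REF)
  alt[[length(alt) + 1L]] <- variant$ALT
}

# From here the vectors can be assembled into a container of your choice, e.g.:
# GenomicRanges::GRanges("chr1H", IRanges::IRanges(pos, width = nchar(ref)))
```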