Hello,
i have large VCF file, which contains 916 samples. Because i am interested in specific region of genom, i create smaller VCF with which looks like this:
class: CollapsedVCF
dim: 167 916
rowRanges(vcf):
GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 16 columns: DP, NZ, AD, AN, AQ, GN, HT, EF, PV, IBD1, LLD, MAF, NI5, ANN, LOF, NMD
info(header(vcf)):
Number Type Description
DP 1 Integer Total Depth
NZ 1 Integer Number of taxa with data
AD . Integer Total allelelic depths in order listed
AN . Integer Total number of alleles in order listed
AQ . Integer Average phred base quality for alleles in order listed
GN . Integer Number of taxa with genotypes AA,AB,BB or AA,AB,AC,BB,BC,CC if 2 alt alleles
HT 1 Integer Number of heterozygotes
EF 1 Float Ed factor
PV . Float p-value from segregation test between AB or AB, AC, BC if 2 alt alleles
IBD1 0 Flag only one allele present in IBD contrasts
LLD 0 Flag Site in local LD with GBS map
MAF 1 Float Minor allele frequency
NI5 0 Flag Site with 5bp of a putative indel
ANN . String Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'
LOF . String Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'
NMD . String Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'
geno(vcf):
SimpleList of length 2: GT, AD
geno(header(vcf)):
Number Type Description
GT 1 String Genotype
AD . Integer Allelic depths for the ref and alt alleles in the order listed
However, when i try to import this file using readVcfAsVRanges() i got vranges object, which contains each possible variants in each sample (916x167=173124). I include only first line:
VRanges object with 1 range and 18 metadata columns: seqnames ranges strand ref alt totalDepth refDepth altDepth sampleNames softFilterMatrix | QUAL DP NZ AD AN <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle> <integerOrRle> <integerOrRle> <factorOrRle> <matrix> | <numeric> <integer> <integer> <IntegerList> <IntegerList> [1] 8 [35004963, 35004963] * C A <NA> 1 <NA> 030-1 | <NA> 1808 388 1658,150 708,68 AQ GN HT EF PV IBD1 LLD MAF NI5 <IntegerList> <IntegerList> <integer> <numeric> <NumericList> <logical> <logical> <numeric> <logical> [1] 35,34 353,2,33 2 0.14 0 FALSE TRUE 0.088 FALSE ANN <CharacterList> [1] A|upstream_gene_variant|MODIFIER|GRMZM2G331566|GRMZM2G331566|transcript|GRMZM2G331566_T02|protein_coding||c.-285C>A|||||1049|,A|upstream_gene_variant|MODIFIER|GRMZM2G501490|GRMZM2G501490|transcript|GRMZM2G501490_T01|transposable_element||n.-1C>A|||||4391|,A|upstream_gene_variant|MODIFIER|GRMZM2G331566|GRMZM2G331566|transcript|GRMZM2G331566_T01|protein_coding||c.-84C>A|||||1232|,... LOF NMD GT <CharacterList> <CharacterList> <character> [1] 0/0 ------- seqinfo: 10 sequences from 2 genomes (NA, genom); no seqlengths hardFilters: NULL
So it looks like all variants are summarized in each sample and this is not what i expected. When i open this file in IGV browser it looks normal - so same variants are present in same sample and some not. I try to set different parameters in ScanVcfParam(), however it doesn't work. Is it any way to load it correctly?
Thank you for help,
Marcin

The VRanges class is a 'flat' representation of the VCF class and has one row per ref/alt pair. Are you saying you want to know which samples have which variants? If yes, you'll need to use the genotype field to determine that. You may want to look at ?genotypeToSnpMatrix.
Valerie
Thank you for your reply and advice. Yes, it is what i would like to do. I thought that maybe easy way to make downstream analysis. So as i understand i can have to use readGT() to get proper information about genotype?
Marcin
You can get the genotype information a couple of difference ways:
Read the data as a VCF (as you've done above) then extract the genotypes with geno():
If you don't need other information from the file you read in genotypes only with readGT():
To determine which samples have which snps it's easiest to convert a VCF object to a SnpMatrix object
then use methods from the snpStats package. See the package vignettes for examples:
library(snpStats) browseVignettes("snpStats")Valerie
Thank you very much for advice. SnpMatix seems to be very useful.
Marcin