Hello,
i have large VCF file, which contains 916 samples. Because i am interested in specific region of genom, i create smaller VCF with which looks like this:
class: CollapsedVCF dim: 167 916 rowRanges(vcf): GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER info(vcf): DataFrame with 16 columns: DP, NZ, AD, AN, AQ, GN, HT, EF, PV, IBD1, LLD, MAF, NI5, ANN, LOF, NMD info(header(vcf)): Number Type Description DP 1 Integer Total Depth NZ 1 Integer Number of taxa with data AD . Integer Total allelelic depths in order listed AN . Integer Total number of alleles in order listed AQ . Integer Average phred base quality for alleles in order listed GN . Integer Number of taxa with genotypes AA,AB,BB or AA,AB,AC,BB,BC,CC if 2 alt alleles HT 1 Integer Number of heterozygotes EF 1 Float Ed factor PV . Float p-value from segregation test between AB or AB, AC, BC if 2 alt alleles IBD1 0 Flag only one allele present in IBD contrasts LLD 0 Flag Site in local LD with GBS map MAF 1 Float Minor allele frequency NI5 0 Flag Site with 5bp of a putative indel ANN . String Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' LOF . String Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected' NMD . String Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected' geno(vcf): SimpleList of length 2: GT, AD geno(header(vcf)): Number Type Description GT 1 String Genotype AD . Integer Allelic depths for the ref and alt alleles in the order listed
However, when i try to import this file using readVcfAsVRanges() i got vranges object, which contains each possible variants in each sample (916x167=173124). I include only first line:
VRanges object with 1 range and 18 metadata columns: seqnames ranges strand ref alt totalDepth refDepth altDepth sampleNames softFilterMatrix | QUAL DP NZ AD AN <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle> <integerOrRle> <integerOrRle> <factorOrRle> <matrix> | <numeric> <integer> <integer> <IntegerList> <IntegerList> [1] 8 [35004963, 35004963] * C A <NA> 1 <NA> 030-1 | <NA> 1808 388 1658,150 708,68 AQ GN HT EF PV IBD1 LLD MAF NI5 <IntegerList> <IntegerList> <integer> <numeric> <NumericList> <logical> <logical> <numeric> <logical> [1] 35,34 353,2,33 2 0.14 0 FALSE TRUE 0.088 FALSE ANN <CharacterList> [1] A|upstream_gene_variant|MODIFIER|GRMZM2G331566|GRMZM2G331566|transcript|GRMZM2G331566_T02|protein_coding||c.-285C>A|||||1049|,A|upstream_gene_variant|MODIFIER|GRMZM2G501490|GRMZM2G501490|transcript|GRMZM2G501490_T01|transposable_element||n.-1C>A|||||4391|,A|upstream_gene_variant|MODIFIER|GRMZM2G331566|GRMZM2G331566|transcript|GRMZM2G331566_T01|protein_coding||c.-84C>A|||||1232|,... LOF NMD GT <CharacterList> <CharacterList> <character> [1] 0/0 ------- seqinfo: 10 sequences from 2 genomes (NA, genom); no seqlengths hardFilters: NULL
So it looks like all variants are summarized in each sample and this is not what i expected. When i open this file in IGV browser it looks normal - so same variants are present in same sample and some not. I try to set different parameters in ScanVcfParam(), however it doesn't work. Is it any way to load it correctly?
Thank you for help,
Marcin
The VRanges class is a 'flat' representation of the VCF class and has one row per ref/alt pair. Are you saying you want to know which samples have which variants? If yes, you'll need to use the genotype field to determine that. You may want to look at ?genotypeToSnpMatrix.
Valerie
Thank you for your reply and advice. Yes, it is what i would like to do. I thought that maybe easy way to make downstream analysis. So as i understand i can have to use readGT() to get proper information about genotype?
Marcin
You can get the genotype information a couple of difference ways:
Read the data as a VCF (as you've done above) then extract the genotypes with geno():
If you don't need other information from the file you read in genotypes only with readGT():
To determine which samples have which snps it's easiest to convert a VCF object to a SnpMatrix object
then use methods from the snpStats package. See the package vignettes for examples:
Valerie
Thank you very much for advice. SnpMatix seems to be very useful.
Marcin