Question

How to import vcf with multi sample?

0

Entering edit mode

Marcin Grzybowski • 0

@marcin-grzybowski-8436

Last seen 9.5 years ago

Poland

Hello,

i have large VCF file, which contains 916 samples. Because i am interested in specific region of genom, i create smaller VCF with which looks like this:

class: CollapsedVCF
dim: 167 916
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 16 columns: DP, NZ, AD, AN, AQ, GN, HT, EF, PV, IBD1, LLD, MAF, NI5, ANN, LOF, NMD
info(header(vcf)):
        Number Type    Description                                                                                
   DP   1      Integer Total Depth                                                                                   
   NZ   1      Integer Number of taxa with data                                                                                                                                                            
   AD   .      Integer Total allelelic depths in order listed                                                                                                                                              
   AN   .      Integer Total number of alleles in order listed                                                                                                                                            
   AQ   .      Integer Average phred base quality for alleles in order listed                                                                                                                                
   GN   .      Integer Number of taxa with genotypes AA,AB,BB or AA,AB,AC,BB,BC,CC if 2 alt alleles                                                                                                    
   HT   1      Integer Number of heterozygotes                                                                                                                                                      
   EF   1      Float   Ed factor                                                                                  
   PV   .      Float   p-value from segregation test between AB or AB, AC, BC if 2 alt alleles                                                                                                       
   IBD1 0      Flag    only one allele present in IBD contrasts                                                                                                                                         
   LLD  0      Flag    Site in local LD with GBS map                                                                                                                                                    
   MAF  1      Float   Minor allele frequency                                                                                                                                                           
   NI5  0      Flag    Site with 5bp of a putative indel                                                                                                                                                 
   ANN  .      String  Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'
   LOF  .      String  Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'                                                                                                                       
   NMD  .      String  Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'                                                                                                                
geno(vcf):
  SimpleList of length 2: GT, AD
geno(header(vcf)):
      Number Type    Description                                                 
   GT 1      String  Genotype                                                    
   AD .      Integer Allelic depths for the ref and alt alleles in the order listed

However, when i try to import this file using readVcfAsVRanges() i got vranges object, which contains each possible variants in each sample (916x167=173124). I include only first line:

VRanges object with 1 range and 18 metadata columns:
      seqnames               ranges strand         ref              alt     totalDepth       refDepth       altDepth   sampleNames softFilterMatrix |      QUAL        DP        NZ            AD            AN
         <Rle>            <IRanges>  <Rle> <character> <characterOrRle> <integerOrRle> <integerOrRle> <integerOrRle> <factorOrRle>         <matrix> | <numeric> <integer> <integer> <IntegerList> <IntegerList>
  [1]        8 [35004963, 35004963]      *           C                A           <NA>              1           <NA>         030-1                  |      <NA>      1808       388      1658,150        708,68
                 AQ            GN        HT        EF            PV      IBD1       LLD       MAF       NI5
      <IntegerList> <IntegerList> <integer> <numeric> <NumericList> <logical> <logical> <numeric> <logical>
  [1]         35,34      353,2,33         2      0.14             0     FALSE      TRUE     0.088     FALSE
                                                                                                                                                                                                                                                                                                                                                                                                   ANN
                                                                                                                                                                                                                                                                                                                                                                                       <CharacterList>
  [1] A|upstream_gene_variant|MODIFIER|GRMZM2G331566|GRMZM2G331566|transcript|GRMZM2G331566_T02|protein_coding||c.-285C>A|||||1049|,A|upstream_gene_variant|MODIFIER|GRMZM2G501490|GRMZM2G501490|transcript|GRMZM2G501490_T01|transposable_element||n.-1C>A|||||4391|,A|upstream_gene_variant|MODIFIER|GRMZM2G331566|GRMZM2G331566|transcript|GRMZM2G331566_T01|protein_coding||c.-84C>A|||||1232|,...
                  LOF             NMD          GT
      <CharacterList> <CharacterList> <character>
  [1]                                         0/0
  -------
  seqinfo: 10 sequences from 2 genomes (NA, genom); no seqlengths
  hardFilters: NULL

So it looks like all variants are summarized in each sample and this is not what i expected. When i open this file in IGV browser it looks normal - so same variants are present in same sample and some not. I try to set different parameters in ScanVcfParam(), however it doesn't work. Is it any way to load it correctly?

Thank you for help,

Marcin

vcf readvcf readvcfasvranges • 3.3k views

ADD COMMENT • link 10.0 years ago Marcin Grzybowski • 0

0

Entering edit mode

The VRanges class is a 'flat' representation of the VCF class and has one row per ref/alt pair. Are you saying you want to know which samples have which variants? If yes, you'll need to use the genotype field to determine that. You may want to look at ?genotypeToSnpMatrix.

Valerie

ADD REPLY • link 10.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thank you for your reply and advice. Yes, it is what i would like to do. I thought that maybe easy way to make downstream analysis. So as i understand i can have to use readGT() to get proper information about genotype?

Marcin

ADD REPLY • link 10.0 years ago Marcin Grzybowski • 0

0

Entering edit mode

You can get the genotype information a couple of difference ways:

Read the data as a VCF (as you've done above) then extract the genotypes with geno():

vcf <- readVcf(file, "genome")

gt <- geno(vcf)

If you don't need other information from the file you read in genotypes only with readGT():

gt <- readGT(file)

To determine which samples have which snps it's easiest to convert a VCF object to a SnpMatrix object

genotypeToSnpMatrix(vcf)

then use methods from the snpStats package. See the package vignettes for examples:

library(snpStats)

browseVignettes("snpStats")

Valerie

ADD REPLY • link 10.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thank you very much for advice. SnpMatix seems to be very useful.

Marcin

ADD REPLY • link 10.0 years ago Marcin Grzybowski • 0