Question

Problem with summarizeVariants function (VariantAnnotation) when using a multi-sample vcf

0

Entering edit mode

juls • 0

@juls-11275

Last seen 3.7 years ago

Austria

Hi all,

I am using the R package VariantAnnotation which I like very much.

I hope you can help me with an issue though - I have a problem concerning the summarizeVariants function when using a multi-sample vcf.

I have 6 samples in my vcf file.

# reading vcf
chromInfo <- seqinfo(scanVcfHeader(fl)) vcf <- readVcf(fl, chromInfo)

The samples were read in fine - see examples below:
info(vcf)

DataFrame with 2 rows and 18 columns
            AC            AF        AN BaseQRankSum ClippingRankSum        DP        DS        FS HaplotypeScore InbreedingCoeff         MLEAC         MLEAF        MQ       MQ0 MQRankSum        QD
<IntegerList> <NumericList> <integer>    <numeric>       <numeric> <integer> <logical> <numeric>      <numeric>       <numeric> <IntegerList> <NumericList> <numeric> <integer> <numeric> <numeric>
1             5         0.833         6       -0.354          -0.825        40     FALSE        NA             NA              NA            NA            NA        NA         0     2.003        NA
2             4             1         4           NA              NA        26     FALSE         0             NA              NA             2             1        NA         0        NA        NA
ReadPosRankSum                        set
      <numeric>                <character>
1          1.061 variant4-variant5-variant6
2             NA          variant5-variant6

# transcripts tx <- transcripts(ann2) txlst <- splitAsList(tx, seq_len(length(tx))) # summarize promoterVar.sum <- summarizeVariants(txlst, vcf, PromoterVariants()) head(assays(promoterVar.sum)$counts) colSums(assays(promoterVar.sum)$counts) # example output
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
25      51      51      51      51      51      51
26       7       7       7       7       7       7
2       44      44      44      44      44      44
3       97      97      97      97      97      97
27      97      97      97      97      97      97
28      97      97      97      97      97      97

As you can see the output is for all samples the same - which is not the case!
This happens also when I use findOverlaps or any other mode.

Am I doing something wrong? I very much appreciate your help!

Kind Regards,
Julia

variantannotation • 1.6k views

ADD COMMENT • link updated 8.7 years ago by Valerie Obenchain ★ 6.8k • written 8.7 years ago by juls • 0

0

Entering edit mode

Not sure about that issue, but it would be easier to get the transcript annotations like this:

txlst <- exonsBy(ann2)

ADD REPLY • link 8.7 years ago Michael Lawrence ★ 11k

score 0 · Answer 1 · 2016-08-11

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.3 years ago

United States

Hi Julia,

I'm not able to reproduce this problem using another VCF file with multiple samples, e.g., chr22.vcf.gz used in the ?summarizeOverlaps man page. You also mention you're having this problem with findOverlaps which might point to a problem with the data and not the functions.

Regions in 'txlst' are being overlapped with regions in rowRanges(vcf). You may want to inspect each of those more closely. Are the ranges 'vcf' the same?

sum(duplicated(rowRanges(vcf)))

How about 'txlst'?

Valerie

ADD COMMENT • link 8.7 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie,

Thank you so much for answer.

The individual VCF file have been generated with GATK. I used GATK's CombineVariants for merging into one multi-sample VCF.

table(sum(duplicated(txlst)))

0
28872

rowRanges(vcf)

GRanges object with 60022 ranges and 5 metadata columns:

seqnames ranges strand | paramRangeID REF ALT QUAL FILTER
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <DNAStringSetList> <numeric> <character>
chr1_scaffold_2:1084_A/AT chr1_scaffold_2 [1084, 1084] * | <NA> A AT 282.73 .
chr1_scaffold_2:6286_CT/C chr1_scaffold_2 [6286, 6287] * | <NA> CT C 390.73 .

sum(duplicated(rowRanges(vcf)))

[1] 46

Best, Julia

ADD REPLY • link 8.7 years ago juls • 0

0

Entering edit mode

Can you make 'txlst' and 'vcf' available to me off-line (valerie.obenchain@roswelpark.org) with dropbox?

Valerie

ADD REPLY • link 8.7 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

One more thing you can look at is the GT field. Is there variation across samples?

Valerie

ADD REPLY • link 8.7 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

You should have access. I uploaded a test vcf file because the complete file is so huge. Thank you!!!

ADD REPLY • link 8.7 years ago juls • 0

1

Entering edit mode

Thanks for sending the test files.

This is the last line in the test.vcf:

chr1_scaffold_43 929893 . GA G 328.73 . AC=2;AF=0.500;AN=4;DP=40;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;set=variant4-variant5 GT:AD:DP:GQ:PL ./. ./. ./. 0/1:3,10:13:82:366,0,82 0/1:16,11:27:99:364,0,563 ./.

The genotype fields are GT, AD, DP, GQ and PL. It looks like all are missing for samples 1,2,3 and 6. Instead of representing this information as the usual string of colon separated missing values e.g., ./.:.,.:.:.,.,. it's represented as ./.

Because there were no fields to parse, the vcf reader in VariantAnnotation assigned a GT value of '.' to samples with no GT data. Consequently, summarizeVariants() did not recognize '.' as a missing GT value (was expecting "0|0", "0/0", "./.", or ".|."). I've modified summarizeVariants() to treat '.' as missing so it now works on your data. I've made this change in VariantAnnotation 1.19.10 (devel) and 1.18.7 (release).

> assays(promoterVar.sum)$counts[1:5,]
      
       Sample.variant Sample.variant2 Sample.variant3 Sample.variant4
  1764              1               0               0               0
  1740              0               0               0               0
  1741              0               6               0               6
  1765              0               1               0               0
  1742              4               4               6               5
      
       Sample.variant5 Sample.variant6
  1764               0               0
  1740               1               1
  1741               0               6
  1765               0               0
  1742               5               6

I was not aware that ./. was a valid representation of 'all fields missing' and I don't see it mentioned in the vcf specs (https://samtools.github.io/hts-specs/VCFv4.3.pdf). I'm not sure how to best modify the vcf parser (or if we should) to guess at the ploidy of the missing GT value. The header is no help -

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

One could look at other non-missing genotypes or the ALT field but that seems like a lot of fishing. If the 'all fields missing' ./. string is now standard output from GATK we probably need a reasonable default value or documented best guess. Open to ideas.

Valerie

ADD REPLY • link 8.7 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thank you so much!!

ADD REPLY • link 8.7 years ago juls • 0

0

Entering edit mode

Access where? Did you send me (valerie.obenchain@roswellpark.org) a link?

ADD REPLY • link 8.7 years ago Valerie Obenchain ★ 6.8k