VariantAnnotation _ problem in merging files, mutationspectrum per sample
2
0
Entering edit mode
Anna N. • 0
@anna-n-6889
Last seen 9.3 years ago
France

Hello again

I have encountered some problems and I would like your valuable input.

My analysis pipeline produces either tsv file per patient/sample or vcf file.

Since I wanted to have all my patients in one file I tried several approaches.

I have samples that are different and also they don't have the same ranges, so rbind and cbind would not be very helpfull in my situation since they are recommended either for same sample/ different variants or same variants/ different samples.

 

One approach was to use readMutect for tsv files; create a VRange object for each tsv file, and then merge the VRanges objects.

Ac216test=readMutect("Ac216_out_test.tsv", strip=TRUE)

Ac295test=readMutect("Ac295_out_test.tsv", strip=TRUE)

Ac375test=readMutect("Ac375_out_test.tsv", strip=TRUE)

c(Ac216test, Ac295test, Ac375test)

vr_total=c(Ac216test, Ac295test, Ac375test)

vr_motifs= mutationContext(vr_total,BSgenome.Hsapiens.UCSC.hg19, unify = TRUE)

vr_mm=motifMatrix(vr_motifs, group="sampleNames", normalize=TRUE)

svg(file="report/p_mutation_spectrum.svg")

plotMutationSpectrum(vr_motifs, "sampleNames")

dev.off()

Then proceed with the rest

 

Since I don’t have only tsv files but also vcf and at some point I will need to use merged vfs I tried the following in order to compare results:

I used vcf tools to merge my vcf files and produce one vcf file with all samples

My main problem is that when I start from vcf file to  make the VRange object the sampleNames column is full of NAs.

When I use for example readVcfAsVRanges I end up with a vrange object with 324 ranges and O metadata columns, whereas sampleNames are all NA, so I cannot plot anything per sample.

fl="out2.vcf"

vcf=readVcf(fl, "hg19")

vr_test=readVcfAsVRanges(fl, "hg19")

Warning messages:

1: In .bcfHeaderAsSimpleList(header) :

  duplicate keys in header will be forced to unique rownames

2: In .vcf_usertag(map, tag, ...) :

  ScanVcfParam ‘geno’ fields not present: ‘AD’

vr_test

VRanges object with 324 ranges and 0 metadata columns:

 

 

If I use the as(vcf, “VRanges”) I end up with 1944 ranges and 101 metadata columns, but the sampleNames are retained.

fl="out2.vcf"

v Warning message:

In .bcfHeaderAsSimpleList(header) :

  duplicate keys in header will be forced to unique rownames

cf=readVcf(fl, "hg19")

vr=as(vcf,"VRanges")

vr

VRanges object with 1944 ranges and 101 metadata columns:

 

When I plot the mutationspectrum per sample I end up with identical spectra for all samples (exactly the same) and very different from what I end up with the tsv approach.

I feel a little lost. Could you point out what I am doing wrong and maybe the best approach in order to be able to plot per sample.

Thank you in advance
Anna

variantannotation • 2.0k views
ADD COMMENT
0
Entering edit mode

Let me try to separate and summarize the problems that you describe:

If you import a VCF file, you can maintain the sample names with as(readVcf(fl), "VCF").​​ So this part looks fine.

What exactly do you mean by "When I plot the mutationspectrum per sample I end up with identical spectra for all samples (exactly the same) and very different from what I end up with the tsv approach."? Is the "tsv approach" not per sample?

Can you inspect the imported variant calls and check if they are identical? This would most likely explain why the spectra are identical. 

In your statement "I have samples that are different and also they don't have the same ranges, so rbind and cbind would not be very helpfull in my situation since they are recommended either for same sample/ different variants or same variants/ different samples.", what do you mean by that different samples have different ranges?

ADD REPLY
0
Entering edit mode

Could you please edit your post and add the markup for the code? That makes it much easier to read.  Thanks.

ADD REPLY
0
Entering edit mode

Hi Julian

I edited the post, hope it is fine now.

I will try to explain a little better my problem.

The tsv files are Mutect's output, so I can use the readMutect command.

When I merge, the sample names are retained and I can plot the spectrum per sample _ meaning that I end up with a spectrum that has three distinctive spectra _one for each sample. So far so good.

I don't always end up with tsv files, I have also vcf files. For the vcf files I first merge the vcfs with vcftools (vcf merge) and then i try to import my file.

If I use the recommended approach readVcfAsVRanges, my sampleNames are full of NAs and I cannot plot my spectrum per sample (if i change the NA=sample, then I can have a spectrum per total, as everything is seen as "sample"). In this case, comparison is not easy, as I am unable to distinguish between samples, everything is considered as one sample.

If I use the vr=as(vcf,"VRanges") command and also use the param to specify the samples I want, my sample names are retained and I am able to plot the spectrum per sample. But in this case when I compare with the tsv approach, I see that I don't get the same results, and as I mentioned above the spectra are identical while the imported variant calls are not (not very different, but still not identical).

For your last question, I paste an example for chr1 for two different samples :As you can see I have two different samples that do not have identical variant calls.

 

One question that I still have, is how can one predefine the number of signatures he expects? And what if I expect only one signature?

Thank you in advance
Anna

 

#CHROM POS
chr1 1309199
chr1 1309213
chr1 9770606
chr1 21036222
chr1 40535851
chr1 42048764
chr1 150975012
chr1 155294225
chr1 156816374
chr1 175334300
chr1 180905240
chr1 182791267
chr1 185119609
chr1 197297719
chr1 200619583
chr1 206331168
chr1 226180206
chr1 241886740
 
   
#CHROM POS
chr1 1309199
chr1 1309213
chr1 42048764
chr1 64089396
chr1 150975012
chr1 156816374
chr1 162558481
chr1 175306721
chr1 182791267
chr1 200619583
chr1 206331168
chr1 235602158
 
   
   
   
   
   
   

 

   
   
   
   
   
   
   
   
   
   
ADD REPLY
0
Entering edit mode
Anna N. • 0
@anna-n-6889
Last seen 9.3 years ago
France

                

I used a different set of samples and I encountered the same problems- when I use the tsv output, after readMutect and merging, my matrix from the motifMatrix function looks like that:

                sample1       sample2      sample3     sample4      sample5     sample6       sample7    sample8
CA A.A 0.000000000 0.003007519 0.00000000 0.01775148 0.01219512 0.01754386 0.02061856 0.01587302
CA A.C 0.015151515 0.000000000 0.00000000 0.01183432 0.01219512 0.00000000 0.03092784 0.01587302
CA A.G 0.000000000 0.000000000 0.00000000 0.00000000 0.00000000 0.00000000 0.01030928 0.00000000
CA A.T 0.000000000 0.000000000 0.00000000 0.00000000 0.00000000 0.00000000 0.01030928 0.00000000
CA C.A 0.000000000 0.001503759 0.00000000 0.00000000 0.01219512 0.03508772 0.00000000 0.01587302
CA C.C 0.000000000 0.000000000 0.00000000 0.01775148 0.03658537 0.01754386 0.03092784 0.00000000
CA C.G 0.007575758 0.000000000 0.00000000 0.00591716 0.01219512 0.00000000 0.00000000 0.00000000
CA C.T 0.000000000 0.000000000 0.00000000 0.00000000 0.00000000 0.00877193 0.00000000 0.01587302
CA G.A 0.015151515 0.001503759 0.02380952 0.00000000 0.03658537 0.01754386 0.02061856 0.00000000
CA G.C 0.000000000 0.000000000 0.02380952 0.00591716 0.02439024 0.05263158 0.01030928 0.01587302

 

while when I use the merged vcf (for the same samples/analysis) and the command

vr=as(vcf,"VRanges")

my matrix looks like that:

                   sample1       sample2       sample3           sample4        sample5         sample6         sample7        sample8
CA A.A 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444
CA A.C 0.0072697900 0.0072697900 0.0072697900 0.0072697900 0.0072697900 0.0072697900 0.0072697900 0.0072697900
CA A.G 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544
CA A.T 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544
CA C.A 0.0056542811 0.0056542811 0.0056542811 0.0056542811 0.0056542811 0.0056542811 0.0056542811 0.0056542811
CA C.C 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989
CA C.G 0.0024232633 0.0024232633 0.0024232633 0.0024232633 0.0024232633 0.0024232633 0.0024232633 0.0024232633
CA C.T 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089
CA G.A 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989
CA G.C 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444
CA G.G 0.0048465267 0.0048465267 0.0048465267 0.0048465267 0.0048465267 0.0048465267 0.0048465267 0.0048465267
CA G.T 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089

 

As you can see  all rows have identical values, and this why I end up with identical spectra...Apparently something is wrong with my merged vcf...Any ideas as what I should be looking for?

 

Thank you
Anna

ADD COMMENT
0
Entering edit mode
Julian Gehring ★ 1.3k
@julian-gehring-5818
Last seen 5.0 years ago

As you suspected, there is mostly likely a problem in merging your VCF files manually. Since we don't have access to the exact code on the command line and R as well as the data itself, there is little we can do from the outside. I would suggest to examine the imported 'VRanges' object with each other, and also to compare equivalent entries of the mutect tsv and your merged vcf files.

ADD COMMENT

Login before adding your answer.

Traffic: 771 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6