Hello again
I have encountered some problems and I would like your valuable input.
My analysis pipeline produces either tsv file per patient/sample or vcf file.
Since I wanted to have all my patients in one file I tried several approaches.
I have samples that are different and also they don't have the same ranges, so rbind and cbind would not be very helpfull in my situation since they are recommended either for same sample/ different variants or same variants/ different samples.
One approach was to use readMutect for tsv files; create a VRange object for each tsv file, and then merge the VRanges objects.
Ac216test=readMutect("Ac216_out_test.tsv", strip=TRUE)
Ac295test=readMutect("Ac295_out_test.tsv", strip=TRUE)
Ac375test=readMutect("Ac375_out_test.tsv", strip=TRUE)
c(Ac216test, Ac295test, Ac375test)
vr_total=c(Ac216test, Ac295test, Ac375test)
vr_motifs= mutationContext(vr_total,BSgenome.Hsapiens.UCSC.hg19, unify = TRUE)
vr_mm=motifMatrix(vr_motifs, group="sampleNames", normalize=TRUE)
svg(file="report/p_mutation_spectrum.svg")
plotMutationSpectrum(vr_motifs, "sampleNames")
dev.off()
Then proceed with the rest
Since I don’t have only tsv files but also vcf and at some point I will need to use merged vfs I tried the following in order to compare results:
I used vcf tools to merge my vcf files and produce one vcf file with all samples
My main problem is that when I start from vcf file to make the VRange object the sampleNames column is full of NAs.
When I use for example readVcfAsVRanges I end up with a vrange object with 324 ranges and O metadata columns, whereas sampleNames are all NA, so I cannot plot anything per sample.
fl="out2.vcf" vcf=readVcf(fl, "hg19") vr_test=readVcfAsVRanges(fl, "hg19") Warning messages: 1: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames 2: In .vcf_usertag(map, tag, ...) : ScanVcfParam ‘geno’ fields not present: ‘AD’ vr_test
VRanges object with 324 ranges and 0 metadata columns:
If I use the as(vcf, “VRanges”) I end up with 1944 ranges and 101 metadata columns, but the sampleNames are retained.
fl="out2.vcf" v Warning message: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames cf=readVcf(fl, "hg19") vr=as(vcf,"VRanges") vr VRanges object with 1944 ranges and 101 metadata columns:
When I plot the mutationspectrum per sample I end up with identical spectra for all samples (exactly the same) and very different from what I end up with the tsv approach.
I feel a little lost. Could you point out what I am doing wrong and maybe the best approach in order to be able to plot per sample.
Thank you in advance
Anna
Let me try to separate and summarize the problems that you describe:
If you import a VCF file, you can maintain the sample names with as(readVcf(fl), "VCF"). So this part looks fine.
What exactly do you mean by "When I plot the mutationspectrum per sample I end up with identical spectra for all samples (exactly the same) and very different from what I end up with the tsv approach."? Is the "tsv approach" not per sample?
Can you inspect the imported variant calls and check if they are identical? This would most likely explain why the spectra are identical.
In your statement "I have samples that are different and also they don't have the same ranges, so rbind and cbind would not be very helpfull in my situation since they are recommended either for same sample/ different variants or same variants/ different samples.", what do you mean by that different samples have different ranges?
Could you please edit your post and add the markup for the code? That makes it much easier to read. Thanks.
Hi Julian
I edited the post, hope it is fine now.
I will try to explain a little better my problem.
The tsv files are Mutect's output, so I can use the readMutect command.
When I merge, the sample names are retained and I can plot the spectrum per sample _ meaning that I end up with a spectrum that has three distinctive spectra _one for each sample. So far so good.
I don't always end up with tsv files, I have also vcf files. For the vcf files I first merge the vcfs with vcftools (vcf merge) and then i try to import my file.
If I use the recommended approach readVcfAsVRanges, my sampleNames are full of NAs and I cannot plot my spectrum per sample (if i change the NA=sample, then I can have a spectrum per total, as everything is seen as "sample"). In this case, comparison is not easy, as I am unable to distinguish between samples, everything is considered as one sample.
If I use the vr=as(vcf,"VRanges") command and also use the param to specify the samples I want, my sample names are retained and I am able to plot the spectrum per sample. But in this case when I compare with the tsv approach, I see that I don't get the same results, and as I mentioned above the spectra are identical while the imported variant calls are not (not very different, but still not identical).
For your last question, I paste an example for chr1 for two different samples :As you can see I have two different samples that do not have identical variant calls.
One question that I still have, is how can one predefine the number of signatures he expects? And what if I expect only one signature?
Thank you in advance
Anna