Dear all,
following an older conversation (https://support.bioconductor.org/p/80977/), I would like to ask for your help on the following task : given a VCF file, I would like to extract from a BAM file :
a) number of READS / REF_ALLELE, and b) number of READS/ ALT_ALLELE .
I am convinced that there is a way to achieve it with Rsamtools; the issue is that my R code (below) is missing something. Any help would be very welcome and highly appreciated.
library("VariantAnnotation")
library("Rsamtools")
vcf <-VariantAnnotation::readVcf("7G.samtools.germline.HETERO.vcf", genome="hg38")
vcf_granges <- rowRanges(vcf)
sbp = ScanBamParam(which=vcf_granges)
bam_file <- "8T.intervals_chr19.IR.RC.bam"
pup = PileupParam(max_depth=1000,
min_base_quality=13,
min_mapq=0,
min_nucleotide_depth=1,
min_minor_allele_depth=0,
distinguish_strands=FALSE,
distinguish_nucleotides=TRUE,
ignore_query_Ns=TRUE,
include_deletions=FALSE,
include_insertions=FALSE)
x2 <- pileup(bam_file, scanBamParam=sbp, pileupParam=pup)
The results look in the following way :
> head(x2)
seqnames pos nucleotide count which_label
1 chr19 61766 A 35 chr19:61766-61766.1
2 chr19 61766 C 9 chr19:61766-61766.1
3 chr19 61774 A 40 chr19:61774-61774.2
4 chr19 61774 G 9 chr19:61774-61774.2
5 chr19 61799 G 7 chr19:61799-61799.3
6 chr19 61799 T 40 chr19:61799-61799.3
And what i would like to have at the end is a TABLE with :
CHR | position | refAllele | altAllele | refCount | altCount | totalCount |
chr19 | 61766 | A | C | 36 | 10 | 46 |
chr19 | 61774 | A | G | 40 | 9 | 49 |
chr19 | 61799 | T | G | 43 | 7 | 50 |
chr19 | 62155 | A | G | 49 | 11 | 60 |
chr19 | 71178 | G | C | 3 | 8 | 11 |
chr19 | 80384 | C | A | 6 | 10 | 16 |
Hi James, great to hear from you, and thank you for your feedback. i could easily do the re-formatting, do not worry, if I may rephrase the question a bit please : it would be wonderful if you could please help :
-- in the vcf file, let's consider a variant : chr19 61766 A C
-- i would like to know how many reads on this position have the allele A, and how many reads have the allele C
-- at this moment, my script only gives the number of reads that have A (i.e. 35 reads)
--'ve specified in PileUpParam : "distinguish_nucleotides=TRUE", although it did not have an effect (apparently)
how shall i do it ? thank you very much ;) !
Well, your question is pretty unclear. You say 'given a VCF, I would like to extract from a BAM file', which doesn't indicate that you want to get these data from the VCF. In addition, you don't use the VCF for anything but getting the variant positions.
In addition, your script DOES give you the read counts at each position - it just happens to come from the BAM file.
What you can get from the VCF is dependent on what is in the VCF, and you don't show that. So here is one I recently worked with:
And you can see that the geno slot is a SimpleList with 8 things, one of which is called AD, which is described as being the Allelic depths.
HI James, thank you, yes, of course, some of the VCF files (especially those generated by HaplotypeCaller, Varscan, or Mutect, just to name a few callers) do output the AD field in their output VCF file.
Here, the question is a bit different (sorry sorry that i have not described in a better way). More precisely:
1 -- the VCF file is generated by HaplotypeCaller and contains the GERMLINE variants (and indeed the VCF file contains the AD according to the GERMLINE sample)
2 -- from the GERMLINE VCF file, I would only need the POSITIONS and the NUCLEOTIDES for the corresponding ALLELES (i.e. A and C in the example: chr19 61766 A C).
3 -- what i would like to do, is to the take -- the POSITIONS of the VARIANTS from the GERMLINE VCF file, and -- the ALLELES (eg A and C in the example) and -- based on a new BAM file (from TUMOR) to compute the ALLELE FREQUENCIES of GERMLINE VARIANTS in the TUMOR sample (considering the TUMOR BAM file).
shall i re-post the question perhaps with a better description ? I believe that i could use Rsamtools in this regard, however my R code is missing an important detail, and I do not know how to fix it ;)
I must be totally confused, because it seems to me that's exactly what you have done, except for the last part, which is converting the data you get from the BAM:
Into the table you desire, which as I said at the outset is a simple data munging task.
Hi James, yes, you are right ... sorry, I was working late at night yesterday and I have not seen that the VARIANTS are grouped function of 2 consecutive lines.
Now, after a coffee, and after talking with you, everything is more clear ;)
thanks a lot for your help, and patience, and replies, and have a happy Easter time ;) !