Question: extracting Allele Read Counts - part2
gravatar for Bogdan
18 months ago by
Palo Alto, CA, USA
Bogdan520 wrote:

Dear all,

following an older conversation (, I would like to ask for your help on the following task : given a VCF file, I would like to extract from a BAM file :

a) number of READS / REF_ALLELE, and b) number of READS/ ALT_ALLELE .

I am convinced that there is a way to achieve it with Rsamtools; the issue is that my R code (below) is missing something. Any help would be very welcome and highly appreciated.


vcf <-VariantAnnotation::readVcf("7G.samtools.germline.HETERO.vcf", genome="hg38")

vcf_granges <- rowRanges(vcf)

sbp = ScanBamParam(which=vcf_granges)

bam_file <- "8T.intervals_chr19.IR.RC.bam"

pup = PileupParam(max_depth=1000,

x2 <- pileup(bam_file, scanBamParam=sbp, pileupParam=pup)

The results look in the following way :

> head(x2)
  seqnames   pos nucleotide count         which_label
1    chr19 61766          A    35 chr19:61766-61766.1
2    chr19 61766          C     9 chr19:61766-61766.1
3    chr19 61774          A    40 chr19:61774-61774.2
4    chr19 61774          G     9 chr19:61774-61774.2
5    chr19 61799          G     7 chr19:61799-61799.3
6    chr19 61799          T    40 chr19:61799-61799.3

And what i would like to have at the end is a TABLE with :


CHR position refAllele altAllele refCount altCount totalCount
chr19 61766 A C 36 10 46
chr19 61774 A G 40 9 49
chr19 61799 T G 43 7 50
chr19 62155 A G 49 11 60
chr19 71178 G C 3 8 11
chr19 80384 C A 6 10 16




ADD COMMENTlink modified 18 months ago by James W. MacDonald48k • written 18 months ago by Bogdan520
gravatar for James W. MacDonald
18 months ago by
United States
James W. MacDonald48k wrote:

This isn't a Bioconductor question. You have some data and you want to be able to re-format it into a different structure. There are any number of ways to do that, but none require a Bioconductor package. People seem to like Hadley Whickam's packages for that sort of thing, so you could look at, um, dplyr or reshape or something. Or you could probably use data.table, or just basic R functions. I would recommend figuring this out on your own, but if you really just want somebody to tell you how to do it, you could try stackexchange or R-help.

ADD COMMENTlink written 18 months ago by James W. MacDonald48k

Hi James, great to hear from you, and thank you for your feedback. i could easily do the re-formatting, do not worry, if I may rephrase the question a bit please : it would be wonderful if you could please help :

-- in the vcf file, let's consider a variant :  chr19   61766    A    C

-- i would like to know how many reads on this position have the allele A, and how many reads have the allele C

-- at this moment, my script only gives the number of reads that have A (i.e. 35 reads)

--'ve specified in PileUpParam : "distinguish_nucleotides=TRUE", although it did not have an effect (apparently)

how shall i do it ? thank you very much ;) !


ADD REPLYlink written 18 months ago by Bogdan520

Well, your question is pretty unclear. You say 'given a VCF, I would like to extract from a BAM file', which doesn't indicate that you want to get these data from the VCF. In addition, you don't use the VCF for anything but getting the variant positions.

In addition, your script DOES give you the read counts at each position - it just happens to come from the BAM file.

What you can get from the VCF is dependent on what is in the VCF, and you don't show that. So here is one I recently worked with:

> vcf
class: CollapsedVCF
dim: 46454 2
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
  DataFrame with 20 columns: AC, AF, AN, BaseQRankSum, ClippingRankSum, DB, ...
                   Number Type    Description                                  
   AC              A      Integer Allele count in genotypes, for each ALT al...
   AF              A      Float   Allele Frequency, for each ALT allele, in ...
   AN              1      Integer Total number of alleles in called genotypes  
   BaseQRankSum    1      Float   Z-score from Wilcoxon rank sum test of Alt...
   ClippingRankSum 1      Float   Z-score From Wilcoxon rank sum test of Alt...
   DB              0      Flag    dbSNP Membership                             
   DP              1      Integer Approximate read depth; some reads may hav...
   DS              0      Flag    Were any of the samples downsampled?         
   END             1      Integer Stop position of the interval                
   FS              1      Float   Phred-scaled p-value using Fisher's exact ...
   HaplotypeScore  1      Float   Consistency of the site with at most two s...
   InbreedingCoeff 1      Float   Inbreeding coefficient as estimated from t...
   MLEAC           A      Integer Maximum likelihood expectation (MLE) for t...
   MLEAF           A      Float   Maximum likelihood expectation (MLE) for t...
   MQ              1      Float   RMS Mapping Quality                          
   MQ0             1      Integer Total Mapping Quality Zero Reads             
   MQRankSum       1      Float   Z-score From Wilcoxon rank sum test of Alt...
   QD              1      Float   Variant Confidence/Quality by Depth          
   ReadPosRankSum  1      Float   Z-score from Wilcoxon rank sum test of Alt...
   SOR             1      Float   Symmetric Odds Ratio of 2x2 contingency ta...
  SimpleList of length 8: GT, AD, DP, GQ, MIN_DP, PL, RGQ, SB
          Number Type    Description                                           
   GT     1      String  Genotype                                              
   AD     .      Integer Allelic depths for the ref and alt alleles in the o...
   DP     1      Integer Approximate read depth (reads with MQ=255 or with b...
   GQ     1      Integer Genotype Quality                                      
   MIN_DP 1      Integer Minimum DP observed within the GVCF block             
   PL     G      Integer Normalized, Phred-scaled likelihoods for genotypes ...
   RGQ    1      Integer Unconditional reference genotype confidence, encode...
   SB     4      Integer Per-sample component statistics which comprise the ...

And you can see that the geno slot is a SimpleList with 8 things, one of which is called AD, which is described as being the Allelic depths.

ADD REPLYlink written 18 months ago by James W. MacDonald48k

HI James, thank you, yes, of course, some of the VCF files (especially those generated by HaplotypeCaller, Varscan, or Mutect, just to name a few callers) do output the AD field in their output VCF file.

Here, the question is a bit different (sorry sorry that i have not described in a better way). More precisely:

1 -- the VCF file is generated by HaplotypeCaller and contains the GERMLINE variants (and indeed the VCF file contains the AD according to the GERMLINE sample)

2 -- from the GERMLINE VCF file, I would only need the POSITIONS and the NUCLEOTIDES for the corresponding ALLELES (i.e. A and C in the example:  chr19   61766    A    C).

3 -- what i would like to do, is to the take -- the POSITIONS of the VARIANTS from the GERMLINE VCF file, and -- the ALLELES (eg A and C in the example)  and -- based on a  new BAM file (from TUMOR) to compute the ALLELE FREQUENCIES of GERMLINE VARIANTS in the TUMOR sample (considering the TUMOR BAM file).

shall i re-post the question perhaps with a better description ? I believe that i could use Rsamtools in this regard, however my R code is missing an important detail, and I do not know how to fix it ;)





ADD REPLYlink modified 18 months ago • written 18 months ago by Bogdan520

I must be totally confused, because it seems to me that's exactly what you have done, except for the last part, which is converting the data you get from the BAM:

> head(x2)
  seqnames   pos nucleotide count         which_label
1    chr19 61766          A    35 chr19:61766-61766.1
2    chr19 61766          C     9 chr19:61766-61766.1
3    chr19 61774          A    40 chr19:61774-61774.2
4    chr19 61774          G     9 chr19:61774-61774.2

Into the table you desire, which as I said at the outset is a simple data munging task.


ADD REPLYlink written 18 months ago by James W. MacDonald48k

Hi James, yes, you are right ... sorry, I was working late at night yesterday and I have not seen that the VARIANTS are grouped function of 2 consecutive lines.

Now, after a coffee, and after talking with you, everything is more clear ;)

thanks a lot for your help, and patience, and replies, and have a happy Easter time ;) !

ADD REPLYlink modified 18 months ago • written 18 months ago by Bogdan520
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 396 users visited in the last hour