Search
Question: extracting Allele Read Counts - part2
0
15 months ago by
Bogdan490
Palo Alto, CA, USA
Bogdan490 wrote:

Dear all,

following an older conversation (https://support.bioconductor.org/p/80977/), I would like to ask for your help on the following task : given a VCF file, I would like to extract from a BAM file :

a) number of READS / REF_ALLELE, and b) number of READS/ ALT_ALLELE .

I am convinced that there is a way to achieve it with Rsamtools; the issue is that my R code (below) is missing something. Any help would be very welcome and highly appreciated.

library("VariantAnnotation")
library("Rsamtools")

vcf_granges <- rowRanges(vcf)

sbp = ScanBamParam(which=vcf_granges)

bam_file <- "8T.intervals_chr19.IR.RC.bam"

pup = PileupParam(max_depth=1000,
min_base_quality=13,
min_mapq=0,
min_nucleotide_depth=1,
min_minor_allele_depth=0,
distinguish_strands=FALSE,
distinguish_nucleotides=TRUE,
ignore_query_Ns=TRUE,
include_deletions=FALSE,
include_insertions=FALSE)

x2 <- pileup(bam_file, scanBamParam=sbp, pileupParam=pup)

The results look in the following way :

seqnames   pos nucleotide count         which_label
1    chr19 61766          A    35 chr19:61766-61766.1
2    chr19 61766          C     9 chr19:61766-61766.1
3    chr19 61774          A    40 chr19:61774-61774.2
4    chr19 61774          G     9 chr19:61774-61774.2
5    chr19 61799          G     7 chr19:61799-61799.3
6    chr19 61799          T    40 chr19:61799-61799.3

And what i would like to have at the end is a TABLE with :

 CHR position refAllele altAllele refCount altCount totalCount chr19 61766 A C 36 10 46 chr19 61774 A G 40 9 49 chr19 61799 T G 43 7 50 chr19 62155 A G 49 11 60 chr19 71178 G C 3 8 11 chr19 80384 C A 6 10 16

modified 15 months ago by James W. MacDonald46k • written 15 months ago by Bogdan490
0
15 months ago by
United States
James W. MacDonald46k wrote:

This isn't a Bioconductor question. You have some data and you want to be able to re-format it into a different structure. There are any number of ways to do that, but none require a Bioconductor package. People seem to like Hadley Whickam's packages for that sort of thing, so you could look at, um, dplyr or reshape or something. Or you could probably use data.table, or just basic R functions. I would recommend figuring this out on your own, but if you really just want somebody to tell you how to do it, you could try stackexchange or R-help.

Hi James, great to hear from you, and thank you for your feedback. i could easily do the re-formatting, do not worry, if I may rephrase the question a bit please : it would be wonderful if you could please help :

-- in the vcf file, let's consider a variant :  chr19   61766    A    C

-- i would like to know how many reads on this position have the allele A, and how many reads have the allele C

-- at this moment, my script only gives the number of reads that have A (i.e. 35 reads)

--'ve specified in PileUpParam : "distinguish_nucleotides=TRUE", although it did not have an effect (apparently)

how shall i do it ? thank you very much ;) !

1

Well, your question is pretty unclear. You say 'given a VCF, I would like to extract from a BAM file', which doesn't indicate that you want to get these data from the VCF. In addition, you don't use the VCF for anything but getting the variant positions.

In addition, your script DOES give you the read counts at each position - it just happens to come from the BAM file.

What you can get from the VCF is dependent on what is in the VCF, and you don't show that. So here is one I recently worked with:

> vcf
class: CollapsedVCF
dim: 46454 2
rowRanges(vcf):
GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 20 columns: AC, AF, AN, BaseQRankSum, ClippingRankSum, DB, ...
Number Type    Description
AC              A      Integer Allele count in genotypes, for each ALT al...
AF              A      Float   Allele Frequency, for each ALT allele, in ...
AN              1      Integer Total number of alleles in called genotypes
BaseQRankSum    1      Float   Z-score from Wilcoxon rank sum test of Alt...
ClippingRankSum 1      Float   Z-score From Wilcoxon rank sum test of Alt...
DB              0      Flag    dbSNP Membership
DS              0      Flag    Were any of the samples downsampled?
END             1      Integer Stop position of the interval
FS              1      Float   Phred-scaled p-value using Fisher's exact ...
HaplotypeScore  1      Float   Consistency of the site with at most two s...
InbreedingCoeff 1      Float   Inbreeding coefficient as estimated from t...
MLEAC           A      Integer Maximum likelihood expectation (MLE) for t...
MLEAF           A      Float   Maximum likelihood expectation (MLE) for t...
MQ              1      Float   RMS Mapping Quality
MQ0             1      Integer Total Mapping Quality Zero Reads
MQRankSum       1      Float   Z-score From Wilcoxon rank sum test of Alt...
QD              1      Float   Variant Confidence/Quality by Depth
ReadPosRankSum  1      Float   Z-score from Wilcoxon rank sum test of Alt...
SOR             1      Float   Symmetric Odds Ratio of 2x2 contingency ta...
geno(vcf):
SimpleList of length 8: GT, AD, DP, GQ, MIN_DP, PL, RGQ, SB
Number Type    Description
GT     1      String  Genotype
AD     .      Integer Allelic depths for the ref and alt alleles in the o...
DP     1      Integer Approximate read depth (reads with MQ=255 or with b...
GQ     1      Integer Genotype Quality
MIN_DP 1      Integer Minimum DP observed within the GVCF block
PL     G      Integer Normalized, Phred-scaled likelihoods for genotypes ...
RGQ    1      Integer Unconditional reference genotype confidence, encode...
SB     4      Integer Per-sample component statistics which comprise the ...

And you can see that the geno slot is a SimpleList with 8 things, one of which is called AD, which is described as being the Allelic depths.

HI James, thank you, yes, of course, some of the VCF files (especially those generated by HaplotypeCaller, Varscan, or Mutect, just to name a few callers) do output the AD field in their output VCF file.

Here, the question is a bit different (sorry sorry that i have not described in a better way). More precisely:

1 -- the VCF file is generated by HaplotypeCaller and contains the GERMLINE variants (and indeed the VCF file contains the AD according to the GERMLINE sample)

2 -- from the GERMLINE VCF file, I would only need the POSITIONS and the NUCLEOTIDES for the corresponding ALLELES (i.e. A and C in the example:  chr19   61766    A    C).

3 -- what i would like to do, is to the take -- the POSITIONS of the VARIANTS from the GERMLINE VCF file, and -- the ALLELES (eg A and C in the example)  and -- based on a  new BAM file (from TUMOR) to compute the ALLELE FREQUENCIES of GERMLINE VARIANTS in the TUMOR sample (considering the TUMOR BAM file).

shall i re-post the question perhaps with a better description ? I believe that i could use Rsamtools in this regard, however my R code is missing an important detail, and I do not know how to fix it ;)

1

I must be totally confused, because it seems to me that's exactly what you have done, except for the last part, which is converting the data you get from the BAM:

> head(x2)
seqnames   pos nucleotide count         which_label
1    chr19 61766          A    35 chr19:61766-61766.1
2    chr19 61766          C     9 chr19:61766-61766.1
3    chr19 61774          A    40 chr19:61774-61774.2
4    chr19 61774          G     9 chr19:61774-61774.2

Into the table you desire, which as I said at the outset is a simple data munging task.

Hi James, yes, you are right ... sorry, I was working late at night yesterday and I have not seen that the VARIANTS are grouped function of 2 consecutive lines.

Now, after a coffee, and after talking with you, everything is more clear ;)

thanks a lot for your help, and patience, and replies, and have a happy Easter time ;) !