Filtering BAM files by start position for VariantTools

0

Entering edit mode

Taylor, Sean D ▴ 250

@taylor-sean-d-5800

Last seen 9.6 years ago

I am trying to read a specific set of records from a bam file for use in the VariantTools package. I'm trying to construct a which argument (a GRanges object) that will pull in a set of records from all reads that only start at a specified position. (i.e. all reads that start at position 100). So far I have only been able to specify reads that overlap position 100, but have not been able to find a way to define the start site. #Example code: > bams <- LungCancerLines::LungCancerBamFiles() > bam <- bams$H1993 > which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+') > tally.param <- VariantTallyParam(gmapR::TP53Genome(), + readlen = 100L, + high_base_quality = 23L, + which = which) > raw.variants <- tallyVariants(bam, tally.param) This code shows all the variants at position 1110426, but not all the variants from the reads that start at position 1110426. Ultimately, I am trying to do this for all start positions in my data set, so I would want something that looks like this pseudocode: > raw.variants<-lapply (start(bam), function (x){ which<-GRanges(seqnames=c('chrM'), '+', start=x) tally.param<-VariantTallyParam(gmap, readlen=100L, which=which) tallyVariants(bamfile, tally.param) }) Thanks, Sean > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid parallel stats graphics grDevices utils datasets methods base other attached packages: [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2 [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 AnnotationDbi_1.22.5 [7] Biobase_2.20.0 gmapR_1.2.0 latticeExtra_0.6-24 [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8 [13] ade4_1.5-2 VariantTools_1.2.2 VariantAnnotation_1.6.6 [16] Rsamtools_1.12.3 Biostrings_2.28.0 GenomicRanges_1.12.4 [19] IRanges_1.18.1 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] biomaRt_2.16.0 bitops_1.0-5 [3] BSgenome_1.28.0 BSgenome.Hsapiens.UCSC.hg19_1.3.19 [5] graph_1.38.2 Matrix_1.0-12 [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2 [9] RCurl_1.95-4.1 rtracklayer_1.20.2 [11] stats4_3.0.1 tools_3.0.1 [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 [15] zlibbioc_1.6.0 [[alternative HTML version deleted]]

VariantTools VariantTools • 2.2k views

ADD COMMENT • link updated 10.8 years ago by Valerie Obenchain ★ 6.8k • written 10.8 years ago by Taylor, Sean D ▴ 250

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi Sean, As you've discovered, the 'which' in the 'param' (for reading bam files) specifies positions to overlap, not start or end. One approach to isolating reads with a specific starting position would be to filter the bam by 'pos'. library(VariantTools) fl <- LungCancerLines::LungCancerBamFiles()$H1993 mystart <- 1110426 filt <- list(setStart=function(x) x$pos %in% mystart) dest <- tempfile() filterBam(fl, dest, filter=FilterRules(filt)) scn <- scanBam(dest) Confirm all reads start with 'mystart': > table(scn[[1]]$pos) 1110426 2388 If you want a tally of all nucleotides for all sequences starting with 'mystart' then no need to supply a 'which': param <- VariantTallyParam(gmapR::TP53Genome(), readlen=100L, high_base_quality=23L) tally <- tallyVariants(fl, param) Valerie On 07/09/2013 02:06 PM, Taylor, Sean D wrote: > I am trying to read a specific set of records from a bam file for use in the VariantTools package. I'm trying to construct a which argument (a GRanges object) that will pull in a set of records from all reads that only start at a specified position. (i.e. all reads that start at position 100). So far I have only been able to specify reads that overlap position 100, but have not been able to find a way to define the start site. > > #Example code: >> bams <- LungCancerLines::LungCancerBamFiles() >> bam <- bams$H1993 >> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+') >> tally.param <- VariantTallyParam(gmapR::TP53Genome(), > + readlen = 100L, > + high_base_quality = 23L, > + which = which) >> raw.variants <- tallyVariants(bam, tally.param) > > This code shows all the variants at position 1110426, but not all the variants from the reads that start at position 1110426. > > Ultimately, I am trying to do this for all start positions in my data set, so I would want something that looks like this pseudocode: >> raw.variants<-lapply (start(bam), function (x){ > which<-GRanges(seqnames=c('chrM'), '+', start=x) > tally.param<-VariantTallyParam(gmap, readlen=100L, which=which) > tallyVariants(bamfile, tally.param) > }) > > Thanks, > Sean > >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] grid parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2 > [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 AnnotationDbi_1.22.5 > [7] Biobase_2.20.0 gmapR_1.2.0 latticeExtra_0.6-24 > [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8 > [13] ade4_1.5-2 VariantTools_1.2.2 VariantAnnotation_1.6.6 > [16] Rsamtools_1.12.3 Biostrings_2.28.0 GenomicRanges_1.12.4 > [19] IRanges_1.18.1 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.16.0 bitops_1.0-5 > [3] BSgenome_1.28.0 BSgenome.Hsapiens.UCSC.hg19_1.3.19 > [5] graph_1.38.2 Matrix_1.0-12 > [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2 > [9] RCurl_1.95-4.1 rtracklayer_1.20.2 > [11] stats4_3.0.1 tools_3.0.1 > [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 > [15] zlibbioc_1.6.0 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 10.8 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Sorry, the file here should be the filtered 'dest', not 'fl': > tally <- tallyVariants(fl, param) > > Valerie > Valerie > > > On 07/09/2013 02:06 PM, Taylor, Sean D wrote: >> I am trying to read a specific set of records from a bam file for use >> in the VariantTools package. I'm trying to construct a which argument >> (a GRanges object) that will pull in a set of records from all reads >> that only start at a specified position. (i.e. all reads that start at >> position 100). So far I have only been able to specify reads that >> overlap position 100, but have not been able to find a way to define >> the start site. >> >> #Example code: >>> bams <- LungCancerLines::LungCancerBamFiles() >>> bam <- bams$H1993 >>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+') >>> tally.param <- VariantTallyParam(gmapR::TP53Genome(), >> + readlen = 100L, >> + high_base_quality = 23L, >> + which = which) >>> raw.variants <- tallyVariants(bam, tally.param) >> >> This code shows all the variants at position 1110426, but not all the >> variants from the reads that start at position 1110426. >> >> Ultimately, I am trying to do this for all start positions in my data >> set, so I would want something that looks like this pseudocode: >>> raw.variants<-lapply (start(bam), function (x){ >> which<-GRanges(seqnames=c('chrM'), '+', start=x) >> tally.param<-VariantTallyParam(gmap, readlen=100L, which=which) >> tallyVariants(bamfile, tally.param) >> }) >> >> Thanks, >> Sean >> >>> sessionInfo() >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> LC_TIME=en_US.UTF-8 >> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 >> LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C >> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 >> LC_IDENTIFICATION=C >> >> attached base packages: >> [1] grid parallel stats graphics grDevices utils >> datasets methods base >> >> other attached packages: >> [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2 >> [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 >> AnnotationDbi_1.22.5 >> [7] Biobase_2.20.0 gmapR_1.2.0 latticeExtra_0.6-24 >> [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8 >> [13] ade4_1.5-2 VariantTools_1.2.2 >> VariantAnnotation_1.6.6 >> [16] Rsamtools_1.12.3 Biostrings_2.28.0 GenomicRanges_1.12.4 >> [19] IRanges_1.18.1 BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.16.0 bitops_1.0-5 >> [3] BSgenome_1.28.0 >> BSgenome.Hsapiens.UCSC.hg19_1.3.19 >> [5] graph_1.38.2 Matrix_1.0-12 >> [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2 >> [9] RCurl_1.95-4.1 rtracklayer_1.20.2 >> [11] stats4_3.0.1 tools_3.0.1 >> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 >> [15] zlibbioc_1.6.0 >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 10.8 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thanks Valerie, I'll give that a try. And I see that you just clarified my question about the input file for tallyVariants. I had thought about doing this before but hadn't been able to figure out the proper filter syntax. Thanks! The only downside to this approach is that I ultimately want to do this for all unique start sites that align to my reference. That could entail generating thousands of tempfiles that all have to be read back in. It could be done with lapply, but it sounds rather memory and time intensive. Another approach that I tried was reading the bam file as a GappedAlignments object and then splitting it by start position into a GAlignmentsList. That was really easy, but so far as I can tell tallyVariants will not accept GappedAlignments objects as an input. I wonder if there is another way to vectorize this approach? Thanks, Sean -----Original Message----- From: Valerie Obenchain [mailto:vobencha@fhcrc.org] Sent: Thursday, July 11, 2013 10:02 AM To: Taylor, Sean D Cc: bioconductor at r-project.org; Michael Lawrence Subject: Re: [BioC] Filtering BAM files by start position for VariantTools Hi Sean, As you've discovered, the 'which' in the 'param' (for reading bam files) specifies positions to overlap, not start or end. One approach to isolating reads with a specific starting position would be to filter the bam by 'pos'. library(VariantTools) fl <- LungCancerLines::LungCancerBamFiles()$H1993 mystart <- 1110426 filt <- list(setStart=function(x) x$pos %in% mystart) dest <- tempfile() filterBam(fl, dest, filter=FilterRules(filt)) scn <- scanBam(dest) Confirm all reads start with 'mystart': > table(scn[[1]]$pos) 1110426 2388 If you want a tally of all nucleotides for all sequences starting with 'mystart' then no need to supply a 'which': param <- VariantTallyParam(gmapR::TP53Genome(), readlen=100L, high_base_quality=23L) tally <- tallyVariants(fl, param) Valerie On 07/09/2013 02:06 PM, Taylor, Sean D wrote: > I am trying to read a specific set of records from a bam file for use in the VariantTools package. I'm trying to construct a which argument (a GRanges object) that will pull in a set of records from all reads that only start at a specified position. (i.e. all reads that start at position 100). So far I have only been able to specify reads that overlap position 100, but have not been able to find a way to define the start site. > > #Example code: >> bams <- LungCancerLines::LungCancerBamFiles() >> bam <- bams$H1993 >> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+') >> tally.param <- VariantTallyParam(gmapR::TP53Genome(), > + readlen = 100L, > + high_base_quality = 23L, > + which = which) >> raw.variants <- tallyVariants(bam, tally.param) > > This code shows all the variants at position 1110426, but not all the variants from the reads that start at position 1110426. > > Ultimately, I am trying to do this for all start positions in my data set, so I would want something that looks like this pseudocode: >> raw.variants<-lapply (start(bam), function (x){ > which<-GRanges(seqnames=c('chrM'), '+', start=x) > tally.param<-VariantTallyParam(gmap, readlen=100L, which=which) > tallyVariants(bamfile, tally.param) > }) > > Thanks, > Sean > >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] grid parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2 > [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 AnnotationDbi_1.22.5 > [7] Biobase_2.20.0 gmapR_1.2.0 latticeExtra_0.6-24 > [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8 > [13] ade4_1.5-2 VariantTools_1.2.2 VariantAnnotation_1.6.6 > [16] Rsamtools_1.12.3 Biostrings_2.28.0 GenomicRanges_1.12.4 > [19] IRanges_1.18.1 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.16.0 bitops_1.0-5 > [3] BSgenome_1.28.0 BSgenome.Hsapiens.UCSC.hg19_1.3.19 > [5] graph_1.38.2 Matrix_1.0-12 > [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2 > [9] RCurl_1.95-4.1 rtracklayer_1.20.2 > [11] stats4_3.0.1 tools_3.0.1 > [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 > [15] zlibbioc_1.6.0 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 10.8 years ago Taylor, Sean D ▴ 250

0

Entering edit mode

As an fyi, Martin just checked in a change to Rsamtools (devel) that allows filtering by range. This should speed up the filtering substantially. I'll use the new syntax in the filter example below. I'm not sure of the best way to tallyVariants for all groups of reads for each unique start position. Here's an approach I might take - maybe others will chime in with their ideas. Start by reading in the bam and identify the unique starts. library(Rsamtools) fl <- system.file("extdata", "ex1.bam", package="Rsamtools") gal <- readGAlignments(fl) starts <- unique(start(gal)) mclapply() over the seqlevels containing unique starts. (code not tested) lst <- split(start(gal), as.factor(seqnames(gal))) starts <- lapply(lst, unique) vtparam <- VariantTallyParam(...) mclapply(starts, function(start, rname, fl, vtparam) { ## 'what' imports only the fields used in the filter ## 'which' is the position(s) to overlap param <- ScanBamParam( what=c("rname", "pos"), which=GRanges(rname, IRanges(start, width=1))) filter <- FilterRules(list(atPos = function(x) { (x$rname %in% rname) & (x$pos %in% start)})) dest0 <- filterBam(fl, tempfile(), filter=filter, param=param) tallyVariants(dest0, vtparam)}, rname=names(starts), fl=fl, vtparam=vtparam) Rsamtools in devel is broken today but should be ok tomorrow. Valerie On 07/11/2013 10:27 AM, Taylor, Sean D wrote: > Thanks Valerie, I'll give that a try. And I see that you just clarified my question about the input file for tallyVariants. > > I had thought about doing this before but hadn't been able to figure out the proper filter syntax. Thanks! The only downside to this approach is that I ultimately want to do this for all unique start sites that align to my reference. That could entail generating thousands of tempfiles that all have to be read back in. It could be done with lapply, but it sounds rather memory and time intensive. Another approach that I tried was reading the bam file as a GappedAlignments object and then splitting it by start position into a GAlignmentsList. That was really easy, but so far as I can tell tallyVariants will not accept GappedAlignments objects as an input. I wonder if there is another way to vectorize this approach? > > Thanks, > Sean > > > -----Original Message----- > From: Valerie Obenchain [mailto:vobencha at fhcrc.org] > Sent: Thursday, July 11, 2013 10:02 AM > To: Taylor, Sean D > Cc: bioconductor at r-project.org; Michael Lawrence > Subject: Re: [BioC] Filtering BAM files by start position for VariantTools > > Hi Sean, > > As you've discovered, the 'which' in the 'param' (for reading bam files) specifies positions to overlap, not start or end. One approach to isolating reads with a specific starting position would be to filter the bam by 'pos'. > > library(VariantTools) > fl <- LungCancerLines::LungCancerBamFiles()$H1993 > > mystart <- 1110426 > filt <- list(setStart=function(x) x$pos %in% mystart) > dest <- tempfile() > filterBam(fl, dest, filter=FilterRules(filt)) > scn <- scanBam(dest) > > Confirm all reads start with 'mystart': > > > table(scn[[1]]$pos) > > 1110426 > 2388 > > If you want a tally of all nucleotides for all sequences starting with 'mystart' then no need to supply a 'which': > param <- VariantTallyParam(gmapR::TP53Genome(), > readlen=100L, > high_base_quality=23L) > tally <- tallyVariants(fl, param) > > > Valerie > > > On 07/09/2013 02:06 PM, Taylor, Sean D wrote: >> I am trying to read a specific set of records from a bam file for use in the VariantTools package. I'm trying to construct a which argument (a GRanges object) that will pull in a set of records from all reads that only start at a specified position. (i.e. all reads that start at position 100). So far I have only been able to specify reads that overlap position 100, but have not been able to find a way to define the start site. >> >> #Example code: >>> bams <- LungCancerLines::LungCancerBamFiles() >>> bam <- bams$H1993 >>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+') >>> tally.param <- VariantTallyParam(gmapR::TP53Genome(), >> + readlen = 100L, >> + high_base_quality = 23L, >> + which = which) >>> raw.variants <- tallyVariants(bam, tally.param) >> >> This code shows all the variants at position 1110426, but not all the variants from the reads that start at position 1110426. >> >> Ultimately, I am trying to do this for all start positions in my data set, so I would want something that looks like this pseudocode: >>> raw.variants<-lapply (start(bam), function (x){ >> which<-GRanges(seqnames=c('chrM'), '+', start=x) >> tally.param<-VariantTallyParam(gmap, readlen=100L, which=which) >> tallyVariants(bamfile, tally.param) >> }) >> >> Thanks, >> Sean >> >>> sessionInfo() >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 >> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C >> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] grid parallel stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2 >> [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 AnnotationDbi_1.22.5 >> [7] Biobase_2.20.0 gmapR_1.2.0 latticeExtra_0.6-24 >> [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8 >> [13] ade4_1.5-2 VariantTools_1.2.2 VariantAnnotation_1.6.6 >> [16] Rsamtools_1.12.3 Biostrings_2.28.0 GenomicRanges_1.12.4 >> [19] IRanges_1.18.1 BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.16.0 bitops_1.0-5 >> [3] BSgenome_1.28.0 BSgenome.Hsapiens.UCSC.hg19_1.3.19 >> [5] graph_1.38.2 Matrix_1.0-12 >> [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2 >> [9] RCurl_1.95-4.1 rtracklayer_1.20.2 >> [11] stats4_3.0.1 tools_3.0.1 >> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 >> [15] zlibbioc_1.6.0 >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >

ADD REPLY • link 10.8 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

On 07/11/2013 01:11 PM, Valerie Obenchain wrote: > As an fyi, Martin just checked in a change to Rsamtools (devel) that > allows filtering by range. This should speed up the filtering > substantially. I'll use the new syntax in the filter example below. > > I'm not sure of the best way to tallyVariants for all groups of reads > for each unique start position. Here's an approach I might take - maybe > others will chime in with their ideas. Start by reading in the bam and > identify the unique starts. > > library(Rsamtools) > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > gal <- readGAlignments(fl) > starts <- unique(start(gal)) > > mclapply() over the seqlevels containing unique starts. > (code not tested) Nope. This won't work. Trying to be too clever with the mclapply(). You could create two vectors, one of seqlevels and one of starts (unique combinations). Pass these to Map() and use the same logic as below. This would allow you to create the correct param, filter and get the tally results for each seqlevel-start combo. Valerie > > lst <- split(start(gal), as.factor(seqnames(gal))) > starts <- lapply(lst, unique) > vtparam <- VariantTallyParam(...) > mclapply(starts, > function(start, rname, fl, vtparam) { > ## 'what' imports only the fields used in the filter > ## 'which' is the position(s) to overlap > param <- ScanBamParam( > what=c("rname", "pos"), > which=GRanges(rname, > IRanges(start, width=1))) > filter <- FilterRules(list(atPos = function(x) { > (x$rname %in% rname) & > (x$pos %in% start)})) > dest0 <- filterBam(fl, tempfile(), > filter=filter, param=param) > tallyVariants(dest0, vtparam)}, > rname=names(starts), fl=fl, vtparam=vtparam) > > > Rsamtools in devel is broken today but should be ok tomorrow. > > Valerie > > On 07/11/2013 10:27 AM, Taylor, Sean D wrote: >> Thanks Valerie, I'll give that a try. And I see that you just >> clarified my question about the input file for tallyVariants. >> >> I had thought about doing this before but hadn't been able to figure >> out the proper filter syntax. Thanks! The only downside to this >> approach is that I ultimately want to do this for all unique start >> sites that align to my reference. That could entail generating >> thousands of tempfiles that all have to be read back in. It could be >> done with lapply, but it sounds rather memory and time intensive. >> Another approach that I tried was reading the bam file as a >> GappedAlignments object and then splitting it by start position into a >> GAlignmentsList. That was really easy, but so far as I can tell >> tallyVariants will not accept GappedAlignments objects as an input. I >> wonder if there is another way to vectorize this approach? >> >> Thanks, >> Sean >> >> >> -----Original Message----- >> From: Valerie Obenchain [mailto:vobencha at fhcrc.org] >> Sent: Thursday, July 11, 2013 10:02 AM >> To: Taylor, Sean D >> Cc: bioconductor at r-project.org; Michael Lawrence >> Subject: Re: [BioC] Filtering BAM files by start position for >> VariantTools >> >> Hi Sean, >> >> As you've discovered, the 'which' in the 'param' (for reading bam >> files) specifies positions to overlap, not start or end. One approach >> to isolating reads with a specific starting position would be to >> filter the bam by 'pos'. >> >> library(VariantTools) >> fl <- LungCancerLines::LungCancerBamFiles()$H1993 >> >> mystart <- 1110426 >> filt <- list(setStart=function(x) x$pos %in% mystart) >> dest <- tempfile() >> filterBam(fl, dest, filter=FilterRules(filt)) >> scn <- scanBam(dest) >> >> Confirm all reads start with 'mystart': >> >> > table(scn[[1]]$pos) >> >> 1110426 >> 2388 >> >> If you want a tally of all nucleotides for all sequences starting with >> 'mystart' then no need to supply a 'which': >> param <- VariantTallyParam(gmapR::TP53Genome(), >> readlen=100L, >> high_base_quality=23L) >> tally <- tallyVariants(fl, param) >> >> >> Valerie >> >> >> On 07/09/2013 02:06 PM, Taylor, Sean D wrote: >>> I am trying to read a specific set of records from a bam file for use >>> in the VariantTools package. I'm trying to construct a which argument >>> (a GRanges object) that will pull in a set of records from all reads >>> that only start at a specified position. (i.e. all reads that start >>> at position 100). So far I have only been able to specify reads that >>> overlap position 100, but have not been able to find a way to define >>> the start site. >>> >>> #Example code: >>>> bams <- LungCancerLines::LungCancerBamFiles() >>>> bam <- bams$H1993 >>>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+') >>>> tally.param <- VariantTallyParam(gmapR::TP53Genome(), >>> + readlen = 100L, >>> + high_base_quality = 23L, >>> + which = which) >>>> raw.variants <- tallyVariants(bam, tally.param) >>> >>> This code shows all the variants at position 1110426, but not all the >>> variants from the reads that start at position 1110426. >>> >>> Ultimately, I am trying to do this for all start positions in my data >>> set, so I would want something that looks like this pseudocode: >>>> raw.variants<-lapply (start(bam), function (x){ >>> which<-GRanges(seqnames=c('chrM'), '+', start=x) >>> tally.param<-VariantTallyParam(gmap, readlen=100L, which=which) >>> tallyVariants(bamfile, tally.param) >>> }) >>> >>> Thanks, >>> Sean >>> >>>> sessionInfo() >>> R version 3.0.1 (2013-05-16) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> LC_TIME=en_US.UTF-8 >>> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 >>> LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> LC_ADDRESS=C >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 >>> LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] grid parallel stats graphics grDevices utils >>> datasets methods base >>> >>> other attached packages: >>> [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2 >>> [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 >>> AnnotationDbi_1.22.5 >>> [7] Biobase_2.20.0 gmapR_1.2.0 >>> latticeExtra_0.6-24 >>> [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8 >>> [13] ade4_1.5-2 VariantTools_1.2.2 >>> VariantAnnotation_1.6.6 >>> [16] Rsamtools_1.12.3 Biostrings_2.28.0 >>> GenomicRanges_1.12.4 >>> [19] IRanges_1.18.1 BiocGenerics_0.6.0 >>> >>> loaded via a namespace (and not attached): >>> [1] biomaRt_2.16.0 bitops_1.0-5 >>> [3] BSgenome_1.28.0 >>> BSgenome.Hsapiens.UCSC.hg19_1.3.19 >>> [5] graph_1.38.2 Matrix_1.0-12 >>> [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2 >>> [9] RCurl_1.95-4.1 rtracklayer_1.20.2 >>> [11] stats4_3.0.1 tools_3.0.1 >>> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 >>> [15] zlibbioc_1.6.0 >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 10.8 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

The tallyVariants function allows binned counts by read position. So I think all you need to do here is create a bin for the first position [1,1] and consider only the counts from that. The only down-side is that the counts are not quality filtered. Unfortunately, there is no cross- tallying of read position and quality, so there's no way to get this without a significant change to bam_tally. Also note that tallyVariants only reports positions with at least one alt read; not sure if this matters for your use case. You can turn this off if you use the low-level gmapR interface to bam_tally. Michael On Thu, Jul 11, 2013 at 1:24 PM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > On 07/11/2013 01:11 PM, Valerie Obenchain wrote: > >> As an fyi, Martin just checked in a change to Rsamtools (devel) that >> allows filtering by range. This should speed up the filtering >> substantially. I'll use the new syntax in the filter example below. >> >> I'm not sure of the best way to tallyVariants for all groups of reads >> for each unique start position. Here's an approach I might take - maybe >> others will chime in with their ideas. Start by reading in the bam and >> identify the unique starts. >> >> library(Rsamtools) >> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >> gal <- readGAlignments(fl) >> starts <- unique(start(gal)) >> >> mclapply() over the seqlevels containing unique starts. >> (code not tested) >> > > Nope. This won't work. Trying to be too clever with the mclapply(). > > You could create two vectors, one of seqlevels and one of starts (unique > combinations). Pass these to Map() and use the same logic as below. This > would allow you to create the correct param, filter and get the tally > results for each seqlevel-start combo. > > Valerie > > >> lst <- split(start(gal), as.factor(seqnames(gal))) >> starts <- lapply(lst, unique) >> vtparam <- VariantTallyParam(...) >> mclapply(starts, >> function(start, rname, fl, vtparam) { >> ## 'what' imports only the fields used in the filter >> ## 'which' is the position(s) to overlap >> param <- ScanBamParam( >> what=c("rname", "pos"), >> which=GRanges(rname, >> IRanges(start, width=1))) >> filter <- FilterRules(list(atPos = function(x) { >> (x$rname %in% rname) & >> (x$pos %in% start)})) >> dest0 <- filterBam(fl, tempfile(), >> filter=filter, param=param) >> tallyVariants(dest0, vtparam)}, >> rname=names(starts), fl=fl, vtparam=vtparam) >> >> >> Rsamtools in devel is broken today but should be ok tomorrow. >> >> Valerie >> >> On 07/11/2013 10:27 AM, Taylor, Sean D wrote: >> >>> Thanks Valerie, I'll give that a try. And I see that you just >>> clarified my question about the input file for tallyVariants. >>> >>> I had thought about doing this before but hadn't been able to figure >>> out the proper filter syntax. Thanks! The only downside to this >>> approach is that I ultimately want to do this for all unique start >>> sites that align to my reference. That could entail generating >>> thousands of tempfiles that all have to be read back in. It could be >>> done with lapply, but it sounds rather memory and time intensive. >>> Another approach that I tried was reading the bam file as a >>> GappedAlignments object and then splitting it by start position into a >>> GAlignmentsList. That was really easy, but so far as I can tell >>> tallyVariants will not accept GappedAlignments objects as an input. I >>> wonder if there is another way to vectorize this approach? >>> >>> Thanks, >>> Sean >>> >>> >>> -----Original Message----- >>> From: Valerie Obenchain [mailto:vobencha@fhcrc.org] >>> Sent: Thursday, July 11, 2013 10:02 AM >>> To: Taylor, Sean D >>> Cc: bioconductor@r-project.org; Michael Lawrence >>> Subject: Re: [BioC] Filtering BAM files by start position for >>> VariantTools >>> >>> Hi Sean, >>> >>> As you've discovered, the 'which' in the 'param' (for reading bam >>> files) specifies positions to overlap, not start or end. One approach >>> to isolating reads with a specific starting position would be to >>> filter the bam by 'pos'. >>> >>> library(VariantTools) >>> fl <- LungCancerLines::**LungCancerBamFiles()$H1993 >>> >>> mystart <- 1110426 >>> filt <- list(setStart=function(x) x$pos %in% mystart) >>> dest <- tempfile() >>> filterBam(fl, dest, filter=FilterRules(filt)) >>> scn <- scanBam(dest) >>> >>> Confirm all reads start with 'mystart': >>> >>> > table(scn[[1]]$pos) >>> >>> 1110426 >>> 2388 >>> >>> If you want a tally of all nucleotides for all sequences starting with >>> 'mystart' then no need to supply a 'which': >>> param <- VariantTallyParam(gmapR::**TP53Genome(), >>> readlen=100L, >>> high_base_quality=23L) >>> tally <- tallyVariants(fl, param) >>> >>> >>> Valerie >>> >>> >>> On 07/09/2013 02:06 PM, Taylor, Sean D wrote: >>> >>>> I am trying to read a specific set of records from a bam file for use >>>> in the VariantTools package. I'm trying to construct a which argument >>>> (a GRanges object) that will pull in a set of records from all reads >>>> that only start at a specified position. (i.e. all reads that start >>>> at position 100). So far I have only been able to specify reads that >>>> overlap position 100, but have not been able to find a way to define >>>> the start site. >>>> >>>> #Example code: >>>> >>>>> bams <- LungCancerLines::**LungCancerBamFiles() >>>>> bam <- bams$H1993 >>>>> which<-GRanges(seqnames=c('**TP53'), IRanges(1110426, width=1), '+') >>>>> tally.param <- VariantTallyParam(gmapR::**TP53Genome(), >>>>> >>>> + readlen = 100L, >>>> + high_base_quality = 23L, >>>> + which = which) >>>> >>>>> raw.variants <- tallyVariants(bam, tally.param) >>>>> >>>> >>>> This code shows all the variants at position 1110426, but not all the >>>> variants from the reads that start at position 1110426. >>>> >>>> Ultimately, I am trying to do this for all start positions in my data >>>> set, so I would want something that looks like this pseudocode: >>>> >>>>> raw.variants<-lapply (start(bam), function (x){ >>>>> >>>> which<-GRanges(seqnames=c('**chrM'), '+', start=x) >>>> tally.param<-**VariantTallyParam(gmap, readlen=100L, which=which) >>>> tallyVariants(bamfile, tally.param) >>>> }) >>>> >>>> Thanks, >>>> Sean >>>> >>>> sessionInfo() >>>>> >>>> R version 3.0.1 (2013-05-16) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> LC_TIME=en_US.UTF-8 >>>> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 >>>> LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C >>>> LC_ADDRESS=C >>>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 >>>> LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] grid parallel stats graphics grDevices utils >>>> datasets methods base >>>> >>>> other attached packages: >>>> [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2 >>>> [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 >>>> AnnotationDbi_1.22.5 >>>> [7] Biobase_2.20.0 gmapR_1.2.0 >>>> latticeExtra_0.6-24 >>>> [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8 >>>> [13] ade4_1.5-2 VariantTools_1.2.2 >>>> VariantAnnotation_1.6.6 >>>> [16] Rsamtools_1.12.3 Biostrings_2.28.0 >>>> GenomicRanges_1.12.4 >>>> [19] IRanges_1.18.1 BiocGenerics_0.6.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] biomaRt_2.16.0 bitops_1.0-5 >>>> [3] BSgenome_1.28.0 >>>> BSgenome.Hsapiens.UCSC.hg19_1.**3.19 >>>> [5] graph_1.38.2 Matrix_1.0-12 >>>> [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2 >>>> [9] RCurl_1.95-4.1 rtracklayer_1.20.2 >>>> [11] stats4_3.0.1 tools_3.0.1 >>>> [13] TxDb.Hsapiens.UCSC.hg19.**knownGene_2.9.2 XML_3.96-1.1 >>>> [15] zlibbioc_1.6.0 >>>> >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________**_________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat="" .ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>> Search the archives: >>>> http://news.gmane.org/gmane.**science.biology.informatics.**condu ctor<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>>> >>>> >>> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.**science.biology.informatics.**conduct or<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >> > > [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Taylor, Not clear to me what you are trying to do exactly. FWIW here is some code that will group your read sequences (and qualities) by unique start position on the genome. It uses sequenceLayer() from the Rsamtools package (was added to devel recently). The code below has some similarities with the one I sent in that thread in June: https://stat.ethz.ch/pipermail/bioconductor/2013-June/053324.html (1) Load the BAM file into a GAlignments object. library(Rsamtools) bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") param <- ScanBamParam(what=c("seq", "qual")) gal <- readGAlignmentsFromBam(bamfile, use.names=TRUE, param=param) (2) Use sequenceLayer() to "lay" the query sequences and quality strings on the reference. qseq <- setNames(mcols(gal)$seq, names(gal)) qual <- setNames(mcols(gal)$qual, names(gal)) qseq_on_ref <- sequenceLayer(qseq, cigar(gal), from="query", to="reference") qual_on_ref <- sequenceLayer(qual, cigar(gal), from="query", to="reference") (3) Split by chromosome. qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal)) qual_on_ref_by_chrom <- splitAsList(qual_on_ref, seqnames(gal)) pos_by_chrom <- splitAsList(start(gal), seqnames(gal)) (4) For each chromosome generate one GRanges object that contains unique alignment start positions and attach 3 metadata columns to it: the number of reads, the query sequences, and the quality strings. gr_by_chrom <- lapply(seqlevels(gal), function(seqname) { qseq_on_ref2 <- qseq_on_ref_by_chrom[[seqname]] qual_on_ref2 <- qual_on_ref_by_chrom[[seqname]] pos2 <- pos_by_chrom[[seqname]] qseq_on_ref_per_pos <- split(qseq_on_ref2, pos2) qual_on_ref_per_pos <- split(qual_on_ref2, pos2) nread <- elementLengths(qseq_on_ref_per_pos) gr_mcols <- DataFrame(nread=unname(nread), qseq_on_ref=unname(qseq_on_ref_per_pos), qual_on_ref=unname(qual_on_ref_per_pos)) gr <- GRanges(Rle(seqname, nrow(gr_mcols)), IRanges(as.integer(names(nread)), width=1)) mcols(gr) <- gr_mcols seqlevels(gr) <- seqlevels(gal) gr }) (5) Combine all the GRanges objects obtained in (4) in 1 big GRanges object: gr <- do.call(c, gr_by_chrom) seqinfo(gr) <- seqinfo(gal) 'gr' is a GRanges object that contains unique alignment start positions: > gr[1:6] GRanges with 6 ranges and 3 metadata columns: seqnames ranges strand | nread <rle> <iranges> <rle> | <integer> [1] seq1 [ 1, 1] * | 1 [2] seq1 [ 3, 3] * | 1 [3] seq1 [ 5, 5] * | 1 [4] seq1 [ 6, 6] * | 1 [5] seq1 [ 9, 9] * | 1 [6] seq1 [13, 13] * | 2 qseq_on_ref <dnastringsetlist> [1] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG [2] CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT [3] AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC [4] GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT [5] GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG [6] ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA,ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCC AG qual_on_ref <bstringsetlist> [1] <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7 [2] <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6): [3] <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5 [4] (-&----,----)-)-),'--)---',+-,),''*, [5] <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5% [6] <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666,<<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;< 0; --- seqlengths: seq1 seq2 1575 1584 2 reads align to start position 13. Let's have a close look at their sequences: > mcols(gr)$qseq_on_ref[[6]] A DNAStringSet instance of length 2 width seq names [1] 35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA EAS56_61:6:18:467... [2] 36 ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAG EAS114_28:5:296:3... and their qualities: > mcols(gr)$qual_on_ref[[6]] A BStringSet instance of length 2 width seq names [1] 35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666 EAS56_61:6:18:467... [2] 36 <<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;<0; EAS114_28:5:296:3... Note that the sequence and quality strings are those projected to the reference so the first letter in those strings are on top of start position 13, the 2nd letter on top of position 14, etc... Not sure what you want to do from here though... H. On 07/11/2013 10:27 AM, Taylor, Sean D wrote: > Thanks Valerie, I'll give that a try. And I see that you just clarified my question about the input file for tallyVariants. > > I had thought about doing this before but hadn't been able to figure out the proper filter syntax. Thanks! The only downside to this approach is that I ultimately want to do this for all unique start sites that align to my reference. That could entail generating thousands of tempfiles that all have to be read back in. It could be done with lapply, but it sounds rather memory and time intensive. Another approach that I tried was reading the bam file as a GappedAlignments object and then splitting it by start position into a GAlignmentsList. That was really easy, but so far as I can tell tallyVariants will not accept GappedAlignments objects as an input. I wonder if there is another way to vectorize this approach? > > Thanks, > Sean > > > -----Original Message----- > From: Valerie Obenchain [mailto:vobencha at fhcrc.org] > Sent: Thursday, July 11, 2013 10:02 AM > To: Taylor, Sean D > Cc: bioconductor at r-project.org; Michael Lawrence > Subject: Re: [BioC] Filtering BAM files by start position for VariantTools > > Hi Sean, > > As you've discovered, the 'which' in the 'param' (for reading bam files) specifies positions to overlap, not start or end. One approach to isolating reads with a specific starting position would be to filter the bam by 'pos'. > > library(VariantTools) > fl <- LungCancerLines::LungCancerBamFiles()$H1993 > > mystart <- 1110426 > filt <- list(setStart=function(x) x$pos %in% mystart) > dest <- tempfile() > filterBam(fl, dest, filter=FilterRules(filt)) > scn <- scanBam(dest) > > Confirm all reads start with 'mystart': > > > table(scn[[1]]$pos) > > 1110426 > 2388 > > If you want a tally of all nucleotides for all sequences starting with 'mystart' then no need to supply a 'which': > param <- VariantTallyParam(gmapR::TP53Genome(), > readlen=100L, > high_base_quality=23L) > tally <- tallyVariants(fl, param) > > > Valerie > > > On 07/09/2013 02:06 PM, Taylor, Sean D wrote: >> I am trying to read a specific set of records from a bam file for use in the VariantTools package. I'm trying to construct a which argument (a GRanges object) that will pull in a set of records from all reads that only start at a specified position. (i.e. all reads that start at position 100). So far I have only been able to specify reads that overlap position 100, but have not been able to find a way to define the start site. >> >> #Example code: >>> bams <- LungCancerLines::LungCancerBamFiles() >>> bam <- bams$H1993 >>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+') >>> tally.param <- VariantTallyParam(gmapR::TP53Genome(), >> + readlen = 100L, >> + high_base_quality = 23L, >> + which = which) >>> raw.variants <- tallyVariants(bam, tally.param) >> >> This code shows all the variants at position 1110426, but not all the variants from the reads that start at position 1110426. >> >> Ultimately, I am trying to do this for all start positions in my data set, so I would want something that looks like this pseudocode: >>> raw.variants<-lapply (start(bam), function (x){ >> which<-GRanges(seqnames=c('chrM'), '+', start=x) >> tally.param<-VariantTallyParam(gmap, readlen=100L, which=which) >> tallyVariants(bamfile, tally.param) >> }) >> >> Thanks, >> Sean >> >>> sessionInfo() >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 >> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C >> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] grid parallel stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2 >> [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 AnnotationDbi_1.22.5 >> [7] Biobase_2.20.0 gmapR_1.2.0 latticeExtra_0.6-24 >> [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8 >> [13] ade4_1.5-2 VariantTools_1.2.2 VariantAnnotation_1.6.6 >> [16] Rsamtools_1.12.3 Biostrings_2.28.0 GenomicRanges_1.12.4 >> [19] IRanges_1.18.1 BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.16.0 bitops_1.0-5 >> [3] BSgenome_1.28.0 BSgenome.Hsapiens.UCSC.hg19_1.3.19 >> [5] graph_1.38.2 Matrix_1.0-12 >> [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2 >> [9] RCurl_1.95-4.1 rtracklayer_1.20.2 >> [11] stats4_3.0.1 tools_3.0.1 >> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 >> [15] zlibbioc_1.6.0 >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 10.8 years ago Hervé Pagès 16k

0

Entering edit mode

Hi all, Thanks for lots of great input and ideas. You have given me a lot to try out. In answer to Herve, here is my ultimate goal: I am trying to look at the dynamics of some specific point mutations. In my control set that I am working with, the point mutations I am interested in exist with known frequencies from 0.1 - 50%. One would think that with NGS deep coverage you could easily pick out low frequency events, but the PCR and sequencing error rates are high enough to effectively mask things below 10%. We are trying out an approach to try to reduce this noise. If your library is prepared through nextera (which generates random fragments followed by a few rounds of PCR) and provided that you have a large enough genome, then it is likely that reads that align to the same start position are descendant from the same template fragment and should have identical sequence. By analyzing these 'families' individually we hope to eliminate a lot of noise by discarding variants that don't exceed a certain threshold within their family. Hence I want to sort the reads by start position into these 'families', tally the variants in the families, filter out the noise (and possibly collapse the family into a single consensus sequence), and then do a global variant tally on what's left. VariantTools sort of addresses this problem by reporting the number of 'cycles' supporting each variant. The number of cycles should be analogous to the number of families, but I couldn't see that I could use this to do the sort of thing I want to do beyond filtering out things that don't occur in many families. Which isn't quite what I had in mind. I will have to explore Michael's suggestion about binning further. Is your suggestion to set something like which<-GRanges(seqnames=c('foo'), IRanges(1,1)) tally.param <- VariantTallyParam(gmapR::TP53Genome(), readlen = 100L, high_base_quality = 23L, which = which) raw.variants <- tallyVariants(bam, tally.param) It seems thought that this gave me all reads that overlap that position, not just the ones that started at this position. I'm afraid I'm still learning the ins and outs of the package, so I'm not quite sure how to accomplish what you suggested. Herve, it looks like your code does allow me to impose the cigar alignment on the sequences, which is something that I have really wanted to be able to do for a while now so I'm excited to try this out. I'm intrigued by your solution and it seems similar to my first approach which was to read the bam file in as a GappedAlignments object and then split it by start position. I wonder though if I will end up with the same problem that I originally ran into, which is that (correct me if I'm wrong) the tallyVariants function does not accept GappedAlignments objects as a valid input, but rather wants a BAM file. I suppose I could try using something like a pileup function and sort through it that way, but it would be handy to be able to use tallyVariants as it already does what I want if I can just get the input right. On the other hand, if I end up needing to collapse the families into a single consensus, then this might be the preferable route as I can directly manipulate the whole sequence here. Valerie, thanks for the follow-up codes. Mclapply() was a good thought. I will play around with your Map() suggestion and see if I can get it to work. Hopefully some combination of all of this will get me to where I want to go. Thanks! Sean -----Original Message----- From: Hervé Pagès [mailto:hpages@fhcrc.org] Sent: Thursday, July 11, 2013 5:15 PM To: Taylor, Sean D Cc: Obenchain, Valerie J; Michael Lawrence; bioconductor at r-project.org Subject: Re: [BioC] Filtering BAM files by start position for VariantTools Hi Taylor, Not clear to me what you are trying to do exactly. FWIW here is some code that will group your read sequences (and qualities) by unique start position on the genome. It uses sequenceLayer() from the Rsamtools package (was added to devel recently). The code below has some similarities with the one I sent in that thread in June: https://stat.ethz.ch/pipermail/bioconductor/2013-June/053324.html (1) Load the BAM file into a GAlignments object. library(Rsamtools) bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") param <- ScanBamParam(what=c("seq", "qual")) gal <- readGAlignmentsFromBam(bamfile, use.names=TRUE, param=param) (2) Use sequenceLayer() to "lay" the query sequences and quality strings on the reference. qseq <- setNames(mcols(gal)$seq, names(gal)) qual <- setNames(mcols(gal)$qual, names(gal)) qseq_on_ref <- sequenceLayer(qseq, cigar(gal), from="query", to="reference") qual_on_ref <- sequenceLayer(qual, cigar(gal), from="query", to="reference") (3) Split by chromosome. qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal)) qual_on_ref_by_chrom <- splitAsList(qual_on_ref, seqnames(gal)) pos_by_chrom <- splitAsList(start(gal), seqnames(gal)) (4) For each chromosome generate one GRanges object that contains unique alignment start positions and attach 3 metadata columns to it: the number of reads, the query sequences, and the quality strings. gr_by_chrom <- lapply(seqlevels(gal), function(seqname) { qseq_on_ref2 <- qseq_on_ref_by_chrom[[seqname]] qual_on_ref2 <- qual_on_ref_by_chrom[[seqname]] pos2 <- pos_by_chrom[[seqname]] qseq_on_ref_per_pos <- split(qseq_on_ref2, pos2) qual_on_ref_per_pos <- split(qual_on_ref2, pos2) nread <- elementLengths(qseq_on_ref_per_pos) gr_mcols <- DataFrame(nread=unname(nread), qseq_on_ref=unname(qseq_on_ref_per_pos), qual_on_ref=unname(qual_on_ref_per_pos)) gr <- GRanges(Rle(seqname, nrow(gr_mcols)), IRanges(as.integer(names(nread)), width=1)) mcols(gr) <- gr_mcols seqlevels(gr) <- seqlevels(gal) gr }) (5) Combine all the GRanges objects obtained in (4) in 1 big GRanges object: gr <- do.call(c, gr_by_chrom) seqinfo(gr) <- seqinfo(gal) 'gr' is a GRanges object that contains unique alignment start positions: > gr[1:6] GRanges with 6 ranges and 3 metadata columns: seqnames ranges strand | nread <rle> <iranges> <rle> | <integer> [1] seq1 [ 1, 1] * | 1 [2] seq1 [ 3, 3] * | 1 [3] seq1 [ 5, 5] * | 1 [4] seq1 [ 6, 6] * | 1 [5] seq1 [ 9, 9] * | 1 [6] seq1 [13, 13] * | 2 qseq_on_ref <dnastringsetlist> [1] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG [2] CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT [3] AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC [4] GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT [5] GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG [6] ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA,ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCC AG qual_on_ref <bstringsetlist> [1] <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7 [2] <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6): [3] <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5 [4] (-&----,----)-)-),'--)---',+-,),''*, [5] <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5% [6] <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666,<<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;< 0; --- seqlengths: seq1 seq2 1575 1584 2 reads align to start position 13. Let's have a close look at their sequences: > mcols(gr)$qseq_on_ref[[6]] A DNAStringSet instance of length 2 width seq names [1] 35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA EAS56_61:6:18:467... [2] 36 ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAG EAS114_28:5:296:3... and their qualities: > mcols(gr)$qual_on_ref[[6]] A BStringSet instance of length 2 width seq names [1] 35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666 EAS56_61:6:18:467... [2] 36 <<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;<0; EAS114_28:5:296:3... Note that the sequence and quality strings are those projected to the reference so the first letter in those strings are on top of start position 13, the 2nd letter on top of position 14, etc... Not sure what you want to do from here though... H. On 07/11/2013 10:27 AM, Taylor, Sean D wrote: > Thanks Valerie, I'll give that a try. And I see that you just clarified my question about the input file for tallyVariants. > > I had thought about doing this before but hadn't been able to figure out the proper filter syntax. Thanks! The only downside to this approach is that I ultimately want to do this for all unique start sites that align to my reference. That could entail generating thousands of tempfiles that all have to be read back in. It could be done with lapply, but it sounds rather memory and time intensive. Another approach that I tried was reading the bam file as a GappedAlignments object and then splitting it by start position into a GAlignmentsList. That was really easy, but so far as I can tell tallyVariants will not accept GappedAlignments objects as an input. I wonder if there is another way to vectorize this approach? > > Thanks, > Sean > > > -----Original Message----- > From: Valerie Obenchain [mailto:vobencha at fhcrc.org] > Sent: Thursday, July 11, 2013 10:02 AM > To: Taylor, Sean D > Cc: bioconductor at r-project.org; Michael Lawrence > Subject: Re: [BioC] Filtering BAM files by start position for > VariantTools > > Hi Sean, > > As you've discovered, the 'which' in the 'param' (for reading bam files) specifies positions to overlap, not start or end. One approach to isolating reads with a specific starting position would be to filter the bam by 'pos'. > > library(VariantTools) > fl <- LungCancerLines::LungCancerBamFiles()$H1993 > > mystart <- 1110426 > filt <- list(setStart=function(x) x$pos %in% mystart) > dest <- tempfile() > filterBam(fl, dest, filter=FilterRules(filt)) > scn <- scanBam(dest) > > Confirm all reads start with 'mystart': > > > table(scn[[1]]$pos) > > 1110426 > 2388 > > If you want a tally of all nucleotides for all sequences starting with 'mystart' then no need to supply a 'which': > param <- VariantTallyParam(gmapR::TP53Genome(), > readlen=100L, > high_base_quality=23L) > tally <- tallyVariants(fl, param) > > > Valerie > > > On 07/09/2013 02:06 PM, Taylor, Sean D wrote: >> I am trying to read a specific set of records from a bam file for use in the VariantTools package. I'm trying to construct a which argument (a GRanges object) that will pull in a set of records from all reads that only start at a specified position. (i.e. all reads that start at position 100). So far I have only been able to specify reads that overlap position 100, but have not been able to find a way to define the start site. >> >> #Example code: >>> bams <- LungCancerLines::LungCancerBamFiles() >>> bam <- bams$H1993 >>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+') >>> tally.param <- VariantTallyParam(gmapR::TP53Genome(), >> + readlen = 100L, >> + high_base_quality = 23L, >> + which = which) >>> raw.variants <- tallyVariants(bam, tally.param) >> >> This code shows all the variants at position 1110426, but not all the variants from the reads that start at position 1110426. >> >> Ultimately, I am trying to do this for all start positions in my data set, so I would want something that looks like this pseudocode: >>> raw.variants<-lapply (start(bam), function (x){ >> which<-GRanges(seqnames=c('chrM'), '+', start=x) >> tally.param<-VariantTallyParam(gmap, readlen=100L, which=which) >> tallyVariants(bamfile, tally.param) >> }) >> >> Thanks, >> Sean >> >>> sessionInfo() >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 >> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C >> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] grid parallel stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2 >> [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 AnnotationDbi_1.22.5 >> [7] Biobase_2.20.0 gmapR_1.2.0 latticeExtra_0.6-24 >> [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8 >> [13] ade4_1.5-2 VariantTools_1.2.2 VariantAnnotation_1.6.6 >> [16] Rsamtools_1.12.3 Biostrings_2.28.0 GenomicRanges_1.12.4 >> [19] IRanges_1.18.1 BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.16.0 bitops_1.0-5 >> [3] BSgenome_1.28.0 BSgenome.Hsapiens.UCSC.hg19_1.3.19 >> [5] graph_1.38.2 Matrix_1.0-12 >> [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2 >> [9] RCurl_1.95-4.1 rtracklayer_1.20.2 >> [11] stats4_3.0.1 tools_3.0.1 >> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 [15] >> zlibbioc_1.6.0 >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 10.8 years ago Taylor, Sean D ▴ 250

0

Entering edit mode

On Thu, Jul 11, 2013 at 11:17 PM, Taylor, Sean D <sdtaylor@fhcrc.org> wrote: > Hi all, > > Thanks for lots of great input and ideas. You have given me a lot to try > out. In answer to Herve, here is my ultimate goal: > I am trying to look at the dynamics of some specific point mutations. In > my control set that I am working with, the point mutations I am interested > in exist with known frequencies from 0.1 - 50%. One would think that with > NGS deep coverage you could easily pick out low frequency events, but the > PCR and sequencing error rates are high enough to effectively mask things > below 10%. We are trying out an approach to try to reduce this noise. If > your library is prepared through nextera (which generates random fragments > followed by a few rounds of PCR) and provided that you have a large enough > genome, then it is likely that reads that align to the same start position > are descendant from the same template fragment and should have identical > sequence. By analyzing these 'families' individually we hope to eliminate a > lot of noise by discarding variants that don't exceed a certain threshold > within their family. Hence I want to sort the reads by start position into > these 'families', tally the variants in the families, filter out the noise > (and possibly collapse the family into a single consensus sequence), and > then do a global variant tally on what's left. > > This is an interesting idea. Usually, we resolve PCR/optical duplicates with the Picard MarkDuplicates command, which just chooses the read with the highest average quality. This approximates what you want, I think. Also, I find it hard to believe that sequence errors would be causing you trouble at 10%, as long as you're filtering for quality and have decent coverage. PCR might in limited cases, and Picard effectively takes care of that. > VariantTools sort of addresses this problem by reporting the number of > 'cycles' supporting each variant. The number of cycles should be analogous > to the number of families, but I couldn't see that I could use this to do > the sort of thing I want to do beyond filtering out things that don't > occur in many families. Which isn't quite what I had in mind. I will have > to explore Michael's suggestion about binning further. Is your suggestion > to set something like > which<-GRanges(seqnames=c('foo'), IRanges(1,1)) > tally.param <- VariantTallyParam(gmapR::TP53Genome(), > readlen = 100L, > high_base_quality = 23L, > which = which) > raw.variants <- tallyVariants(bam, tally.param) > It seems thought that this gave me all reads that overlap that position, > not just the ones that started at this position. I'm afraid I'm still > learning the ins and outs of the package, so I'm not quite sure how to > accomplish what you suggested. > > What you want is to specify the cycleBreaks argument to VariantTallyParam. It allows you to define bins by cycle. So if you made bins for all 100bp, you could do things like generate a consensus by read family. A family corresponding to position X would be cycle bin #1 at X, cycle bin #2 at X+1, etc. Then just pick the alt (or ref) with the highest count. Might be tricky to implement efficiently for the whole genome. Of course, this assumes you've got single end reads, otherwise this won't work, because the mate is not considered. > Herve, it looks like your code does allow me to impose the cigar alignment > on the sequences, which is something that I have really wanted to be able > to do for a while now so I'm excited to try this out. I'm intrigued by > your solution and it seems similar to my first approach which was to read > the bam file in as a GappedAlignments object and then split it by start > position. I wonder though if I will end up with the same problem that I > originally ran into, which is that (correct me if I'm wrong) the > tallyVariants function does not accept GappedAlignments objects as a valid > input, but rather wants a BAM file. I suppose I could try using something > like a pileup function and sort through it that way, but it would be handy > to be able to use tallyVariants as it already does what I want if I can > just get the input right. On the other hand, if I end up needing to > collapse the families into a single consensus, then this might be the > preferable route as I can directly manipulate the whole sequence here. > > Valerie, thanks for the follow-up codes. Mclapply() was a good thought. I > will play around with your Map() suggestion and see if I can get it to > work. Hopefully some combination of all of this will get me to where I want > to go. > > Thanks! > Sean > > > > > -----Original Message----- > From: Hervé Pagès [mailto:hpages@fhcrc.org] > Sent: Thursday, July 11, 2013 5:15 PM > To: Taylor, Sean D > Cc: Obenchain, Valerie J; Michael Lawrence; bioconductor@r-project.org > Subject: Re: [BioC] Filtering BAM files by start position for VariantTools > > Hi Taylor, > > Not clear to me what you are trying to do exactly. FWIW here is some code > that will group your read sequences (and qualities) by unique start > position on the genome. It uses sequenceLayer() from the Rsamtools package > (was added to devel recently). The code below has some similarities with > the one I sent in that thread in June: > > https://stat.ethz.ch/pipermail/bioconductor/2013-June/053324.html > > (1) Load the BAM file into a GAlignments object. > > library(Rsamtools) > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") > param <- ScanBamParam(what=c("seq", "qual")) > gal <- readGAlignmentsFromBam(bamfile, use.names=TRUE, param=param) > > (2) Use sequenceLayer() to "lay" the query sequences and quality strings > on the reference. > > qseq <- setNames(mcols(gal)$seq, names(gal)) > qual <- setNames(mcols(gal)$qual, names(gal)) > qseq_on_ref <- sequenceLayer(qseq, cigar(gal), > from="query", to="reference") > qual_on_ref <- sequenceLayer(qual, cigar(gal), > from="query", to="reference") > > (3) Split by chromosome. > > qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal)) > qual_on_ref_by_chrom <- splitAsList(qual_on_ref, seqnames(gal)) > pos_by_chrom <- splitAsList(start(gal), seqnames(gal)) > > (4) For each chromosome generate one GRanges object that contains > unique alignment start positions and attach 3 metadata columns > to it: the number of reads, the query sequences, and the quality > strings. > > gr_by_chrom <- lapply(seqlevels(gal), > function(seqname) > { > qseq_on_ref2 <- qseq_on_ref_by_chrom[[seqname]] > qual_on_ref2 <- qual_on_ref_by_chrom[[seqname]] > pos2 <- pos_by_chrom[[seqname]] > qseq_on_ref_per_pos <- split(qseq_on_ref2, pos2) > qual_on_ref_per_pos <- split(qual_on_ref2, pos2) > nread <- elementLengths(qseq_on_ref_per_pos) > gr_mcols <- DataFrame(nread=unname(nread), > qseq_on_ref=unname(qseq_on_ref_per_pos), > qual_on_ref=unname(qual_on_ref_per_pos)) > gr <- GRanges(Rle(seqname, nrow(gr_mcols)), > IRanges(as.integer(names(nread)), width=1)) > mcols(gr) <- gr_mcols > seqlevels(gr) <- seqlevels(gal) > gr > }) > > (5) Combine all the GRanges objects obtained in (4) in 1 big GRanges > object: > > gr <- do.call(c, gr_by_chrom) > seqinfo(gr) <- seqinfo(gal) > > 'gr' is a GRanges object that contains unique alignment start positions: > > > gr[1:6] > GRanges with 6 ranges and 3 metadata columns: > seqnames ranges strand | nread > <rle> <iranges> <rle> | <integer> > [1] seq1 [ 1, 1] * | 1 > [2] seq1 [ 3, 3] * | 1 > [3] seq1 [ 5, 5] * | 1 > [4] seq1 [ 6, 6] * | 1 > [5] seq1 [ 9, 9] * | 1 > [6] seq1 [13, 13] * | 2 > > qseq_on_ref > > <dnastringsetlist> > [1] > CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG > [2] > CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT > [3] > AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC > [4] > GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT > [5] > GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG > [6] > ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA,ATTGTAAATGTGTGGTTTAACTCGTCCATGGC CCAG > > qual_on_ref > > <bstringsetlist> > [1] > <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7 > [2] > <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6): > [3] > <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5 > [4] > (-&----,----)-)-),'--)---',+-,),''*, > [5] > <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5% > [6] > <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666,<<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;< ;<0; > --- > seqlengths: > seq1 seq2 > 1575 1584 > > 2 reads align to start position 13. Let's have a close look at their > sequences: > > > mcols(gr)$qseq_on_ref[[6]] > A DNAStringSet instance of length 2 > width seq names > [1] 35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA > EAS56_61:6:18:467... > [2] 36 ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAG > EAS114_28:5:296:3... > > and their qualities: > > > mcols(gr)$qual_on_ref[[6]] > A BStringSet instance of length 2 > width seq names > [1] 35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666 > EAS56_61:6:18:467... > [2] 36 <<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;<0; > EAS114_28:5:296:3... > > Note that the sequence and quality strings are those projected to the > reference so the first letter in those strings are on top of start position > 13, the 2nd letter on top of position 14, etc... > > Not sure what you want to do from here though... > > H. > > On 07/11/2013 10:27 AM, Taylor, Sean D wrote: > > Thanks Valerie, I'll give that a try. And I see that you just clarified > my question about the input file for tallyVariants. > > > > I had thought about doing this before but hadn't been able to figure out > the proper filter syntax. Thanks! The only downside to this approach is > that I ultimately want to do this for all unique start sites that align to > my reference. That could entail generating thousands of tempfiles that all > have to be read back in. It could be done with lapply, but it sounds rather > memory and time intensive. Another approach that I tried was reading the > bam file as a GappedAlignments object and then splitting it by start > position into a GAlignmentsList. That was really easy, but so far as I can > tell tallyVariants will not accept GappedAlignments objects as an input. I > wonder if there is another way to vectorize this approach? > > > > Thanks, > > Sean > > > > > > -----Original Message----- > > From: Valerie Obenchain [mailto:vobencha@fhcrc.org] > > Sent: Thursday, July 11, 2013 10:02 AM > > To: Taylor, Sean D > > Cc: bioconductor@r-project.org; Michael Lawrence > > Subject: Re: [BioC] Filtering BAM files by start position for > > VariantTools > > > > Hi Sean, > > > > As you've discovered, the 'which' in the 'param' (for reading bam files) > specifies positions to overlap, not start or end. One approach to isolating > reads with a specific starting position would be to filter the bam by 'pos'. > > > > library(VariantTools) > > fl <- LungCancerLines::LungCancerBamFiles()$H1993 > > > > mystart <- 1110426 > > filt <- list(setStart=function(x) x$pos %in% mystart) > > dest <- tempfile() > > filterBam(fl, dest, filter=FilterRules(filt)) > > scn <- scanBam(dest) > > > > Confirm all reads start with 'mystart': > > > > > table(scn[[1]]$pos) > > > > 1110426 > > 2388 > > > > If you want a tally of all nucleotides for all sequences starting with > 'mystart' then no need to supply a 'which': > > param <- VariantTallyParam(gmapR::TP53Genome(), > > readlen=100L, > > high_base_quality=23L) > > tally <- tallyVariants(fl, param) > > > > > > Valerie > > > > > > On 07/09/2013 02:06 PM, Taylor, Sean D wrote: > >> I am trying to read a specific set of records from a bam file for use > in the VariantTools package. I'm trying to construct a which argument (a > GRanges object) that will pull in a set of records from all reads that only > start at a specified position. (i.e. all reads that start at position 100). > So far I have only been able to specify reads that overlap position 100, > but have not been able to find a way to define the start site. > >> > >> #Example code: > >>> bams <- LungCancerLines::LungCancerBamFiles() > >>> bam <- bams$H1993 > >>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+') > >>> tally.param <- VariantTallyParam(gmapR::TP53Genome(), > >> + readlen = 100L, > >> + high_base_quality = 23L, > >> + which = which) > >>> raw.variants <- tallyVariants(bam, tally.param) > >> > >> This code shows all the variants at position 1110426, but not all the > variants from the reads that start at position 1110426. > >> > >> Ultimately, I am trying to do this for all start positions in my data > set, so I would want something that looks like this pseudocode: > >>> raw.variants<-lapply (start(bam), function (x){ > >> which<-GRanges(seqnames=c('chrM'), '+', start=x) > >> tally.param<-VariantTallyParam(gmap, readlen=100L, which=which) > >> tallyVariants(bamfile, tally.param) > >> }) > >> > >> Thanks, > >> Sean > >> > >>> sessionInfo() > >> R version 3.0.1 (2013-05-16) > >> Platform: x86_64-unknown-linux-gnu (64-bit) > >> > >> locale: > >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > LC_TIME=en_US.UTF-8 > >> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 > LC_MESSAGES=en_US.UTF-8 > >> [7] LC_PAPER=C LC_NAME=C > LC_ADDRESS=C > >> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > >> > >> attached base packages: > >> [1] grid parallel stats graphics grDevices utils > datasets methods base > >> > >> other attached packages: > >> [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2 > >> [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 > AnnotationDbi_1.22.5 > >> [7] Biobase_2.20.0 gmapR_1.2.0 > latticeExtra_0.6-24 > >> [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8 > >> [13] ade4_1.5-2 VariantTools_1.2.2 > VariantAnnotation_1.6.6 > >> [16] Rsamtools_1.12.3 Biostrings_2.28.0 > GenomicRanges_1.12.4 > >> [19] IRanges_1.18.1 BiocGenerics_0.6.0 > >> > >> loaded via a namespace (and not attached): > >> [1] biomaRt_2.16.0 bitops_1.0-5 > >> [3] BSgenome_1.28.0 > BSgenome.Hsapiens.UCSC.hg19_1.3.19 > >> [5] graph_1.38.2 Matrix_1.0-12 > >> [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2 > >> [9] RCurl_1.95-4.1 rtracklayer_1.20.2 > >> [11] stats4_3.0.1 tools_3.0.1 > >> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 [15] > >> zlibbioc_1.6.0 > >> > >> > >> [[alternative HTML version deleted]] > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks Michael, This is an interesting idea. Usually, we resolve PCR/optical duplicates with the Picard MarkDuplicates command, which just chooses the read with the highest average quality. This approximates what you want, I think. Is the Picard MarkDuplicates command run by default? Or is this a separate command from a different package? Also, I find it hard to believe that sequence errors would be causing you trouble at 10%, as long as you're filtering for quality and have decent coverage. PCR might in limited cases, and Picard effectively takes care of that. 10% is fine, but we have problems at 1% and lower. I think even the defaults in tallyVariants() use 1% as a cutoff if I'm not mistaken. What you want is to specify the cycleBreaks argument to VariantTallyParam. It allows you to define bins by cycle. So if you made bins for all 100bp, you could do things like generate a consensus by read family. A family corresponding to position X would be cycle bin #1 at X, cycle bin #2 at X+1, etc. Then just pick the alt (or ref) with the highest count. Might be tricky to implement efficiently for the whole genome. Of course, this assumes you've got single end reads, otherwise this won't work, because the mate is not considered. Cool, I will try that out, that may be just what I was looking for. It might be tricky for the whole genome, but we are restricting ourselves to just the mitochondrial genome. Still gives us several thousand start sites to work from but should be easier than the whole genome. As for single end reads, can I just filter on the strand (i.e. '+' or '-')? Or will I have to perform separate alignments for each of the paired reads? [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Taylor, Sean D ▴ 250

0

Entering edit mode

On Sat, Jul 13, 2013 at 1:50 PM, Taylor, Sean D <sdtaylor@fhcrc.org> wrote: > Thanks Michael, **** > > This is an interesting idea. Usually, we resolve PCR/optical duplicates > with the Picard MarkDuplicates command, which just chooses the read with > the highest average quality. This approximates what you want, I think. *** > * > > Is the Picard MarkDuplicates command run by default? Or is this a separate > command from a different package? > It's a separate package, basically the Java implementation of samtools; you might that first and see how it improves your error rates, before taking a more complicated approach. > **** > > Also, I find it hard to believe that sequence errors would be causing you > trouble at 10%, as long as you're filtering for quality and have decent > coverage. PCR might in limited cases, and Picard effectively takes care of > that.**** > > 10% is fine, but we have problems at 1% and lower. I think even the > defaults in tallyVariants() use 1% as a cutoff if Im not mistaken.**** > > > It uses a ~4% cutoff. And yes, you'll start running into issues around 1%. Calling at such a low frequency is kind of ambitious. > **** > > What you want is to specify the cycleBreaks argument to VariantTallyParam. > It allows you to define bins by cycle. So if you made bins for all 100bp, > you could do things like generate a consensus by read family. A family > corresponding to position X would be cycle bin #1 at X, cycle bin #2 at > X+1, etc. Then just pick the alt (or ref) with the highest count. Might be > tricky to implement efficiently for the whole genome. Of course, this > assumes you've got single end reads, otherwise this won't work, because the > mate is not considered.**** > > Cool, I will try that out, that may be just what I was looking for. It > might be tricky for the whole genome, but we are restricting ourselves to > just the mitochondrial genome. Still gives us several thousand start sites > to work from but should be easier than the whole genome. As for single end > reads, can I just filter on the strand (i.e. + or -)? Or will I have to > perform separate alignments for each of the paired reads?**** > > > Now that I think about it, you should be fine just considering the alignments individually (as if they were unpaired). [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Michael Lawrence ★ 11k

0

Entering edit mode

In order to work through some of the code, I installed the devel version of R and updated all the packages. Now when I run tallyVariants() I get the following error message: Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : object 'castList' not found > traceback() 12: get(name, envir = asNamespace(pkg), inherits = FALSE) 11: IRanges:::castList 10: safe_mclapply(ind, function(i, ...) { FUN(gr[i], ...) }, ...) 9: applyByChromosome(si, bam_tally_region, mc.cores = mc.cores) 8: .local(x, ...) 7: tallyVariants(x, tally.param) 6: tallyVariants(x, tally.param) 5: .local(x, ...) 4: callVariants(BamFile(x), ...) 3: callVariants(BamFile(x), ...) 2: callVariants(destination, tally.param) 1: callVariants(destination, tally.param) Perhaps this is something missing/changed in the devel version of IRanges? Updated sessionInfo() below: R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] BiocInstaller_1.11.3 latticeExtra_0.6-24 lattice_0.20-15 [4] RColorBrewer_1.0-5 genoPlotR_0.8 ade4_1.5-2 [7] gmapR_1.3.2 VariantTools_1.3.2 VariantAnnotation_1.7.34 [10] Rsamtools_1.13.24 Biostrings_2.29.13 GenomicRanges_1.13.33 [13] XVector_0.1.0 IRanges_1.19.18 BiocGenerics_0.7.3 loaded via a namespace (and not attached): [1] AnnotationDbi_1.23.16 Biobase_2.20.0 biomaRt_2.17.2 [4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7 [7] GenomicFeatures_1.13.19 graph_1.39.3 Matrix_1.0-12 [10] RBGL_1.37.2 RCurl_1.95-4.1 RSQLite_0.11.4 [13] rtracklayer_1.20.2 stats4_3.0.1 tools_3.0.1 [16] XML_3.96-1.1 zlibbioc_1.6.0 > Thanks again, Sean From: Michael Lawrence [mailto:lawrence.michael@gene.com] Sent: Saturday, July 13, 2013 1:59 PM To: Taylor, Sean D Cc: Michael Lawrence; Pages, Herve; Obenchain, Valerie J; bioconductor@r-project.org Subject: Re: [BioC] Filtering BAM files by start position for VariantTools On Sat, Jul 13, 2013 at 1:50 PM, Taylor, Sean D <sdtaylor@fhcrc.org<mailto:sdtaylor@fhcrc.org>> wrote: Thanks Michael, This is an interesting idea. Usually, we resolve PCR/optical duplicates with the Picard MarkDuplicates command, which just chooses the read with the highest average quality. This approximates what you want, I think. Is the Picard MarkDuplicates command run by default? Or is this a separate command from a different package? It's a separate package, basically the Java implementation of samtools; you might that first and see how it improves your error rates, before taking a more complicated approach. Also, I find it hard to believe that sequence errors would be causing you trouble at 10%, as long as you're filtering for quality and have decent coverage. PCR might in limited cases, and Picard effectively takes care of that. 10% is fine, but we have problems at 1% and lower. I think even the defaults in tallyVariants() use 1% as a cutoff if I'm not mistaken. It uses a ~4% cutoff. And yes, you'll start running into issues around 1%. Calling at such a low frequency is kind of ambitious. What you want is to specify the cycleBreaks argument to VariantTallyParam. It allows you to define bins by cycle. So if you made bins for all 100bp, you could do things like generate a consensus by read family. A family corresponding to position X would be cycle bin #1 at X, cycle bin #2 at X+1, etc. Then just pick the alt (or ref) with the highest count. Might be tricky to implement efficiently for the whole genome. Of course, this assumes you've got single end reads, otherwise this won't work, because the mate is not considered. Cool, I will try that out, that may be just what I was looking for. It might be tricky for the whole genome, but we are restricting ourselves to just the mitochondrial genome. Still gives us several thousand start sites to work from but should be easier than the whole genome. As for single end reads, can I just filter on the strand (i.e. '+' or '-')? Or will I have to perform separate alignments for each of the paired reads? Now that I think about it, you should be fine just considering the alignments individually (as if they were unpaired). [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Taylor, Sean D ▴ 250

0

Entering edit mode

The necessary update to VariantTools will propagate soon to devel. Or you can grab it from svn. Michael On Tue, Jul 16, 2013 at 3:27 PM, Taylor, Sean D <sdtaylor@fhcrc.org> wrote: > In order to work through some of the code, I installed the devel version > of R and updated all the packages. Now when I run tallyVariants() I get the > following error message:**** > > ** ** > > Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : **** > > object 'castList' not found**** > > ** ** > > > traceback()**** > > 12: get(name, envir = asNamespace(pkg), inherits = FALSE)**** > > 11: IRanges:::castList**** > > 10: safe_mclapply(ind, function(i, ...) {**** > > FUN(gr[i], ...)**** > > }, ...)**** > > 9: applyByChromosome(si, bam_tally_region, mc.cores = mc.cores)**** > > 8: .local(x, ...)**** > > 7: tallyVariants(x, tally.param)**** > > 6: tallyVariants(x, tally.param)**** > > 5: .local(x, ...)**** > > 4: callVariants(BamFile(x), ...)**** > > 3: callVariants(BamFile(x), ...)**** > > 2: callVariants(destination, tally.param)**** > > 1: callVariants(destination, tally.param)**** > > ** ** > > Perhaps this is something missing/changed in the devel version of IRanges? > **** > > ** ** > > Updated sessionInfo() below:**** > > ** ** > > R version 3.0.1 (2013-05-16)**** > > Platform: x86_64-unknown-linux-gnu (64-bit)**** > > ** ** > > locale:**** > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C **** > > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 **** > > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 **** > > [7] LC_PAPER=C LC_NAME=C **** > > [9] LC_ADDRESS=C LC_TELEPHONE=C **** > > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C **** > > ** ** > > attached base packages:**** > > [1] grid parallel stats graphics grDevices utils datasets * > *** > > [8] methods base **** > > ** ** > > other attached packages:**** > > [1] BiocInstaller_1.11.3 latticeExtra_0.6-24 > lattice_0.20-15 **** > > [4] RColorBrewer_1.0-5 genoPlotR_0.8 > ade4_1.5-2 **** > > [7] gmapR_1.3.2 VariantTools_1.3.2 > VariantAnnotation_1.7.34**** > > [10] Rsamtools_1.13.24 Biostrings_2.29.13 > GenomicRanges_1.13.33 **** > > [13] XVector_0.1.0 IRanges_1.19.18 > BiocGenerics_0.7.3 **** > > ** ** > > loaded via a namespace (and not attached):**** > > [1] AnnotationDbi_1.23.16 Biobase_2.20.0 biomaRt_2.17.2 > **** > > [4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7 > **** > > [7] GenomicFeatures_1.13.19 graph_1.39.3 > Matrix_1.0-12 **** > > [10] RBGL_1.37.2 RCurl_1.95-4.1 > RSQLite_0.11.4 **** > > [13] rtracklayer_1.20.2 stats4_3.0.1 > tools_3.0.1 **** > > [16] XML_3.96-1.1 zlibbioc_1.6.0 **** > > >** ** > > ** ** > > Thanks again,**** > > Sean**** > > ** ** > > *From:* Michael Lawrence [mailto:lawrence.michael@gene.com] > *Sent:* Saturday, July 13, 2013 1:59 PM > *To:* Taylor, Sean D > *Cc:* Michael Lawrence; Pages, Herve; Obenchain, Valerie J; > bioconductor@r-project.org > > *Subject:* Re: [BioC] Filtering BAM files by start position for > VariantTools**** > > ** ** > > ** ** > > ** ** > > On Sat, Jul 13, 2013 at 1:50 PM, Taylor, Sean D <sdtaylor@fhcrc.org> > wrote:**** > > Thanks Michael, **** > > This is an interesting idea. Usually, we resolve PCR/optical duplicates > with the Picard MarkDuplicates command, which just chooses the read with > the highest average quality. This approximates what you want, I think. *** > * > > Is the Picard MarkDuplicates command run by default? Or is this a separate > command from a different package?**** > > ** ** > > It's a separate package, basically the Java implementation of samtools; > you might that first and see how it improves your error rates, before > taking a more complicated approach.**** > > Also, I find it hard to believe that sequence errors would be causing > you trouble at 10%, as long as you're filtering for quality and have decent > coverage. PCR might in limited cases, and Picard effectively takes care of > that.**** > > 10% is fine, but we have problems at 1% and lower. I think even the > defaults in tallyVariants() use 1% as a cutoff if Im not mistaken.**** > > **** > > ** ** > > It uses a ~4% cutoff. And yes, you'll start running into issues around 1%. > Calling at such a low frequency is kind of ambitious.**** > > **** > > What you want is to specify the cycleBreaks argument to > VariantTallyParam. It allows you to define bins by cycle. So if you made > bins for all 100bp, you could do things like generate a consensus by read > family. A family corresponding to position X would be cycle bin #1 at X, > cycle bin #2 at X+1, etc. Then just pick the alt (or ref) with the highest > count. Might be tricky to implement efficiently for the whole genome. Of > course, this assumes you've got single end reads, otherwise this won't > work, because the mate is not considered.**** > > Cool, I will try that out, that may be just what I was looking for. It > might be tricky for the whole genome, but we are restricting ourselves to > just the mitochondrial genome. Still gives us several thousand start sites > to work from but should be easier than the whole genome. As for single end > reads, can I just filter on the strand (i.e. + or -)? Or will I have to > perform separate alignments for each of the paired reads?**** > > **** > > ** ** > > Now that I think about it, you should be fine just considering the > alignments individually (as if they were unpaired).**** > > **** > > ** ** > [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Sean, On 07/11/2013 11:17 PM, Taylor, Sean D wrote: > Hi all, > > Thanks for lots of great input and ideas. You have given me a lot to try out. In answer to Herve, here is my ultimate goal: > I am trying to look at the dynamics of some specific point mutations. In my control set that I am working with, the point mutations I am interested in exist with known frequencies from 0.1 - 50%. One would think that with NGS deep coverage you could easily pick out low frequency events, but the PCR and sequencing error rates are high enough to effectively mask things below 10%. We are trying out an approach to try to reduce this noise. If your library is prepared through nextera (which generates random fragments followed by a few rounds of PCR) and provided that you have a large enough genome, then it is likely that reads that align to the same start position are descendant from the same template fragment and should have identical sequence. By analyzing these 'families' individually we hope to eliminate a lot of noise by discarding variants that don't exceed a certain threshold within their family. Hence I want to sort the reads by start position into these 'families', tall y the var iants in the families, filter out the noise (and possibly collapse the family into a single consensus sequence), and then do a global variant tally on what's left. > > VariantTools sort of addresses this problem by reporting the number of 'cycles' supporting each variant. The number of cycles should be analogous to the number of families, but I couldn't see that I could use this to do the sort of thing I want to do beyond filtering out things that don't occur in many families. Which isn't quite what I had in mind. I will have to explore Michael's suggestion about binning further. Is your suggestion to set something like > which<-GRanges(seqnames=c('foo'), IRanges(1,1)) > tally.param <- VariantTallyParam(gmapR::TP53Genome(), > readlen = 100L, > high_base_quality = 23L, > which = which) > raw.variants <- tallyVariants(bam, tally.param) > It seems thought that this gave me all reads that overlap that position, not just the ones that started at this position. I'm afraid I'm still learning the ins and outs of the package, so I'm not quite sure how to accomplish what you suggested. > > Herve, it looks like your code does allow me to impose the cigar alignment on the sequences, which is something that I have really wanted to be able to do for a while now so I'm excited to try this out. I'm intrigued by your solution and it seems similar to my first approach which was to read the bam file in as a GappedAlignments object and then split it by start position. I wonder though if I will end up with the same problem that I originally ran into, which is that (correct me if I'm wrong) the tallyVariants function does not accept GappedAlignments objects as a valid input, but rather wants a BAM file. I suppose I could try using something like a pileup function and sort through it that way, but it would be handy to be able to use tallyVariants as it already does what I want if I can just get the input right. On the other hand, if I end up needing to collapse the families into a single consensus, then this might be the preferable route as I can directly manipulate the whole se quence here. The tallyVariants() function does not accept a GAlignments object. I guess this is because the power horse behind it is the gmap/gsnap C code wrapped into the gmapR package. However if the next thing you want to do after discarding variants that don't exceed a certain threshold within their family is to extract a consensus sequence, then here is one way to get there (following up on the code I sent previously): (6) For each start position, remove reads with under-represented sequence (e.g. threshold = 20% for the toy data I'm using here which is low coverage). qseq_on_ref <- mcols(gr)$qseq_on_ref ## One trick here is to assign a unique number to unique ## sequences and to replace the sequences by this number. ## This is because step (7) is going to be easier (and a ## little bit faster) with numbers than with sequences. tmp <- unlist(qseq_on_ref, use.names=FALSE) qseq_on_ref_id <- relist(match(tmp, tmp), qseq_on_ref) Quick look at 'qseq_on_ref_id': > qseq_on_ref_id IntegerList of length 1934 [[1]] 1 [[2]] 2 [[3]] 3 [[4]] 4 [[5]] 5 [[6]] 6 7 [[7]] 8 [[8]] 9 [[9]] 10 11 [[10]] 12 ... <1924 more elements> It's an IntegerList object with the same length and "shape" as 'qseq_on_ref'. (7) Remove under represented ids from each list element of 'qseq_on_ref_id': ## This implementation will be very slow on real data (it's ## looping on all the start positions) but there are ways to ## make it much faster. Just illustrating the concept for now. qseq_on_ref_id2 <- endoapply(qseq_on_ref_id, function(ids) ids[countMatches(ids, ids) >= 0.2 * length(ids)]) (8) Remove corresponding sequences from 'qseq_on_ref': tmp <- unlist(qseq_on_ref_id2, use.names=FALSE) qseq_on_ref2 <- relist(unlist(qseq_on_ref, use.names=FALSE)[tmp], qseq_on_ref_id2) For steps (9)-(10), I'm just re-using the code I posted in that thread: https://stat.ethz.ch/pipermail/bioconductor/2013-June/053324.html (9) Compute 1 consensus matrix per chromosome: split_factor <- rep.int(seqnames(gr), elementLengths(qseq_on_ref2)) qseq_on_ref2 <- unlist(qseq_on_ref2, use.names=FALSE) qseq_on_ref2_by_chrom <- splitAsList(qseq_on_ref2, split_factor) qseq_pos_by_chrom <- splitAsList(start(gr), split_factor) cm_by_chrom <- lapply(names(qseq_pos_by_chrom), function(seqname) consensusMatrix(qseq_on_ref2_by_chrom[[seqname]], as.prob=TRUE, shift=qseq_pos_by_chrom[[seqname]]-1, width=seqlengths(gr)[[seqname]])) names(cm_by_chrom) <- names(qseq_pos_by_chrom) ## 'cm_by_chrom' is a list of consensus matrices. Each matrix ## has 17 rows (1 per letter in the DNA alphabet) and 1 column ## per chromosome position. (10) Compute the consensus string from each consensus matrix. We'll put "+" ins the strings wherever there is no coverage for that position, and "N" where there is coverage but no consensus. cs_by_chrom <- lapply(cm_by_chrom, function(cm) { ## Because consensusString() doesn't like consensus ## matrices with columns that contain only zeroes (and ## you will have columns like for chromosome positions ## that don't receive any coverage), we need to "fix" ## 'cm' first. idx <- colSums(cm) == 0 cm["+", idx] <- 1 DNAString(consensusString(cm, ambiguityMap="N")) }) consensusString() provides some flexibility to let you extract the consensus in different ways. See ?consensusString for the details. HTH, H. > > Valerie, thanks for the follow-up codes. Mclapply() was a good thought. I will play around with your Map() suggestion and see if I can get it to work. Hopefully some combination of all of this will get me to where I want to go. > > Thanks! > Sean > > > > > -----Original Message----- > From: Hervé Pagès [mailto:hpages at fhcrc.org] > Sent: Thursday, July 11, 2013 5:15 PM > To: Taylor, Sean D > Cc: Obenchain, Valerie J; Michael Lawrence; bioconductor at r-project.org > Subject: Re: [BioC] Filtering BAM files by start position for VariantTools > > Hi Taylor, > > Not clear to me what you are trying to do exactly. FWIW here is some code that will group your read sequences (and qualities) by unique start position on the genome. It uses sequenceLayer() from the Rsamtools package (was added to devel recently). The code below has some similarities with the one I sent in that thread in June: > > https://stat.ethz.ch/pipermail/bioconductor/2013-June/053324.html > > (1) Load the BAM file into a GAlignments object. > > library(Rsamtools) > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") > param <- ScanBamParam(what=c("seq", "qual")) > gal <- readGAlignmentsFromBam(bamfile, use.names=TRUE, param=param) > > (2) Use sequenceLayer() to "lay" the query sequences and quality strings > on the reference. > > qseq <- setNames(mcols(gal)$seq, names(gal)) > qual <- setNames(mcols(gal)$qual, names(gal)) > qseq_on_ref <- sequenceLayer(qseq, cigar(gal), > from="query", to="reference") > qual_on_ref <- sequenceLayer(qual, cigar(gal), > from="query", to="reference") > > (3) Split by chromosome. > > qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal)) > qual_on_ref_by_chrom <- splitAsList(qual_on_ref, seqnames(gal)) > pos_by_chrom <- splitAsList(start(gal), seqnames(gal)) > > (4) For each chromosome generate one GRanges object that contains > unique alignment start positions and attach 3 metadata columns > to it: the number of reads, the query sequences, and the quality > strings. > > gr_by_chrom <- lapply(seqlevels(gal), > function(seqname) > { > qseq_on_ref2 <- qseq_on_ref_by_chrom[[seqname]] > qual_on_ref2 <- qual_on_ref_by_chrom[[seqname]] > pos2 <- pos_by_chrom[[seqname]] > qseq_on_ref_per_pos <- split(qseq_on_ref2, pos2) > qual_on_ref_per_pos <- split(qual_on_ref2, pos2) > nread <- elementLengths(qseq_on_ref_per_pos) > gr_mcols <- DataFrame(nread=unname(nread), > qseq_on_ref=unname(qseq_on_ref_per_pos), > qual_on_ref=unname(qual_on_ref_per_pos)) > gr <- GRanges(Rle(seqname, nrow(gr_mcols)), > IRanges(as.integer(names(nread)), width=1)) > mcols(gr) <- gr_mcols > seqlevels(gr) <- seqlevels(gal) > gr > }) > > (5) Combine all the GRanges objects obtained in (4) in 1 big GRanges > object: > > gr <- do.call(c, gr_by_chrom) > seqinfo(gr) <- seqinfo(gal) > > 'gr' is a GRanges object that contains unique alignment start positions: > > > gr[1:6] > GRanges with 6 ranges and 3 metadata columns: > seqnames ranges strand | nread > <rle> <iranges> <rle> | <integer> > [1] seq1 [ 1, 1] * | 1 > [2] seq1 [ 3, 3] * | 1 > [3] seq1 [ 5, 5] * | 1 > [4] seq1 [ 6, 6] * | 1 > [5] seq1 [ 9, 9] * | 1 > [6] seq1 [13, 13] * | 2 > > qseq_on_ref > > <dnastringsetlist> > [1] > CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG > [2] > CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT > [3] > AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC > [4] > GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT > [5] > GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG > [6] > ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA,ATTGTAAATGTGTGGTTTAACTCGTCCATGGC CCAG > > qual_on_ref > > <bstringsetlist> > [1] > <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7 > [2] > <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6): > [3] > <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5 > [4] > (-&----,----)-)-),'--)---',+-,),''*, > [5] > <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5% > [6] > <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666,<<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;< ;<0; > --- > seqlengths: > seq1 seq2 > 1575 1584 > > 2 reads align to start position 13. Let's have a close look at their > sequences: > > > mcols(gr)$qseq_on_ref[[6]] > A DNAStringSet instance of length 2 > width seq names > [1] 35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA > EAS56_61:6:18:467... > [2] 36 ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAG > EAS114_28:5:296:3... > > and their qualities: > > > mcols(gr)$qual_on_ref[[6]] > A BStringSet instance of length 2 > width seq names > [1] 35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666 > EAS56_61:6:18:467... > [2] 36 <<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;<0; > EAS114_28:5:296:3... > > Note that the sequence and quality strings are those projected to the reference so the first letter in those strings are on top of start position 13, the 2nd letter on top of position 14, etc... > > Not sure what you want to do from here though... > > H. > > On 07/11/2013 10:27 AM, Taylor, Sean D wrote: >> Thanks Valerie, I'll give that a try. And I see that you just clarified my question about the input file for tallyVariants. >> >> I had thought about doing this before but hadn't been able to figure out the proper filter syntax. Thanks! The only downside to this approach is that I ultimately want to do this for all unique start sites that align to my reference. That could entail generating thousands of tempfiles that all have to be read back in. It could be done with lapply, but it sounds rather memory and time intensive. Another approach that I tried was reading the bam file as a GappedAlignments object and then splitting it by start position into a GAlignmentsList. That was really easy, but so far as I can tell tallyVariants will not accept GappedAlignments objects as an input. I wonder if there is another way to vectorize this approach? >> >> Thanks, >> Sean >> >> >> -----Original Message----- >> From: Valerie Obenchain [mailto:vobencha at fhcrc.org] >> Sent: Thursday, July 11, 2013 10:02 AM >> To: Taylor, Sean D >> Cc: bioconductor at r-project.org; Michael Lawrence >> Subject: Re: [BioC] Filtering BAM files by start position for >> VariantTools >> >> Hi Sean, >> >> As you've discovered, the 'which' in the 'param' (for reading bam files) specifies positions to overlap, not start or end. One approach to isolating reads with a specific starting position would be to filter the bam by 'pos'. >> >> library(VariantTools) >> fl <- LungCancerLines::LungCancerBamFiles()$H1993 >> >> mystart <- 1110426 >> filt <- list(setStart=function(x) x$pos %in% mystart) >> dest <- tempfile() >> filterBam(fl, dest, filter=FilterRules(filt)) >> scn <- scanBam(dest) >> >> Confirm all reads start with 'mystart': >> >> > table(scn[[1]]$pos) >> >> 1110426 >> 2388 >> >> If you want a tally of all nucleotides for all sequences starting with 'mystart' then no need to supply a 'which': >> param <- VariantTallyParam(gmapR::TP53Genome(), >> readlen=100L, >> high_base_quality=23L) >> tally <- tallyVariants(fl, param) >> >> >> Valerie >> >> >> On 07/09/2013 02:06 PM, Taylor, Sean D wrote: >>> I am trying to read a specific set of records from a bam file for use in the VariantTools package. I'm trying to construct a which argument (a GRanges object) that will pull in a set of records from all reads that only start at a specified position. (i.e. all reads that start at position 100). So far I have only been able to specify reads that overlap position 100, but have not been able to find a way to define the start site. >>> >>> #Example code: >>>> bams <- LungCancerLines::LungCancerBamFiles() >>>> bam <- bams$H1993 >>>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+') >>>> tally.param <- VariantTallyParam(gmapR::TP53Genome(), >>> + readlen = 100L, >>> + high_base_quality = 23L, >>> + which = which) >>>> raw.variants <- tallyVariants(bam, tally.param) >>> >>> This code shows all the variants at position 1110426, but not all the variants from the reads that start at position 1110426. >>> >>> Ultimately, I am trying to do this for all start positions in my data set, so I would want something that looks like this pseudocode: >>>> raw.variants<-lapply (start(bam), function (x){ >>> which<-GRanges(seqnames=c('chrM'), '+', start=x) >>> tally.param<-VariantTallyParam(gmap, readlen=100L, which=which) >>> tallyVariants(bamfile, tally.param) >>> }) >>> >>> Thanks, >>> Sean >>> >>>> sessionInfo() >>> R version 3.0.1 (2013-05-16) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 >>> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] grid parallel stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2 >>> [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 AnnotationDbi_1.22.5 >>> [7] Biobase_2.20.0 gmapR_1.2.0 latticeExtra_0.6-24 >>> [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8 >>> [13] ade4_1.5-2 VariantTools_1.2.2 VariantAnnotation_1.6.6 >>> [16] Rsamtools_1.12.3 Biostrings_2.28.0 GenomicRanges_1.12.4 >>> [19] IRanges_1.18.1 BiocGenerics_0.6.0 >>> >>> loaded via a namespace (and not attached): >>> [1] biomaRt_2.16.0 bitops_1.0-5 >>> [3] BSgenome_1.28.0 BSgenome.Hsapiens.UCSC.hg19_1.3.19 >>> [5] graph_1.38.2 Matrix_1.0-12 >>> [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2 >>> [9] RCurl_1.95-4.1 rtracklayer_1.20.2 >>> [11] stats4_3.0.1 tools_3.0.1 >>> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 [15] >>> zlibbioc_1.6.0 >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 10.8 years ago Hervé Pagès 16k

0

Entering edit mode

Herve, that helps a lot, and not just for this particular project but also for some other things I have been working on. It will take me a bit to try all this out, but I might follow up later if (when?) I run into difficulties. Thanks! Sean > -----Original Message----- > From: Hervé Pagès [mailto:hpages at fhcrc.org] > Sent: Friday, July 12, 2013 11:32 AM > To: Taylor, Sean D > Cc: Obenchain, Valerie J; Michael Lawrence; bioconductor at r-project.org > Subject: Re: [BioC] Filtering BAM files by start position for VariantTools > > Hi Sean, > > On 07/11/2013 11:17 PM, Taylor, Sean D wrote: > > Hi all, > > > > Thanks for lots of great input and ideas. You have given me a lot to try out. > In answer to Herve, here is my ultimate goal: > > I am trying to look at the dynamics of some specific point mutations. > > In my control set that I am working with, the point mutations I am > > interested in exist with known frequencies from 0.1 - 50%. One would > > think that with NGS deep coverage you could easily pick out low > > frequency events, but the PCR and sequencing error rates are high > > enough to effectively mask things below 10%. We are trying out an > > approach to try to reduce this noise. If your library is prepared > > through nextera (which generates random fragments followed by a few > > rounds of PCR) and provided that you have a large enough genome, then > > it is likely that reads that align to the same start position are > > descendant from the same template fragment and should have identical > > sequence. By analyzing these 'families' individually we hope to > > eliminate a lot of noise by discarding variants that don't exceed a > > certain threshold within their family. Hence I want to sort the reads > > by start position into these 'families', tall > y the var > iants in the families, filter out the noise (and possibly collapse the family > into a single consensus sequence), and then do a global variant tally on > what's left. > > > > VariantTools sort of addresses this problem by reporting the number of > 'cycles' supporting each variant. The number of cycles should be analogous > to the number of families, but I couldn't see that I could use this to do the > sort of thing I want to do beyond filtering out things that don't occur in > many families. Which isn't quite what I had in mind. I will have to explore > Michael's suggestion about binning further. Is your suggestion to set > something like > > which<-GRanges(seqnames=c('foo'), IRanges(1,1)) > > tally.param <- VariantTallyParam(gmapR::TP53Genome(), > > readlen = 100L, > > high_base_quality = 23L, > > which = which) > > raw.variants <- tallyVariants(bam, tally.param) It seems thought > > that this gave me all reads that overlap that position, not just the ones that > started at this position. I'm afraid I'm still learning the ins and outs of the > package, so I'm not quite sure how to accomplish what you suggested. > > > > Herve, it looks like your code does allow me to impose the cigar > > alignment on the sequences, which is something that I have really > > wanted to be able to do for a while now so I'm excited to try this > > out. I'm intrigued by your solution and it seems similar to my first > > approach which was to read the bam file in as a GappedAlignments > > object and then split it by start position. I wonder though if I will > > end up with the same problem that I originally ran into, which is that > > (correct me if I'm wrong) the tallyVariants function does not accept > > GappedAlignments objects as a valid input, but rather wants a BAM > > file. I suppose I could try using something like a pileup function > > and sort through it that way, but it would be handy to be able to use > > tallyVariants as it already does what I want if I can just get the > > input right. On the other hand, if I end up needing to collapse the > > families into a single consensus, then this might be the preferable > > route as I can directly manipulate the > whole se > quence here. > > The tallyVariants() function does not accept a GAlignments object. > I guess this is because the power horse behind it is the gmap/gsnap C code > wrapped into the gmapR package. > > However if the next thing you want to do after discarding variants that don't > exceed a certain threshold within their family is to extract a consensus > sequence, then here is one way to get there (following up on the code I > sent previously): > > (6) For each start position, remove reads with under-represented > sequence (e.g. threshold = 20% for the toy data I'm using > here which is low coverage). > > qseq_on_ref <- mcols(gr)$qseq_on_ref > ## One trick here is to assign a unique number to unique > ## sequences and to replace the sequences by this number. > ## This is because step (7) is going to be easier (and a > ## little bit faster) with numbers than with sequences. > tmp <- unlist(qseq_on_ref, use.names=FALSE) > qseq_on_ref_id <- relist(match(tmp, tmp), qseq_on_ref) > > Quick look at 'qseq_on_ref_id': > > > qseq_on_ref_id > IntegerList of length 1934 > [[1]] 1 > [[2]] 2 > [[3]] 3 > [[4]] 4 > [[5]] 5 > [[6]] 6 7 > [[7]] 8 > [[8]] 9 > [[9]] 10 11 > [[10]] 12 > ... > <1924 more elements> > > It's an IntegerList object with the same length and "shape" > as 'qseq_on_ref'. > > (7) Remove under represented ids from each list element > of 'qseq_on_ref_id': > > ## This implementation will be very slow on real data (it's > ## looping on all the start positions) but there are ways to > ## make it much faster. Just illustrating the concept for now. > qseq_on_ref_id2 <- endoapply(qseq_on_ref_id, > function(ids) ids[countMatches(ids, ids) >= 0.2 * length(ids)]) > > (8) Remove corresponding sequences from 'qseq_on_ref': > > tmp <- unlist(qseq_on_ref_id2, use.names=FALSE) > qseq_on_ref2 <- relist(unlist(qseq_on_ref, use.names=FALSE)[tmp], > qseq_on_ref_id2) > > For steps (9)-(10), I'm just re-using the code I posted in that thread: > > https://stat.ethz.ch/pipermail/bioconductor/2013-June/053324.html > > (9) Compute 1 consensus matrix per chromosome: > > split_factor <- rep.int(seqnames(gr), elementLengths(qseq_on_ref2)) > qseq_on_ref2 <- unlist(qseq_on_ref2, use.names=FALSE) > qseq_on_ref2_by_chrom <- splitAsList(qseq_on_ref2, split_factor) > qseq_pos_by_chrom <- splitAsList(start(gr), split_factor) > > cm_by_chrom <- lapply(names(qseq_pos_by_chrom), > function(seqname) > consensusMatrix(qseq_on_ref2_by_chrom[[seqname]], > as.prob=TRUE, > shift=qseq_pos_by_chrom[[seqname]]-1, > width=seqlengths(gr)[[seqname]])) > names(cm_by_chrom) <- names(qseq_pos_by_chrom) > > ## 'cm_by_chrom' is a list of consensus matrices. Each matrix > ## has 17 rows (1 per letter in the DNA alphabet) and 1 column > ## per chromosome position. > > (10) Compute the consensus string from each consensus matrix. We'll > put "+" ins the strings wherever there is no coverage for that > position, and "N" where there is coverage but no consensus. > > cs_by_chrom <- lapply(cm_by_chrom, > function(cm) { > ## Because consensusString() doesn't like consensus > ## matrices with columns that contain only zeroes (and > ## you will have columns like for chromosome positions > ## that don't receive any coverage), we need to "fix" > ## 'cm' first. > idx <- colSums(cm) == 0 > cm["+", idx] <- 1 > DNAString(consensusString(cm, ambiguityMap="N")) > }) > > consensusString() provides some flexibility to let you extract the consensus > in different ways. See ?consensusString for the details. > > HTH, > > H. > > > > > Valerie, thanks for the follow-up codes. Mclapply() was a good thought. I > will play around with your Map() suggestion and see if I can get it to work. > Hopefully some combination of all of this will get me to where I want to go. > > > > Thanks! > > Sean > > > > > > > > > > -----Original Message----- > > From: Hervé Pagès [mailto:hpages at fhcrc.org] > > Sent: Thursday, July 11, 2013 5:15 PM > > To: Taylor, Sean D > > Cc: Obenchain, Valerie J; Michael Lawrence; bioconductor at r-project.org > > Subject: Re: [BioC] Filtering BAM files by start position for > > VariantTools > > > > Hi Taylor, > > > > Not clear to me what you are trying to do exactly. FWIW here is some code > that will group your read sequences (and qualities) by unique start position > on the genome. It uses sequenceLayer() from the Rsamtools package (was > added to devel recently). The code below has some similarities with the > one I sent in that thread in June: > > > > https://stat.ethz.ch/pipermail/bioconductor/2013-June/053324.html > > > > (1) Load the BAM file into a GAlignments object. > > > > library(Rsamtools) > > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") > > param <- ScanBamParam(what=c("seq", "qual")) > > gal <- readGAlignmentsFromBam(bamfile, use.names=TRUE, > > param=param) > > > > (2) Use sequenceLayer() to "lay" the query sequences and quality strings > > on the reference. > > > > qseq <- setNames(mcols(gal)$seq, names(gal)) > > qual <- setNames(mcols(gal)$qual, names(gal)) > > qseq_on_ref <- sequenceLayer(qseq, cigar(gal), > > from="query", to="reference") > > qual_on_ref <- sequenceLayer(qual, cigar(gal), > > from="query", to="reference") > > > > (3) Split by chromosome. > > > > qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal)) > > qual_on_ref_by_chrom <- splitAsList(qual_on_ref, seqnames(gal)) > > pos_by_chrom <- splitAsList(start(gal), seqnames(gal)) > > > > (4) For each chromosome generate one GRanges object that contains > > unique alignment start positions and attach 3 metadata columns > > to it: the number of reads, the query sequences, and the quality > > strings. > > > > gr_by_chrom <- lapply(seqlevels(gal), > > function(seqname) > > { > > qseq_on_ref2 <- qseq_on_ref_by_chrom[[seqname]] > > qual_on_ref2 <- qual_on_ref_by_chrom[[seqname]] > > pos2 <- pos_by_chrom[[seqname]] > > qseq_on_ref_per_pos <- split(qseq_on_ref2, pos2) > > qual_on_ref_per_pos <- split(qual_on_ref2, pos2) > > nread <- elementLengths(qseq_on_ref_per_pos) > > gr_mcols <- DataFrame(nread=unname(nread), > > qseq_on_ref=unname(qseq_on_ref_per_pos), > > qual_on_ref=unname(qual_on_ref_per_pos)) > > gr <- GRanges(Rle(seqname, nrow(gr_mcols)), > > IRanges(as.integer(names(nread)), width=1)) > > mcols(gr) <- gr_mcols > > seqlevels(gr) <- seqlevels(gal) > > gr > > }) > > > > (5) Combine all the GRanges objects obtained in (4) in 1 big GRanges > > object: > > > > gr <- do.call(c, gr_by_chrom) > > seqinfo(gr) <- seqinfo(gal) > > > > 'gr' is a GRanges object that contains unique alignment start positions: > > > > > gr[1:6] > > GRanges with 6 ranges and 3 metadata columns: > > seqnames ranges strand | nread > > <rle> <iranges> <rle> | <integer> > > [1] seq1 [ 1, 1] * | 1 > > [2] seq1 [ 3, 3] * | 1 > > [3] seq1 [ 5, 5] * | 1 > > [4] seq1 [ 6, 6] * | 1 > > [5] seq1 [ 9, 9] * | 1 > > [6] seq1 [13, 13] * | 2 > > > > qseq_on_ref > > > > <dnastringsetlist> > > [1] > > CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG > > [2] > > CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT > > [3] > > AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC > > [4] > > GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT > > [5] > > GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG > > [6] > > > ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA,ATTGTAAATGTGTGGTTTAACT > CGTCCATGGCCC > > AG > > > > qual_on_ref > > > > <bstringsetlist> > > [1] > > <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7 > > [2] > > <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6): > > [3] > > <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5 > > [4] > > (-&----,----)-)-),'--)---',+-,),''*, > > [5] > > <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5% > > [6] > > > <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666,<<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;< ;<0; > > --- > > seqlengths: > > seq1 seq2 > > 1575 1584 > > > > 2 reads align to start position 13. Let's have a close look at their > > sequences: > > > > > mcols(gr)$qseq_on_ref[[6]] > > A DNAStringSet instance of length 2 > > width seq names > > [1] 35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA > > EAS56_61:6:18:467... > > [2] 36 ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAG > > EAS114_28:5:296:3... > > > > and their qualities: > > > > > mcols(gr)$qual_on_ref[[6]] > > A BStringSet instance of length 2 > > width seq names > > [1] 35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666 > > EAS56_61:6:18:467... > > [2] 36 <<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;<0; > > EAS114_28:5:296:3... > > > > Note that the sequence and quality strings are those projected to the > reference so the first letter in those strings are on top of start position 13, > the 2nd letter on top of position 14, etc... > > > > Not sure what you want to do from here though... > > > > H. > > > > On 07/11/2013 10:27 AM, Taylor, Sean D wrote: > >> Thanks Valerie, I'll give that a try. And I see that you just clarified my > question about the input file for tallyVariants. > >> > >> I had thought about doing this before but hadn't been able to figure out > the proper filter syntax. Thanks! The only downside to this approach is that > I ultimately want to do this for all unique start sites that align to my > reference. That could entail generating thousands of tempfiles that all have > to be read back in. It could be done with lapply, but it sounds rather > memory and time intensive. Another approach that I tried was reading the > bam file as a GappedAlignments object and then splitting it by start position > into a GAlignmentsList. That was really easy, but so far as I can tell > tallyVariants will not accept GappedAlignments objects as an input. I > wonder if there is another way to vectorize this approach? > >> > >> Thanks, > >> Sean > >> > >> > >> -----Original Message----- > >> From: Valerie Obenchain [mailto:vobencha at fhcrc.org] > >> Sent: Thursday, July 11, 2013 10:02 AM > >> To: Taylor, Sean D > >> Cc: bioconductor at r-project.org; Michael Lawrence > >> Subject: Re: [BioC] Filtering BAM files by start position for > >> VariantTools > >> > >> Hi Sean, > >> > >> As you've discovered, the 'which' in the 'param' (for reading bam files) > specifies positions to overlap, not start or end. One approach to isolating > reads with a specific starting position would be to filter the bam by 'pos'. > >> > >> library(VariantTools) > >> fl <- LungCancerLines::LungCancerBamFiles()$H1993 > >> > >> mystart <- 1110426 > >> filt <- list(setStart=function(x) x$pos %in% mystart) > >> dest <- tempfile() > >> filterBam(fl, dest, filter=FilterRules(filt)) > >> scn <- scanBam(dest) > >> > >> Confirm all reads start with 'mystart': > >> > >> > table(scn[[1]]$pos) > >> > >> 1110426 > >> 2388 > >> > >> If you want a tally of all nucleotides for all sequences starting with > 'mystart' then no need to supply a 'which': > >> param <- VariantTallyParam(gmapR::TP53Genome(), > >> readlen=100L, > >> high_base_quality=23L) > >> tally <- tallyVariants(fl, param) > >> > >> > >> Valerie > >> > >> > >> On 07/09/2013 02:06 PM, Taylor, Sean D wrote: > >>> I am trying to read a specific set of records from a bam file for use in the > VariantTools package. I'm trying to construct a which argument (a GRanges > object) that will pull in a set of records from all reads that only start at a > specified position. (i.e. all reads that start at position 100). So far I have only > been able to specify reads that overlap position 100, but have not been able > to find a way to define the start site. > >>> > >>> #Example code: > >>>> bams <- LungCancerLines::LungCancerBamFiles() > >>>> bam <- bams$H1993 > >>>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+') > >>>> tally.param <- VariantTallyParam(gmapR::TP53Genome(), > >>> + readlen = 100L, > >>> + high_base_quality = 23L, > >>> + which = which) > >>>> raw.variants <- tallyVariants(bam, tally.param) > >>> > >>> This code shows all the variants at position 1110426, but not all the > variants from the reads that start at position 1110426. > >>> > >>> Ultimately, I am trying to do this for all start positions in my data set, so I > would want something that looks like this pseudocode: > >>>> raw.variants<-lapply (start(bam), function (x){ > >>> which<-GRanges(seqnames=c('chrM'), '+', start=x) > >>> tally.param<-VariantTallyParam(gmap, readlen=100L, which=which) > >>> tallyVariants(bamfile, tally.param) > >>> }) > >>> > >>> Thanks, > >>> Sean > >>> > >>>> sessionInfo() > >>> R version 3.0.1 (2013-05-16) > >>> Platform: x86_64-unknown-linux-gnu (64-bit) > >>> > >>> locale: > >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > LC_TIME=en_US.UTF-8 > >>> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 > LC_MESSAGES=en_US.UTF-8 > >>> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > >>> > >>> attached base packages: > >>> [1] grid parallel stats graphics grDevices utils datasets methods > base > >>> > >>> other attached packages: > >>> [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2 > >>> [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1 > AnnotationDbi_1.22.5 > >>> [7] Biobase_2.20.0 gmapR_1.2.0 latticeExtra_0.6-24 > >>> [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8 > >>> [13] ade4_1.5-2 VariantTools_1.2.2 VariantAnnotation_1.6.6 > >>> [16] Rsamtools_1.12.3 Biostrings_2.28.0 GenomicRanges_1.12.4 > >>> [19] IRanges_1.18.1 BiocGenerics_0.6.0 > >>> > >>> loaded via a namespace (and not attached): > >>> [1] biomaRt_2.16.0 bitops_1.0-5 > >>> [3] BSgenome_1.28.0 > BSgenome.Hsapiens.UCSC.hg19_1.3.19 > >>> [5] graph_1.38.2 Matrix_1.0-12 > >>> [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2 > >>> [9] RCurl_1.95-4.1 rtracklayer_1.20.2 > >>> [11] stats4_3.0.1 tools_3.0.1 > >>> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 [15] > >>> zlibbioc_1.6.0 > >>> > >>> > >>> [[alternative HTML version deleted]] > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > > > -- > > Hervé Pagès > > > > Program in Computational Biology > > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N, M1-B514 > > P.O. Box 19024 > > Seattle, WA 98109-1024 > > > > E-mail: hpages at fhcrc.org > > Phone: (206) 667-5791 > > Fax: (206) 667-1319 > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319

ADD REPLY • link 10.8 years ago Taylor, Sean D ▴ 250

Login before adding your answer.