Create consensus fasta file from indexed Bam file

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Hello, I have an indexed bam files from 454 sequencing on short reference sequences. I would like to create a consensus sequence in fasta format from this bam file using R Bioconductor. Is there a way to do that simply ? Thanks in advance Jacques -- output of sessionInfo(): R version 2.15.1 (2012-06-22) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C [5] LC_TIME=French_France.1252 attached base packages: [1] grDevices datasets splines graphics stats tcltk utils [8] methods base other attached packages: [1] Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 [4] IRanges_1.16.6 BiocGenerics_0.4.0 TinnR_1.0-5 [7] R2HTML_2.2.1 Hmisc_3.10-1.1 survival_2.36-14 loaded via a namespace (and not attached): [1] bitops_1.0-5 cluster_1.14.4 grid_2.15.1 lattice_0.20-15 [5] parallel_2.15.1 stats4_2.15.1 tools_2.15.1 zlibbioc_1.4.0 -- Sent via the guest posting facility at bioconductor.org.

Sequencing Sequencing • 2.4k views

ADD COMMENT • link updated 10.8 years ago by Hervé Pagès 16k • written 10.8 years ago by Guest User ★ 13k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 16 hours ago

Seattle, WA, United States

Hi Jacques, I just added a tool to the Rsamtools package, that makes this easy. It's the sequenceLayer() function. Here is how to use it: (1) Load the BAM file into a GAlignments object: library(Rsamtools) bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") param <- ScanBamParam(what="seq") gal <- readGAlignmentsFromBam(bamfile, param=param) qseq <- mcols(gal)$seq # the query sequences (2) Use sequenceLayer() to "lay" the query sequences on the reference space. This will remove the parts from the query sequences that correspond to insertions and soft clipping, and it will fill them with - where deletions and/or skipped regions occurred: qseq_on_ref <- sequenceLayer(qseq, cigar(gal), from="query", to="reference") (3) Compute 1 consensus matrix per chromosome: qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal)) qseq_pos_by_chrom <- splitAsList(start(gal), seqnames(gal)) cm_by_chrom <- lapply(names(qseq_pos_by_chrom), function(seqname) consensusMatrix(qseq_on_ref_by_chrom[[seqname]], as.prob=TRUE, shift=qseq_pos_by_chrom[[seqname]]-1, width=seqlengths(gal)[[seqname]])) names(cm_by_chrom) <- names(qseq_pos_by_chrom) ## 'cm_by_chrom' is a list of consensus matrices. Each matrix ## has 17 rows (1 per letter in the DNA alphabet) and 1 column ## per chromosome position. (4) Compute the consensus string from each consensus matrix. We'll put "+" ins the strings wherever there is no coverage for that position, and "N" where there is coverage but no consensus. cs_by_chrom <- lapply(cm_by_chrom, function(cm) { ## Because consensusString() doesn't like consensus ## matrices with columns that contain only zeroes (and ## you will have columns like for chromosome positions ## that don't receive any coverage), we need to "fix" ## 'cm' first. idx <- colSums(cm) == 0 cm["+", idx] <- 1 DNAString(consensusString(cm, ambiguityMap="N")) }) (5) Write 'cs_by_chrom' to a FASTA file: writeXStringSet(DNAStringSet(cs_by_chrom), "consensus.fa") Please let us know how this goes with your 454 data. Note that you will need the 1.13.18 version of Rsamtools for this, which is the latest devel version of the package. This means you need to install BioC devel. See our website for how to do this. Rsamtools 1.13.18 will be available in the next 24 hours or so thru biocLite(). Cheers, H. On 06/19/2013 02:54 AM, Maintainer wrote: > > Hello, > I have an indexed bam files from 454 sequencing on short reference sequences. I would like to create a consensus sequence in fasta format from this bam file using R Bioconductor. Is there a way to do that simply ? > Thanks in advance > Jacques > > -- output of sessionInfo(): > > R version 2.15.1 (2012-06-22) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 > [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C > [5] LC_TIME=French_France.1252 > > attached base packages: > [1] grDevices datasets splines graphics stats tcltk utils > [8] methods base > > other attached packages: > [1] Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 > [4] IRanges_1.16.6 BiocGenerics_0.4.0 TinnR_1.0-5 > [7] R2HTML_2.2.1 Hmisc_3.10-1.1 survival_2.36-14 > > loaded via a namespace (and not attached): > [1] bitops_1.0-5 cluster_1.14.4 grid_2.15.1 lattice_0.20-15 > [5] parallel_2.15.1 stats4_2.15.1 tools_2.15.1 zlibbioc_1.4.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > ____________________________________________________________________ ____ > devteam-bioc mailing list > To unsubscribe from this mailing list send a blank email to > devteam-bioc-leave at lists.fhcrc.org > You can also unsubscribe or change your personal options at > https://lists.fhcrc.org/mailman/listinfo/devteam-bioc > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 10.8 years ago Hervé Pagès 16k

0

Entering edit mode

An alternative approach would be to summarize the Rsamtools::applyPileup() or gmapR::bam_tally output. Michael On Wed, Jun 19, 2013 at 12:45 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Jacques, > > I just added a tool to the Rsamtools package, that makes this easy. > It's the sequenceLayer() function. Here is how to use it: > > (1) Load the BAM file into a GAlignments object: > > library(Rsamtools) > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") > param <- ScanBamParam(what="seq") > gal <- readGAlignmentsFromBam(**bamfile, param=param) > qseq <- mcols(gal)$seq # the query sequences > > (2) Use sequenceLayer() to "lay" the query sequences on the reference > space. This will remove the parts from the query sequences that > correspond to insertions and soft clipping, and it will fill them > with - where deletions and/or skipped regions occurred: > > qseq_on_ref <- sequenceLayer(qseq, cigar(gal), > from="query", to="reference") > > (3) Compute 1 consensus matrix per chromosome: > > qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal)) > qseq_pos_by_chrom <- splitAsList(start(gal), seqnames(gal)) > > cm_by_chrom <- lapply(names(qseq_pos_by_**chrom), > function(seqname) > consensusMatrix(qseq_on_ref_**by_chrom[[seqname]], > as.prob=TRUE, > shift=qseq_pos_by_chrom[[**seqname]]-1, > width=seqlengths(gal)[[**seqname]])) > names(cm_by_chrom) <- names(qseq_pos_by_chrom) > > ## 'cm_by_chrom' is a list of consensus matrices. Each matrix > ## has 17 rows (1 per letter in the DNA alphabet) and 1 column > ## per chromosome position. > > (4) Compute the consensus string from each consensus matrix. We'll > put "+" ins the strings wherever there is no coverage for that > position, and "N" where there is coverage but no consensus. > > cs_by_chrom <- lapply(cm_by_chrom, > function(cm) { > ## Because consensusString() doesn't like consensus > ## matrices with columns that contain only zeroes (and > ## you will have columns like for chromosome positions > ## that don't receive any coverage), we need to "fix" > ## 'cm' first. > idx <- colSums(cm) == 0 > cm["+", idx] <- 1 > DNAString(consensusString(cm, ambiguityMap="N")) > }) > > (5) Write 'cs_by_chrom' to a FASTA file: > > writeXStringSet(DNAStringSet(**cs_by_chrom), "consensus.fa") > > Please let us know how this goes with your 454 data. > > Note that you will need the 1.13.18 version of Rsamtools for this, > which is the latest devel version of the package. This means you need > to install BioC devel. See our website for how to do this. > Rsamtools 1.13.18 will be available in the next 24 hours or so thru > biocLite(). > > Cheers, > H. > > > > On 06/19/2013 02:54 AM, Maintainer wrote: > >> >> Hello, >> I have an indexed bam files from 454 sequencing on short reference >> sequences. I would like to create a consensus sequence in fasta format from >> this bam file using R Bioconductor. Is there a way to do that simply ? >> Thanks in advance >> Jacques >> >> -- output of sessionInfo(): >> >> R version 2.15.1 (2012-06-22) >> Platform: i386-pc-mingw32/i386 (32-bit) >> >> locale: >> [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 >> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C >> [5] LC_TIME=French_France.1252 >> >> attached base packages: >> [1] grDevices datasets splines graphics stats tcltk utils >> [8] methods base >> >> other attached packages: >> [1] Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 >> [4] IRanges_1.16.6 BiocGenerics_0.4.0 TinnR_1.0-5 >> [7] R2HTML_2.2.1 Hmisc_3.10-1.1 survival_2.36-14 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-5 cluster_1.14.4 grid_2.15.1 lattice_0.20-15 >> [5] parallel_2.15.1 stats4_2.15.1 tools_2.15.1 zlibbioc_1.4.0 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> ______________________________**______________________________** >> ____________ >> devteam-bioc mailing list >> To unsubscribe from this mailing list send a blank email to >> devteam-bioc-leave@lists.**fhcrc.org <devteam-bioc- leave@lists.fhcrc.org=""> >> You can also unsubscribe or change your personal options at >> https://lists.fhcrc.org/**mailman/listinfo/devteam- bioc<https: lists.fhcrc.org="" mailman="" listinfo="" devteam-bioc=""> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Herv?, Just tested what you proposed on my files. It worked very fine. Thanks a lot for the help. Regards > Hi Jacques, > > I just added a tool to the Rsamtools package, that makes this easy. > It's the sequenceLayer() function. Here is how to use it: > > (1) Load the BAM file into a GAlignments object: > > library(Rsamtools) > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") > param <- ScanBamParam(what="seq") > gal <- readGAlignmentsFromBam(bamfile, param=param) > qseq <- mcols(gal)$seq # the query sequences > > (2) Use sequenceLayer() to "lay" the query sequences on the reference > space. This will remove the parts from the query sequences that > correspond to insertions and soft clipping, and it will fill them > with - where deletions and/or skipped regions occurred: > > qseq_on_ref <- sequenceLayer(qseq, cigar(gal), > from="query", to="reference") > > (3) Compute 1 consensus matrix per chromosome: > > qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal)) > qseq_pos_by_chrom <- splitAsList(start(gal), seqnames(gal)) > > cm_by_chrom <- lapply(names(qseq_pos_by_chrom), > function(seqname) > consensusMatrix(qseq_on_ref_by_chrom[[seqname]], > as.prob=TRUE, > shift=qseq_pos_by_chrom[[seqname]]-1, > width=seqlengths(gal)[[seqname]])) > names(cm_by_chrom) <- names(qseq_pos_by_chrom) > > ## 'cm_by_chrom' is a list of consensus matrices. Each matrix > ## has 17 rows (1 per letter in the DNA alphabet) and 1 column > ## per chromosome position. > > (4) Compute the consensus string from each consensus matrix. We'll > put "+" ins the strings wherever there is no coverage for that > position, and "N" where there is coverage but no consensus. > > cs_by_chrom <- lapply(cm_by_chrom, > function(cm) { > ## Because consensusString() doesn't like consensus > ## matrices with columns that contain only zeroes (and > ## you will have columns like for chromosome positions > ## that don't receive any coverage), we need to "fix" > ## 'cm' first. > idx <- colSums(cm) == 0 > cm["+", idx] <- 1 > DNAString(consensusString(cm, ambiguityMap="N")) > }) > > (5) Write 'cs_by_chrom' to a FASTA file: > > writeXStringSet(DNAStringSet(cs_by_chrom), "consensus.fa") > > Please let us know how this goes with your 454 data. > > Note that you will need the 1.13.18 version of Rsamtools for this, > which is the latest devel version of the package. This means you need > to install BioC devel. See our website for how to do this. > Rsamtools 1.13.18 will be available in the next 24 hours or so thru > biocLite(). > > Cheers, > H. > > > On 06/19/2013 02:54 AM, Maintainer wrote: >> >> Hello, >> I have an indexed bam files from 454 sequencing on short reference >> sequences. I would like to create a consensus sequence in fasta >> format from this bam file using R Bioconductor. Is there a way to do >> that simply ? >> Thanks in advance >> Jacques >> >> -- output of sessionInfo(): >> >> R version 2.15.1 (2012-06-22) >> Platform: i386-pc-mingw32/i386 (32-bit) >> >> locale: >> [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 >> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C >> [5] LC_TIME=French_France.1252 >> >> attached base packages: >> [1] grDevices datasets splines graphics stats tcltk utils >> [8] methods base >> >> other attached packages: >> [1] Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 >> [4] IRanges_1.16.6 BiocGenerics_0.4.0 TinnR_1.0-5 >> [7] R2HTML_2.2.1 Hmisc_3.10-1.1 survival_2.36-14 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-5 cluster_1.14.4 grid_2.15.1 lattice_0.20-15 >> [5] parallel_2.15.1 stats4_2.15.1 tools_2.15.1 zlibbioc_1.4.0 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> ___________________________________________________________________ _____ >> devteam-bioc mailing list >> To unsubscribe from this mailing list send a blank email to >> devteam-bioc-leave at lists.fhcrc.org >> You can also unsubscribe or change your personal options at >> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc >> > -- Jacques Le Gouis INRA, UMR INRA/UBP 1095 G?n?tique, Diversit? et Ecophysiologie des C?r?ales Domaine de Crouelle 5 Chemin de Beaulieu 63039 Clermont Ferrand Cedex 2 Tel : +33 (0)4-73-62-43-11 Fax : +33 (0)4-73-67-18-39 e-mail : jlegouis at clermont.inra.fr http://www4.clermont.inra.fr/umr1095/abc

ADD REPLY • link 10.8 years ago Jacques LEGOUIS ▴ 10

Login before adding your answer.