Consensus sequence

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

I have a use case that I am not sure the best way to proceed. I have a non-model organism for which we would like to know the sequence for a set of genes. It's a bird species, and I can for example align to the zebra finch genome, then run bam_tally over a region to get variants where my species differs from zebra finch, and then infer the sequence based on the reference and the variants. But that seems harder than it should be. Is there some function that I am missing that I can feed a GRanges and get back the consensus sequence for that region, based on say a simple vote of the reads that overlap it? Thanks, Jim [[alternative HTML version deleted]]

Organism • 1.8k views

ADD COMMENT • link updated 9.6 years ago by Hervé Pagès 16k • written 9.6 years ago by James W. MacDonald 65k

1

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 3 hours ago

Seattle, WA, United States

Hi Jim, On 09/08/2014 07:32 PM, James W. MacDonald wrote: > I have a use case that I am not sure the best way to proceed. > > I have a non-model organism for which we would like to know the sequence > for a set of genes. It's a bird species, and I can for example align to the > zebra finch genome, then run bam_tally over a region to get variants where > my species differs from zebra finch, and then infer the sequence based on > the reference and the variants. > > But that seems harder than it should be. Is there some function that I am > missing that I can feed a GRanges and get back the consensus sequence for > that region, based on say a simple vote of the reads that overlap it? So basically you're going to use the same approach as if you had reads from a zebra finch individual. The real gene sequences for that individual would probably also differ from the reference sequence, maybe less than for your bird that is not zebra finch, but that does not really change the overall game right? I agree you should try to avoid the intermediate step of variant calling. Sounds a little bit simpler to aim directly for the consensus sequence of your regions of interest. Here is one way to do this: library(GenomicAlignments) ## Uses pileLettersAt() and alpahabetFrequency() internally. ## Arguments in ... are passed to alpahabetFrequency(). ## The 'param' arg must be like in stackStringsFromBam() (see ## ?stackStringsFromBam for the details). alphabetFrequencyFromBam <- function(file, index=file, param, ...) { param <- GenomicAlignments:::.normarg_param(param) region <- bamWhich(param) bamWhat(param) <- "seq" gal <- readGAlignmentsFromBam(file, index=index, param=param) seqlevels(gal) <- names(region) qseq <- mcols(gal)$seq at <- GRanges(names(region), IRanges(start(region[[1L]]):end(region[[1L]]), width=1L)) piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal), at) alphabetFrequency(piles, ...) } ## A naive consensus string extraction: at each position, select ## the letter with highest frequency. Insert letter "." if all ## frequencies are 0 at a given position. Insert letter "*" is ## there is a tie. consensusStringFromFrequencyMatrix <- function(fm) { selection <- apply(fm, 2, function(x) { i <- which.max(x) if (x[i] == 0L) return(5L) if (sum(x == x[i]) != 1L) return(6L) i }) paste0(c(DNA_BASES, ".", "*")[selection], collapse="") } Then: ## Get some genes. library(GenomicFeatures) txdb <- makeTranscriptDbFromUCSC(genome="taeGut2", table="refGene") my_genes <- sample(genes(txdb), 20) ## Extract the corresponding consensus sequences from your BAM file. my_gene_seqs <- sapply(seq_along(my_genes), function(i) { af <- alphabetFrequencyFromBam(bamfile, param=my_genes[i], baseOnly=TRUE) fm <- t(af[ , DNA_BASES]) consensusStringFromFrequencyMatrix(fm) }) IMPORTANT: The strand in 'my_genes' is ignored, that is, all sequences are extracted from the plus strand (and going from 5' to 3' along that strand), even for genes that belong to the minus strand. It's not going to be very efficient (approx. 1 s per gene). Cheers, H. > > Thanks, > > Jim > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 9.6 years ago Hervé Pagès 16k

0

Entering edit mode

I should add that the code I sent previously makes sense if the genes in your individual contains only a "small" number of substitutions with respect to the reference genes. If they contain indels and/or a high number of substitutions, it would not be appropriated (it would still "work" but the consensus sequences could not be trusted). The problem with indels is that the alphabetFrequencyFromBam() function doesn't know how to handle them (because it only knows how to report things that align with nucleotides in the reference genome, but insertions don't). Also having too many substitutions would make the alignment process less reliable (would make it less likely that reads coming from gene X align to the region corresponding to gene X in the reference). H. On 09/10/2014 03:44 PM, Hervé Pagès wrote: > Hi Jim, > > On 09/08/2014 07:32 PM, James W. MacDonald wrote: >> I have a use case that I am not sure the best way to proceed. >> >> I have a non-model organism for which we would like to know the sequence >> for a set of genes. It's a bird species, and I can for example align >> to the >> zebra finch genome, then run bam_tally over a region to get variants >> where >> my species differs from zebra finch, and then infer the sequence based on >> the reference and the variants. >> >> But that seems harder than it should be. Is there some function that I am >> missing that I can feed a GRanges and get back the consensus sequence for >> that region, based on say a simple vote of the reads that overlap it? > > So basically you're going to use the same approach as if you had reads > from a zebra finch individual. The real gene sequences for that > individual would probably also differ from the reference sequence, > maybe less than for your bird that is not zebra finch, but that does > not really change the overall game right? > > I agree you should try to avoid the intermediate step of variant > calling. Sounds a little bit simpler to aim directly for the > consensus sequence of your regions of interest. Here is one way > to do this: > > library(GenomicAlignments) > > ## Uses pileLettersAt() and alpahabetFrequency() internally. > ## Arguments in ... are passed to alpahabetFrequency(). > ## The 'param' arg must be like in stackStringsFromBam() (see > ## ?stackStringsFromBam for the details). > alphabetFrequencyFromBam <- function(file, index=file, param, ...) > { > param <- GenomicAlignments:::.normarg_param(param) > region <- bamWhich(param) > bamWhat(param) <- "seq" > gal <- readGAlignmentsFromBam(file, index=index, param=param) > seqlevels(gal) <- names(region) > qseq <- mcols(gal)$seq > at <- GRanges(names(region), > IRanges(start(region[[1L]]):end(region[[1L]]), > width=1L)) > piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal), > at) > alphabetFrequency(piles, ...) > } > > ## A naive consensus string extraction: at each position, select > ## the letter with highest frequency. Insert letter "." if all > ## frequencies are 0 at a given position. Insert letter "*" is > ## there is a tie. > consensusStringFromFrequencyMatrix <- function(fm) > { > selection <- apply(fm, 2, > function(x) { > i <- which.max(x) > if (x[i] == 0L) > return(5L) > if (sum(x == x[i]) != 1L) > return(6L) > i > }) > paste0(c(DNA_BASES, ".", "*")[selection], collapse="") > } > > Then: > > ## Get some genes. > library(GenomicFeatures) > txdb <- makeTranscriptDbFromUCSC(genome="taeGut2", table="refGene") > my_genes <- sample(genes(txdb), 20) > > ## Extract the corresponding consensus sequences from your BAM file. > my_gene_seqs <- sapply(seq_along(my_genes), > function(i) { > af <- alphabetFrequencyFromBam(bamfile, param=my_genes[i], > baseOnly=TRUE) > fm <- t(af[ , DNA_BASES]) > consensusStringFromFrequencyMatrix(fm) > }) > > IMPORTANT: The strand in 'my_genes' is ignored, that is, all sequences > are extracted from the plus strand (and going from 5' to 3' along that > strand), even for genes that belong to the minus strand. > > It's not going to be very efficient (approx. 1 s per gene). > > Cheers, > H. > >> >> Thanks, >> >> Jim >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 9.6 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve, Thanks! That works better than the kludgetastic function I set up. But note I had to add one line: alphabetFrequencyFromBam <- function(file, index=file, param, ...){ seqlevels(param) <- seqlevelsInUse(param) ## added this param <- GenomicAlignments:::.normarg_param(param) region <- bamWhich(param) bamWhat(param) <- "seq" gal <- readGAlignmentsFromBam(file, index=index, param=param) seqlevels(gal) <- names(region) qseq <- mcols(gal)$seq at <- GRanges(names(region), IRanges(start(region[[1L]]):end(region[[1L]]), width=1L)) piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal), at) alphabetFrequency(piles, ...) } because of how normarg_param() treats GRanges with multiple seqlevels: > bamWhich(GenomicAlignments:::.normarg_param("chr5:1307857-1308626")) IRangesList of length 1 $chr5 IRanges of length 1 start end width [1] 1307857 1308626 770 > bdnf <- GRanges("chr5", IRanges(1307857,1308626)) > bamWhich(GenomicAlignments:::.normarg_param(bdnf)) IRangesList of length 1 $chr5 IRanges of length 1 start end width [1] 1307857 1308626 770 > gns <- genes(txdb) > bdnf2 <- gns["751584"] > bdnf2 GRanges with 1 range and 1 metadata column: seqnames ranges strand | gene_id <rle> <iranges> <rle> | <characterlist> 751584 chr5 [1307857, 1308626] - | 751584 --- seqlengths: chr1 ... chrZ_KB442450_random 118548696 ... 2283 > bamWhich(GenomicAlignments:::.normarg_param(bdnf2)) IRangesList of length 37096 $chr1 IRanges of length 0 $chr2 IRanges of length 0 $chr3 IRanges of length 0 ... <37093 more elements> But this param instance still technically fulfills the criteria for stackStringsFromBam: > unlist(bamWhich(GenomicAlignments:::.normarg_param(bdnf2))) IRanges of length 1 start end width names [1] 1307857 1308626 770 chr5.751584 Best, Jim On Wed, Sep 10, 2014 at 7:33 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > I should add that the code I sent previously makes sense if > the genes in your individual contains only a "small" number of > substitutions with respect to the reference genes. If they > contain indels and/or a high number of substitutions, it would > not be appropriated (it would still "work" but the consensus > sequences could not be trusted). The problem with indels is that > the alphabetFrequencyFromBam() function doesn't know how to > handle them (because it only knows how to report things that > align with nucleotides in the reference genome, but insertions > don't). Also having too many substitutions would make the > alignment process less reliable (would make it less likely > that reads coming from gene X align to the region > corresponding to gene X in the reference). > > H. > > > > On 09/10/2014 03:44 PM, Hervé Pagès wrote: > >> Hi Jim, >> >> On 09/08/2014 07:32 PM, James W. MacDonald wrote: >> >>> I have a use case that I am not sure the best way to proceed. >>> >>> I have a non-model organism for which we would like to know the sequence >>> for a set of genes. It's a bird species, and I can for example align >>> to the >>> zebra finch genome, then run bam_tally over a region to get variants >>> where >>> my species differs from zebra finch, and then infer the sequence based on >>> the reference and the variants. >>> >>> But that seems harder than it should be. Is there some function that I am >>> missing that I can feed a GRanges and get back the consensus sequence for >>> that region, based on say a simple vote of the reads that overlap it? >>> >> >> So basically you're going to use the same approach as if you had reads >> from a zebra finch individual. The real gene sequences for that >> individual would probably also differ from the reference sequence, >> maybe less than for your bird that is not zebra finch, but that does >> not really change the overall game right? >> >> I agree you should try to avoid the intermediate step of variant >> calling. Sounds a little bit simpler to aim directly for the >> consensus sequence of your regions of interest. Here is one way >> to do this: >> >> library(GenomicAlignments) >> >> ## Uses pileLettersAt() and alpahabetFrequency() internally. >> ## Arguments in ... are passed to alpahabetFrequency(). >> ## The 'param' arg must be like in stackStringsFromBam() (see >> ## ?stackStringsFromBam for the details). >> alphabetFrequencyFromBam <- function(file, index=file, param, ...) >> { >> param <- GenomicAlignments:::.normarg_param(param) >> region <- bamWhich(param) >> bamWhat(param) <- "seq" >> gal <- readGAlignmentsFromBam(file, index=index, param=param) >> seqlevels(gal) <- names(region) >> qseq <- mcols(gal)$seq >> at <- GRanges(names(region), >> IRanges(start(region[[1L]]):end(region[[1L]]), >> width=1L)) >> piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal), >> at) >> alphabetFrequency(piles, ...) >> } >> >> ## A naive consensus string extraction: at each position, select >> ## the letter with highest frequency. Insert letter "." if all >> ## frequencies are 0 at a given position. Insert letter "*" is >> ## there is a tie. >> consensusStringFromFrequencyMatrix <- function(fm) >> { >> selection <- apply(fm, 2, >> function(x) { >> i <- which.max(x) >> if (x[i] == 0L) >> return(5L) >> if (sum(x == x[i]) != 1L) >> return(6L) >> i >> }) >> paste0(c(DNA_BASES, ".", "*")[selection], collapse="") >> } >> >> Then: >> >> ## Get some genes. >> library(GenomicFeatures) >> txdb <- makeTranscriptDbFromUCSC(genome="taeGut2", table="refGene") >> my_genes <- sample(genes(txdb), 20) >> >> ## Extract the corresponding consensus sequences from your BAM file. >> my_gene_seqs <- sapply(seq_along(my_genes), >> function(i) { >> af <- alphabetFrequencyFromBam(bamfile, param=my_genes[i], >> baseOnly=TRUE) >> fm <- t(af[ , DNA_BASES]) >> consensusStringFromFrequencyMatrix(fm) >> }) >> >> IMPORTANT: The strand in 'my_genes' is ignored, that is, all sequences >> are extracted from the plus strand (and going from 5' to 3' along that >> strand), even for genes that belong to the minus strand. >> >> It's not going to be very efficient (approx. 1 s per gene). >> >> Cheers, >> H. >> >> >>> Thanks, >>> >>> Jim >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]]

ADD REPLY • link 9.6 years ago James W. MacDonald 65k

0

Entering edit mode

Hi Jim, Thanks for the feedback. Yesterday I added alphabetFrequencyFromBam() to GenomicAlignments (in devel). Just applied your fix to it now. Cheers, H. On 09/12/2014 08:06 AM, James W. MacDonald wrote: > Hi Herve, > > Thanks! That works better than the kludgetastic function I set up. > > But note I had to add one line: > > alphabetFrequencyFromBam <- function(file, index=file, param, ...){ > seqlevels(param) <- seqlevelsInUse(param) > ## added this > param <- GenomicAlignments:::.normarg_param(param) > region <- bamWhich(param) > bamWhat(param) <- "seq" > gal <- readGAlignmentsFromBam(file, index=index, param=param) > seqlevels(gal) <- names(region) > qseq <- mcols(gal)$seq > at <- GRanges(names(region), > IRanges(start(region[[1L]]):end(region[[1L]]), > width=1L)) > piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal), > at) > alphabetFrequency(piles, ...) > } > > because of how normarg_param() treats GRanges with multiple seqlevels: > > > bamWhich(GenomicAlignments:::.normarg_param("chr5:1307857-1308626")) > IRangesList of length 1 > $chr5 > IRanges of length 1 > start end width > [1] 1307857 1308626 770 > > > bdnf <- GRanges("chr5", IRanges(1307857,1308626)) > > bamWhich(GenomicAlignments:::.normarg_param(bdnf)) > IRangesList of length 1 > $chr5 > IRanges of length 1 > start end width > [1] 1307857 1308626 770 > > > gns <- genes(txdb) > > bdnf2 <- gns["751584"] > > bdnf2 > GRanges with 1 range and 1 metadata column: > seqnames ranges strand | gene_id > <rle> <iranges> <rle> | <characterlist> > 751584 chr5 [1307857, 1308626] - | 751584 > --- > seqlengths: > chr1 ... chrZ_KB442450_random > 118548696 ... 2283 > > bamWhich(GenomicAlignments:::.normarg_param(bdnf2)) > IRangesList of length 37096 > $chr1 > IRanges of length 0 > > $chr2 > IRanges of length 0 > > $chr3 > IRanges of length 0 > > ... > <37093 more elements> > > But this param instance still technically fulfills the criteria for > stackStringsFromBam: > > > unlist(bamWhich(GenomicAlignments:::.normarg_param(bdnf2))) > IRanges of length 1 > start end width names > [1] 1307857 1308626 770 chr5.751584 > > Best, > > Jim > > > > > On Wed, Sep 10, 2014 at 7:33 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > I should add that the code I sent previously makes sense if > the genes in your individual contains only a "small" number of > substitutions with respect to the reference genes. If they > contain indels and/or a high number of substitutions, it would > not be appropriated (it would still "work" but the consensus > sequences could not be trusted). The problem with indels is that > the alphabetFrequencyFromBam() function doesn't know how to > handle them (because it only knows how to report things that > align with nucleotides in the reference genome, but insertions > don't). Also having too many substitutions would make the > alignment process less reliable (would make it less likely > that reads coming from gene X align to the region > corresponding to gene X in the reference). > > H. > > > > On 09/10/2014 03:44 PM, Hervé Pagès wrote: > > Hi Jim, > > On 09/08/2014 07:32 PM, James W. MacDonald wrote: > > I have a use case that I am not sure the best way to proceed. > > I have a non-model organism for which we would like to know > the sequence > for a set of genes. It's a bird species, and I can for > example align > to the > zebra finch genome, then run bam_tally over a region to get > variants > where > my species differs from zebra finch, and then infer the > sequence based on > the reference and the variants. > > But that seems harder than it should be. Is there some > function that I am > missing that I can feed a GRanges and get back the consensus > sequence for > that region, based on say a simple vote of the reads that > overlap it? > > > So basically you're going to use the same approach as if you had > reads > from a zebra finch individual. The real gene sequences for that > individual would probably also differ from the reference sequence, > maybe less than for your bird that is not zebra finch, but that does > not really change the overall game right? > > I agree you should try to avoid the intermediate step of variant > calling. Sounds a little bit simpler to aim directly for the > consensus sequence of your regions of interest. Here is one way > to do this: > > library(GenomicAlignments) > > ## Uses pileLettersAt() and alpahabetFrequency() internally. > ## Arguments in ... are passed to alpahabetFrequency(). > ## The 'param' arg must be like in stackStringsFromBam() (see > ## ?stackStringsFromBam for the details). > alphabetFrequencyFromBam <- function(file, index=file, > param, ...) > { > param <- GenomicAlignments:::.normarg___param(param) > region <- bamWhich(param) > bamWhat(param) <- "seq" > gal <- readGAlignmentsFromBam(file, index=index, param=param) > seqlevels(gal) <- names(region) > qseq <- mcols(gal)$seq > at <- GRanges(names(region), > IRanges(start(region[[1L]]):__end(region[[1L]]), > width=1L)) > piles <- pileLettersAt(qseq, seqnames(gal), start(gal), > cigar(gal), > at) > alphabetFrequency(piles, ...) > } > > ## A naive consensus string extraction: at each position, select > ## the letter with highest frequency. Insert letter "." if all > ## frequencies are 0 at a given position. Insert letter "*" is > ## there is a tie. > consensusStringFromFrequencyMa__trix <- function(fm) > { > selection <- apply(fm, 2, > function(x) { > i <- which.max(x) > if (x[i] == 0L) > return(5L) > if (sum(x == x[i]) != 1L) > return(6L) > i > }) > paste0(c(DNA_BASES, ".", "*")[selection], collapse="") > } > > Then: > > ## Get some genes. > library(GenomicFeatures) > txdb <- makeTranscriptDbFromUCSC(__genome="taeGut2", > table="refGene") > my_genes <- sample(genes(txdb), 20) > > ## Extract the corresponding consensus sequences from your > BAM file. > my_gene_seqs <- sapply(seq_along(my_genes), > function(i) { > af <- alphabetFrequencyFromBam(__bamfile, param=my_genes[i], > baseOnly=TRUE) > fm <- t(af[ , DNA_BASES]) > consensusStringFromFrequencyMa__trix(fm) > }) > > IMPORTANT: The strand in 'my_genes' is ignored, that is, all > sequences > are extracted from the plus strand (and going from 5' to 3' > along that > strand), even for genes that belong to the minus strand. > > It's not going to be very efficient (approx. 1 s per gene). > > Cheers, > H. > > > Thanks, > > Jim > > [[alternative HTML version deleted]] > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 9.6 years ago Hervé Pagès 16k

Login before adding your answer.