Search
Question: Blast a BSgenome
0
gravatar for Ugo Borello
4.4 years ago by
Ugo Borello340
France
Ugo Borello340 wrote:
Good morning, I have a custom made Bsgenome data package and I would like to align a cDNA sequence (~1kb) with the genome sequence of my Bsgenome, and retrieve the aligned sequences. Any suggestions? Any BLAST-like function in Bioconductor? If I understand correctly Biostrings::matchPattern works with small sequences, right? And I need in addition to account for gaps in the alignment! Does annotate::blastSequences could accept my Bsgenome genomic sequence as argument? Thank you in advance for your help. Ugo
ADD COMMENTlink modified 4.4 years ago by Hervé Pagès ♦♦ 13k • written 4.4 years ago by Ugo Borello340
0
gravatar for Hervé Pagès
4.4 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:
Hi Ugo, On 07/05/2013 04:18 AM, Ugo Borello wrote: > Good morning, > I have a custom made Bsgenome data package and I would like to align a cDNA > sequence (~1kb) with the genome sequence of my Bsgenome, and retrieve the > aligned sequences. > Any suggestions? Any BLAST-like function in Bioconductor? > If I understand correctly Biostrings::matchPattern works with small > sequences, right? And I need in addition to account for gaps in the > alignment! There should be no restriction on the length of the sequences you can pass to matchPattern(), and it supports a small number of mismatches and indels. However, there seems to be a bug that currently limits the length of the pattern to 256 chars when 'with.indels=TRUE'. I just fixed that in Biostrings 2.29.13: library(BSgenome.Hsapiens.UCSC.hg19) cdna <- getSeq(Hsapiens, "chr17", start=150001, width=1000) ## Introducing 3 indels (1 insertion of length 2, 1 insertion of ## length 1, 1 deletion of length 2) and 4 mismatches: at <- IRanges(start=c(45, 112, 204, 388, 677, 725, 930), width=c(0, 1, 1, 1, 0, 1, 2)) cdna2 <- replaceAt(cdna, at, value=c("GG", "A", "C", "T", "G", "C", "")) Testing: > matchPattern(cdna2, Hsapiens$chr17, max.mismatch=9, with.indels=TRUE) Views on a 81195210-letter DNAString subject subject: AAGCTTCTCACCCTGTTCCTGCATAGATAATTGC...GGGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGG T views: start end width [1] 150001 151000 1000 [TCTCACCCACTCTAGAAGGGGCTGGC...CGGCCACCTCATTTCATCGGCCACC] This doesn't use an heuristic like BLAST so it's guaranteed to return all alignments with an edit distance <= 9 from the pattern. As a consequence, it's also much slower than BLAST and cannot realistically be used with a 'max.mismatch' value >= 15 when 'with.indels' is TRUE (will take about 4 hours to do a full Human genome search with 'max.mismatch=15' and 'with.indels=TRUE'). To return all the matches on the entire genome as a GRanges object, you can do something like (one gotcha is to remember to search the 2 strands of each chromosome): all_hits <- lapply(seqnames(Hsapiens), function(seqname) { subject <- Hsapiens[[seqname]] plus_hits <- matchPattern(cdna2, subject, max.mismatch=9, with.indels=TRUE) minus_hits <- matchPattern(reverseComplement(cdna2), subject, max.mismatch=9, with.indels=TRUE) ans_ranges <- c(as(plus_hits, "IRanges"), as(minus_hits, "IRanges")) ans_strand <- Rle(c("+", "-"), c(length(plus_hits), length(minus_hits))) ans <- GRanges(Rle(seqname, length(ans_ranges)), ans_ranges, strand=ans_strand) seqlevels(ans) <- seqlevels(Hsapiens) ans }) all_hits <- do.call(c, all_hits) seqinfo(all_hits) <- seqinfo(Hsapiens) Then, if you want to get the corresponding sequences: getSeq(Hsapiens, all_hits) Biostrings 2.29.13 should become available thru biocLite() to Bioc- devel users in the next 24 hours or so. Cheers, H. > Does annotate::blastSequences could accept my Bsgenome genomic sequence as > argument? > Thank you in advance for your help. > > Ugo > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENTlink written 4.4 years ago by Hervé Pagès ♦♦ 13k
Thank you very much Herve'. Ugo > From: Hervé Pagès <hpages at="" fhcrc.org=""> > Date: Mon, 08 Jul 2013 14:41:32 -0700 > To: Ugo Borello <ugo.borello at="" inserm.fr=""> > Cc: <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] Blast a BSgenome > > Hi Ugo, > > On 07/05/2013 04:18 AM, Ugo Borello wrote: >> Good morning, >> I have a custom made Bsgenome data package and I would like to align a cDNA >> sequence (~1kb) with the genome sequence of my Bsgenome, and retrieve the >> aligned sequences. >> Any suggestions? Any BLAST-like function in Bioconductor? >> If I understand correctly Biostrings::matchPattern works with small >> sequences, right? And I need in addition to account for gaps in the >> alignment! > > There should be no restriction on the length of the sequences you can > pass to matchPattern(), and it supports a small number of mismatches > and indels. However, there seems to be a bug that currently limits the > length of the pattern to 256 chars when 'with.indels=TRUE'. I just > fixed that in Biostrings 2.29.13: > > library(BSgenome.Hsapiens.UCSC.hg19) > cdna <- getSeq(Hsapiens, "chr17", start=150001, width=1000) > > ## Introducing 3 indels (1 insertion of length 2, 1 insertion of > ## length 1, 1 deletion of length 2) and 4 mismatches: > at <- IRanges(start=c(45, 112, 204, 388, 677, 725, 930), > width=c(0, 1, 1, 1, 0, 1, 2)) > cdna2 <- replaceAt(cdna, at, value=c("GG", "A", "C", "T", "G", "C", "")) > > Testing: > >> matchPattern(cdna2, Hsapiens$chr17, max.mismatch=9, with.indels=TRUE) > Views on a 81195210-letter DNAString subject > subject: > AAGCTTCTCACCCTGTTCCTGCATAGATAATTGC...GGGTGTGGGTGTGGTGTGTGGGTGTGGGTGT GGT > views: > start end width > [1] 150001 151000 1000 > [TCTCACCCACTCTAGAAGGGGCTGGC...CGGCCACCTCATTTCATCGGCCACC] > > This doesn't use an heuristic like BLAST so it's guaranteed to return > all alignments with an edit distance <= 9 from the pattern. As a > consequence, it's also much slower than BLAST and cannot realistically > be used with a 'max.mismatch' value >= 15 when 'with.indels' is TRUE > (will take about 4 hours to do a full Human genome search with > 'max.mismatch=15' and 'with.indels=TRUE'). > > To return all the matches on the entire genome as a GRanges object, you > can do something like (one gotcha is to remember to search the 2 strands > of each chromosome): > > all_hits <- lapply(seqnames(Hsapiens), > function(seqname) > { > subject <- Hsapiens[[seqname]] > plus_hits <- matchPattern(cdna2, subject, > max.mismatch=9, with.indels=TRUE) > minus_hits <- matchPattern(reverseComplement(cdna2), subject, > max.mismatch=9, with.indels=TRUE) > ans_ranges <- c(as(plus_hits, "IRanges"), as(minus_hits, > "IRanges")) > ans_strand <- Rle(c("+", "-"), c(length(plus_hits), > length(minus_hits))) > ans <- GRanges(Rle(seqname, length(ans_ranges)), > ans_ranges, > strand=ans_strand) > seqlevels(ans) <- seqlevels(Hsapiens) > ans > }) > > all_hits <- do.call(c, all_hits) > seqinfo(all_hits) <- seqinfo(Hsapiens) > > Then, if you want to get the corresponding sequences: > > getSeq(Hsapiens, all_hits) > > Biostrings 2.29.13 should become available thru biocLite() to Bioc- devel > users in the next 24 hours or so. > > Cheers, > H. > >> Does annotate::blastSequences could accept my Bsgenome genomic sequence as >> argument? >> Thank you in advance for your help. >> >> Ugo >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319
ADD REPLYlink written 4.4 years ago by Ugo Borello340
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 144 users visited in the last hour