Blast a BSgenome
1
0
Entering edit mode
Ugo Borello ▴ 340
@ugo-borello-5753
Last seen 6.4 years ago
France
Good morning, I have a custom made Bsgenome data package and I would like to align a cDNA sequence (~1kb) with the genome sequence of my Bsgenome, and retrieve the aligned sequences. Any suggestions? Any BLAST-like function in Bioconductor? If I understand correctly Biostrings::matchPattern works with small sequences, right? And I need in addition to account for gaps in the alignment! Does annotate::blastSequences could accept my Bsgenome genomic sequence as argument? Thank you in advance for your help. Ugo
BSgenome BSgenome BSgenome BSgenome • 1.6k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 19 hours ago
Seattle, WA, United States
Hi Ugo, On 07/05/2013 04:18 AM, Ugo Borello wrote: > Good morning, > I have a custom made Bsgenome data package and I would like to align a cDNA > sequence (~1kb) with the genome sequence of my Bsgenome, and retrieve the > aligned sequences. > Any suggestions? Any BLAST-like function in Bioconductor? > If I understand correctly Biostrings::matchPattern works with small > sequences, right? And I need in addition to account for gaps in the > alignment! There should be no restriction on the length of the sequences you can pass to matchPattern(), and it supports a small number of mismatches and indels. However, there seems to be a bug that currently limits the length of the pattern to 256 chars when 'with.indels=TRUE'. I just fixed that in Biostrings 2.29.13: library(BSgenome.Hsapiens.UCSC.hg19) cdna <- getSeq(Hsapiens, "chr17", start=150001, width=1000) ## Introducing 3 indels (1 insertion of length 2, 1 insertion of ## length 1, 1 deletion of length 2) and 4 mismatches: at <- IRanges(start=c(45, 112, 204, 388, 677, 725, 930), width=c(0, 1, 1, 1, 0, 1, 2)) cdna2 <- replaceAt(cdna, at, value=c("GG", "A", "C", "T", "G", "C", "")) Testing: > matchPattern(cdna2, Hsapiens$chr17, max.mismatch=9, with.indels=TRUE) Views on a 81195210-letter DNAString subject subject: AAGCTTCTCACCCTGTTCCTGCATAGATAATTGC...GGGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGG T views: start end width [1] 150001 151000 1000 [TCTCACCCACTCTAGAAGGGGCTGGC...CGGCCACCTCATTTCATCGGCCACC] This doesn't use an heuristic like BLAST so it's guaranteed to return all alignments with an edit distance <= 9 from the pattern. As a consequence, it's also much slower than BLAST and cannot realistically be used with a 'max.mismatch' value >= 15 when 'with.indels' is TRUE (will take about 4 hours to do a full Human genome search with 'max.mismatch=15' and 'with.indels=TRUE'). To return all the matches on the entire genome as a GRanges object, you can do something like (one gotcha is to remember to search the 2 strands of each chromosome): all_hits <- lapply(seqnames(Hsapiens), function(seqname) { subject <- Hsapiens[[seqname]] plus_hits <- matchPattern(cdna2, subject, max.mismatch=9, with.indels=TRUE) minus_hits <- matchPattern(reverseComplement(cdna2), subject, max.mismatch=9, with.indels=TRUE) ans_ranges <- c(as(plus_hits, "IRanges"), as(minus_hits, "IRanges")) ans_strand <- Rle(c("+", "-"), c(length(plus_hits), length(minus_hits))) ans <- GRanges(Rle(seqname, length(ans_ranges)), ans_ranges, strand=ans_strand) seqlevels(ans) <- seqlevels(Hsapiens) ans }) all_hits <- do.call(c, all_hits) seqinfo(all_hits) <- seqinfo(Hsapiens) Then, if you want to get the corresponding sequences: getSeq(Hsapiens, all_hits) Biostrings 2.29.13 should become available thru biocLite() to Bioc- devel users in the next 24 hours or so. Cheers, H. > Does annotate::blastSequences could accept my Bsgenome genomic sequence as > argument? > Thank you in advance for your help. > > Ugo > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Thank you very much Herve'. Ugo > From: Hervé Pagès <hpages at="" fhcrc.org=""> > Date: Mon, 08 Jul 2013 14:41:32 -0700 > To: Ugo Borello <ugo.borello at="" inserm.fr=""> > Cc: <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] Blast a BSgenome > > Hi Ugo, > > On 07/05/2013 04:18 AM, Ugo Borello wrote: >> Good morning, >> I have a custom made Bsgenome data package and I would like to align a cDNA >> sequence (~1kb) with the genome sequence of my Bsgenome, and retrieve the >> aligned sequences. >> Any suggestions? Any BLAST-like function in Bioconductor? >> If I understand correctly Biostrings::matchPattern works with small >> sequences, right? And I need in addition to account for gaps in the >> alignment! > > There should be no restriction on the length of the sequences you can > pass to matchPattern(), and it supports a small number of mismatches > and indels. However, there seems to be a bug that currently limits the > length of the pattern to 256 chars when 'with.indels=TRUE'. I just > fixed that in Biostrings 2.29.13: > > library(BSgenome.Hsapiens.UCSC.hg19) > cdna <- getSeq(Hsapiens, "chr17", start=150001, width=1000) > > ## Introducing 3 indels (1 insertion of length 2, 1 insertion of > ## length 1, 1 deletion of length 2) and 4 mismatches: > at <- IRanges(start=c(45, 112, 204, 388, 677, 725, 930), > width=c(0, 1, 1, 1, 0, 1, 2)) > cdna2 <- replaceAt(cdna, at, value=c("GG", "A", "C", "T", "G", "C", "")) > > Testing: > >> matchPattern(cdna2, Hsapiens$chr17, max.mismatch=9, with.indels=TRUE) > Views on a 81195210-letter DNAString subject > subject: > AAGCTTCTCACCCTGTTCCTGCATAGATAATTGC...GGGTGTGGGTGTGGTGTGTGGGTGTGGGTGT GGT > views: > start end width > [1] 150001 151000 1000 > [TCTCACCCACTCTAGAAGGGGCTGGC...CGGCCACCTCATTTCATCGGCCACC] > > This doesn't use an heuristic like BLAST so it's guaranteed to return > all alignments with an edit distance <= 9 from the pattern. As a > consequence, it's also much slower than BLAST and cannot realistically > be used with a 'max.mismatch' value >= 15 when 'with.indels' is TRUE > (will take about 4 hours to do a full Human genome search with > 'max.mismatch=15' and 'with.indels=TRUE'). > > To return all the matches on the entire genome as a GRanges object, you > can do something like (one gotcha is to remember to search the 2 strands > of each chromosome): > > all_hits <- lapply(seqnames(Hsapiens), > function(seqname) > { > subject <- Hsapiens[[seqname]] > plus_hits <- matchPattern(cdna2, subject, > max.mismatch=9, with.indels=TRUE) > minus_hits <- matchPattern(reverseComplement(cdna2), subject, > max.mismatch=9, with.indels=TRUE) > ans_ranges <- c(as(plus_hits, "IRanges"), as(minus_hits, > "IRanges")) > ans_strand <- Rle(c("+", "-"), c(length(plus_hits), > length(minus_hits))) > ans <- GRanges(Rle(seqname, length(ans_ranges)), > ans_ranges, > strand=ans_strand) > seqlevels(ans) <- seqlevels(Hsapiens) > ans > }) > > all_hits <- do.call(c, all_hits) > seqinfo(all_hits) <- seqinfo(Hsapiens) > > Then, if you want to get the corresponding sequences: > > getSeq(Hsapiens, all_hits) > > Biostrings 2.29.13 should become available thru biocLite() to Bioc- devel > users in the next 24 hours or so. > > Cheers, > H. > >> Does annotate::blastSequences could accept my Bsgenome genomic sequence as >> argument? >> Thank you in advance for your help. >> >> Ugo >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319
ADD REPLY

Login before adding your answer.

Traffic: 1050 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6