Search
Question: Blast a BSgenome
0
5.0 years ago by
Ugo Borello340
France
Ugo Borello340 wrote:
Good morning, I have a custom made Bsgenome data package and I would like to align a cDNA sequence (~1kb) with the genome sequence of my Bsgenome, and retrieve the aligned sequences. Any suggestions? Any BLAST-like function in Bioconductor? If I understand correctly Biostrings::matchPattern works with small sequences, right? And I need in addition to account for gaps in the alignment! Does annotate::blastSequences could accept my Bsgenome genomic sequence as argument? Thank you in advance for your help. Ugo
modified 5.0 years ago by Hervé Pagès ♦♦ 13k • written 5.0 years ago by Ugo Borello340
0
5.0 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:
Hi Ugo, On 07/05/2013 04:18 AM, Ugo Borello wrote: > Good morning, > I have a custom made Bsgenome data package and I would like to align a cDNA > sequence (~1kb) with the genome sequence of my Bsgenome, and retrieve the > aligned sequences. > Any suggestions? Any BLAST-like function in Bioconductor? > If I understand correctly Biostrings::matchPattern works with small > sequences, right? And I need in addition to account for gaps in the > alignment! There should be no restriction on the length of the sequences you can pass to matchPattern(), and it supports a small number of mismatches and indels. However, there seems to be a bug that currently limits the length of the pattern to 256 chars when 'with.indels=TRUE'. I just fixed that in Biostrings 2.29.13: library(BSgenome.Hsapiens.UCSC.hg19) cdna <- getSeq(Hsapiens, "chr17", start=150001, width=1000) ## Introducing 3 indels (1 insertion of length 2, 1 insertion of ## length 1, 1 deletion of length 2) and 4 mismatches: at <- IRanges(start=c(45, 112, 204, 388, 677, 725, 930), width=c(0, 1, 1, 1, 0, 1, 2)) cdna2 <- replaceAt(cdna, at, value=c("GG", "A", "C", "T", "G", "C", "")) Testing: > matchPattern(cdna2, Hsapienschr17, max.mismatch=9, with.indels=TRUE) Views on a 81195210-letter DNAString subject subject: AAGCTTCTCACCCTGTTCCTGCATAGATAATTGC...GGGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGG T views: start end width [1] 150001 151000 1000 [TCTCACCCACTCTAGAAGGGGCTGGC...CGGCCACCTCATTTCATCGGCCACC] This doesn't use an heuristic like BLAST so it's guaranteed to return all alignments with an edit distance <= 9 from the pattern. As a consequence, it's also much slower than BLAST and cannot realistically be used with a 'max.mismatch' value >= 15 when 'with.indels' is TRUE (will take about 4 hours to do a full Human genome search with 'max.mismatch=15' and 'with.indels=TRUE'). To return all the matches on the entire genome as a GRanges object, you can do something like (one gotcha is to remember to search the 2 strands of each chromosome): all_hits <- lapply(seqnames(Hsapiens), function(seqname) { subject <- Hsapiens[[seqname]] plus_hits <- matchPattern(cdna2, subject, max.mismatch=9, with.indels=TRUE) minus_hits <- matchPattern(reverseComplement(cdna2), subject, max.mismatch=9, with.indels=TRUE) ans_ranges <- c(as(plus_hits, "IRanges"), as(minus_hits, "IRanges")) ans_strand <- Rle(c("+", "-"), c(length(plus_hits), length(minus_hits))) ans <- GRanges(Rle(seqname, length(ans_ranges)), ans_ranges, strand=ans_strand) seqlevels(ans) <- seqlevels(Hsapiens) ans }) all_hits <- do.call(c, all_hits) seqinfo(all_hits) <- seqinfo(Hsapiens) Then, if you want to get the corresponding sequences: getSeq(Hsapiens, all_hits) Biostrings 2.29.13 should become available thru biocLite() to Bioc- devel users in the next 24 hours or so. Cheers, H. > Does annotate::blastSequences could accept my Bsgenome genomic sequence as > argument? > Thank you in advance for your help. > > Ugo > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 ADD COMMENTlink written 5.0 years ago by Hervé Pagès ♦♦ 13k Thank you very much Herve'. Ugo > From: Hervé Pagès <hpages at="" fhcrc.org=""> > Date: Mon, 08 Jul 2013 14:41:32 -0700 > To: Ugo Borello <ugo.borello at="" inserm.fr=""> > Cc: <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] Blast a BSgenome > > Hi Ugo, > > On 07/05/2013 04:18 AM, Ugo Borello wrote: >> Good morning, >> I have a custom made Bsgenome data package and I would like to align a cDNA >> sequence (~1kb) with the genome sequence of my Bsgenome, and retrieve the >> aligned sequences. >> Any suggestions? Any BLAST-like function in Bioconductor? >> If I understand correctly Biostrings::matchPattern works with small >> sequences, right? And I need in addition to account for gaps in the >> alignment! > > There should be no restriction on the length of the sequences you can > pass to matchPattern(), and it supports a small number of mismatches > and indels. However, there seems to be a bug that currently limits the > length of the pattern to 256 chars when 'with.indels=TRUE'. I just > fixed that in Biostrings 2.29.13: > > library(BSgenome.Hsapiens.UCSC.hg19) > cdna <- getSeq(Hsapiens, "chr17", start=150001, width=1000) > > ## Introducing 3 indels (1 insertion of length 2, 1 insertion of > ## length 1, 1 deletion of length 2) and 4 mismatches: > at <- IRanges(start=c(45, 112, 204, 388, 677, 725, 930), > width=c(0, 1, 1, 1, 0, 1, 2)) > cdna2 <- replaceAt(cdna, at, value=c("GG", "A", "C", "T", "G", "C", "")) > > Testing: > >> matchPattern(cdna2, Hsapienschr17, max.mismatch=9, with.indels=TRUE) > Views on a 81195210-letter DNAString subject > subject: > AAGCTTCTCACCCTGTTCCTGCATAGATAATTGC...GGGTGTGGGTGTGGTGTGTGGGTGTGGGTGT GGT > views: > start end width > [1] 150001 151000 1000 > [TCTCACCCACTCTAGAAGGGGCTGGC...CGGCCACCTCATTTCATCGGCCACC] > > This doesn't use an heuristic like BLAST so it's guaranteed to return > all alignments with an edit distance <= 9 from the pattern. As a > consequence, it's also much slower than BLAST and cannot realistically > be used with a 'max.mismatch' value >= 15 when 'with.indels' is TRUE > (will take about 4 hours to do a full Human genome search with > 'max.mismatch=15' and 'with.indels=TRUE'). > > To return all the matches on the entire genome as a GRanges object, you > can do something like (one gotcha is to remember to search the 2 strands > of each chromosome): > > all_hits <- lapply(seqnames(Hsapiens), > function(seqname) > { > subject <- Hsapiens[[seqname]] > plus_hits <- matchPattern(cdna2, subject, > max.mismatch=9, with.indels=TRUE) > minus_hits <- matchPattern(reverseComplement(cdna2), subject, > max.mismatch=9, with.indels=TRUE) > ans_ranges <- c(as(plus_hits, "IRanges"), as(minus_hits, > "IRanges")) > ans_strand <- Rle(c("+", "-"), c(length(plus_hits), > length(minus_hits))) > ans <- GRanges(Rle(seqname, length(ans_ranges)), > ans_ranges, > strand=ans_strand) > seqlevels(ans) <- seqlevels(Hsapiens) > ans > }) > > all_hits <- do.call(c, all_hits) > seqinfo(all_hits) <- seqinfo(Hsapiens) > > Then, if you want to get the corresponding sequences: > > getSeq(Hsapiens, all_hits) > > Biostrings 2.29.13 should become available thru biocLite() to Bioc- devel > users in the next 24 hours or so. > > Cheers, > H. > >> Does annotate::blastSequences could accept my Bsgenome genomic sequence as >> argument? >> Thank you in advance for your help. >> >> Ugo >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319