alignShortReads

0

Entering edit mode

Sonja Althammer ▴ 10

@sonja-althammer-6105

Last seen 9.6 years ago

Hi! I am having trouble aligning a DNAStringSet to a BSgenome. The example in the help works for me.... Could anyone explain to me the error I am getting, please and how I can make it work? Actually I would like to align these reads to a specific gene... Is there maybe a better way? Thanks! Sonja > alignShortReads(reads, Hsapiens, seqNames="chr1") Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr, : non base DNA letter found in Trusted Band for pattern 248 > class(reads) [1] "DNAStringSet" attr(,"package") [1] "Biostrings" > class(Hsapiens) [1] "BSgenome" attr(,"package") [1] "BSgenome" [[alternative HTML version deleted]]

BSgenome BSgenome BSgenome BSgenome • 725 views

ADD COMMENT • link updated 10.7 years ago by Martin Morgan 25k • written 10.7 years ago by Sonja Althammer ▴ 10

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 2 days ago

United States

On 08/21/2013 02:54 PM, Sonja Althammer wrote: > Hi! > I am having trouble aligning a DNAStringSet to a BSgenome. The example in > the help works for me.... > Could anyone explain to me the error I am getting, please and how I can > make it work? Actually I would like to align these reads to a specific > gene... Is there maybe a better way? > Thanks! > Sonja > > >> alignShortReads(reads, Hsapiens, seqNames="chr1") > Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr, alignShortReads is from R453Plus1Toolbox. DNAStringSet can contain ambiguity letters such as 'M', to indicate that the nucleotide could be either A or C > IUPAC_CODE_MAP A C G T M R W S Y K V "A" "C" "G" "T" "AC" "AG" "AT" "CG" "CT" "GT" "ACG" H D B N "ACT" "AGT" "CGT" "ACGT" but the aligner in alignShortReads expects only > DNA_BASES [1] "A" "C" "G" "T" If your sequences do contain just bases, it would make sense to run Biostrings::matchPDict directly on the sequence of the gene of interest. You could obtained the sequence of your gene(s) of interest by, e.g., using the function ?select and the Homo.sapiens package to get coordinates of the gene, ?getSeq to extract the sequence from the Hsapiens BSgenome object you already have, and matchPDict to do the alignment. Probably in a way that requires more care, I did the following to arrive at the sequence of the transcripts of BRCA1 library(BSgenome.Hsapiens.UCSC.hg19) library(Homo.sapiens) Discover what I can do: ## 'keytypes' I can query with... keytypes(Homo.sapiens) ## discover the correct way to specify a gene symbol I'm interested in grep("BRCA", keys(Homo.sapiens, "SYMBOL")) ## discover what info I can extract cols(Homo.sapiens) Now create ranges for transcripts ## select genomic coordinates for BRCA1 xx = select(Homo.sapiens, "BRCA1", c("TXID", "TXCHROM", "TXSTRAND", "TXSTART", "TXEND"), "SYMBOL") ## turn coordinates into a GRanges yy = with(xx, GRanges(TXCHROM, IRanges(TXSTART, TXEND), TXSTRAND)) or alternatively map to the central 'Entrez' identifier and... ## figure out Entrez ID corresponding to symbol eid = select(Homo.sapiens, "BRCA1", "ENTREZID", "SYMBOL")$ENTREZID ## extract transcript coordinates from TxDb yy = transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")[eid] What's a gene? Maybe it's just the range of transcripts: gn = range(yy) Finally retrieve the relevant sequence(s) seq = getSeq(Hsapiens, yy) ## or getSeq(Hsapiens, gn) Judging by the package you're using, you might have longer 454 reads, where the ungapped high fidelity matches of matchPDict are not appropriate. If you're interested in more general alignments, or if the sequences you're trying to align do have ambiguity letters, there is Biostrings::pairwiseAlignment. It might still be that this general purpose aligner is not appropriate for the task you're trying to accomplish, so some detail (and perhaps input from others on the list!) might be needed. Hope that provides some guidance. Martin > : > non base DNA letter found in Trusted Band for pattern 248 >> class(reads) > [1] "DNAStringSet" > attr(,"package") > [1] "Biostrings" > >> class(Hsapiens) > [1] "BSgenome" > attr(,"package") > [1] "BSgenome" > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 10.7 years ago Martin Morgan 25k

Login before adding your answer.