Question

Mapping Probes to a Genome

Dario Strbenac ★ 1.5k

Is there a simple way to map a list of probe sequences to a genome? The use case is probes of a custom-designed NanoString assay. The RLF file provided by NanoString has a gene symbol and probe sequence, but not the genomic coordinates of the probe. I don't want the hassle of running bowtie for a small set of sequences (about 200) that map to the genome with no mismatches. I'd also like to include the mapping procedure in an R Markdown document without calling a complex aligner through system().

In answer to a similar question asked five years ago, mapping to the transcriptome with vwhichPDict was recommended, but I want genomic coordinates, so I'd like to map against BSgenome.Hsapiens.UCSC.hg19 (both strands) and obtain a GRangesList result.

BSgenome biostrings probe mapping • 2.5k views

ADD COMMENT • link updated 7.0 years ago by holgerbrandl ▴ 10 • written 7.6 years ago by Dario Strbenac ★ 1.5k

score 1 · Answer 1 · 2017-05-04

If you have to map probes of differing lengths (which rules out direct use of vmatchPDict), you could also simply iterate over them with purrr/dplyr:

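library(purrr)
library(dplyr)
library(tidyr)  # provides unnest()
library(Biostrings)
library(BSgenome.Dmelanogaster.UCSC.dm3)

# primerInfo: a data frame with one primer sequence per row in column primer_seq
mappedPrimers = primerInfo %>%
   # match each primer on its own, so differing lengths are not a problem
   mutate(mapped = map(primer_seq, ~ as.data.frame(vmatchPDict(DNAStringSet(.x), Dmelanogaster)))) %>%
   # expand the per-primer hit tables into one row per genomic match
   unnest(mapped)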

score 0 · Answer 2 · 2016-09-01

Hotz, Hans-Rudolf ▴ 400

Hi Dario

Have you looked into the QuasR package (or just the Rbowtie package)? This allows you to use a 'complex aligner' without using a system call.

Regards, Hans-Rudolf

ADD COMMENT • link 7.6 years ago Hotz, Hans-Rudolf ▴ 400

This solution involves formatting the reads into a FASTA file, generating a genome index, writing BAM files to disk, and importing them back into R. The other proposed answer solves the problem more directly, without writing results to disk and reading them back.

ADD REPLY • link 7.6 years ago Dario Strbenac ★ 1.5k

score 0 · Answer 3 · 2016-09-01

Martin Morgan 25k

vmatchPDict() returns a GRanges with an mcols column 'index' identifying the corresponding probe. I guess you'd match the probes and their reverse complement.

ADD COMMENT • link 7.6 years ago Martin Morgan 25k

This solution does not work because the probes do not all have the same length, and vmatchPDict raises an error when it tries to coerce the pdict argument into a PDict object. To handle this I need to create a PDict object with tb.start and tb.end settings, but then vmatchPDict gives a different error, because it accepts only a DNAStringSet for the pdict parameter, not a PDict object.

e.g. (artificial)

> vmatchPDict(DNAStringSet(c("GATC", "TAG")), Hsapiens)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr,  :
  element 2 in Trusted Band has a different length than first element
> vmatchPDict(PDict(DNAStringSet(c("GATC", "TAG")), tb.start=1, tb.end =3), Hsapiens)
Error in .local(pdict, subject, max.mismatch, min.mismatch, fixed, algorithm,  :
  'pdict' must be a DNAStringSet object

It's bizarre that the parameter is named pdict, but it results in an error if the user provides a PDict object.

Could an example showing how to restrict the search to chromosomes of interest also be added to the Examples section of the documentation? There doesn't seem to be any way to search only chr1, ..., chr22, chrX, chrY, chrM for my human analysis scenario.
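
One possible way to restrict the search, offered as an untested sketch. It assumes the `exclude` argument of the BSgenome search methods takes a character vector of patterns and skips every sequence whose name matches one of them. In hg19 the non-canonical sequence names all contain an underscore, so excluding "_" should leave chr1, ..., chr22, chrX, chrY, chrM. The probe is a hypothetical stand-in cut from an arbitrary chr1 locus so that it has at least one exact match.

```r
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg19)

# Names of the sequences expected to remain after the filter.
canonical <- grep("_", seqnames(Hsapiens), value = TRUE, invert = TRUE)

# Hypothetical probe: 35 bases taken from an arbitrary position on chr1.
probe <- getSeq(Hsapiens, GRanges("chr1", IRanges(start = 1000001, width = 35)))

# Assumption: 'exclude' drops every sequence whose name matches the pattern.
hits <- vmatchPDict(probe, Hsapiens, exclude = "_")

# TRUE if every hit falls on one of the canonical sequences.
all(as.character(seqnames(hits)) %in% canonical)
```

If the filter behaves differently in your BSgenome version, subsetting the returned GRanges by seqnames afterwards gives the same set of hits at the cost of a longer search.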

Actually, vmatchPDict is just a code stub in BSgenome version 1.40.1. Why is it even present in the release version of BSgenome?

> vmatchPDict(DNAStringSet("GATC", Hsapiens))
Error in .local(pdict, subject, max.mismatch, min.mismatch, with.indels,  :
  vmatchPDict() is not ready yet, sorry

Code which doesn't work should only be in the development version (and be finished before the next Bioconductor release version).

ADD REPLY • link 7.6 years ago Dario Strbenac ★ 1.5k

Group probes by length and analyze each group separately?

vmatchPDict is a generic. Generic arguments can be given generic names like 'x' or 'object', which lack semantic content. Often, though, argument names are influenced by a combination of most-common use, historical constraint (to avoid breaking user code), or developer aesthetic (e.g., symmetry across related functions).

It's not true that vmatchPDict is a code stub. It is a generic with 4 methods, including the method (for signature BSgenome) that you invoke successfully on a BSgenome subject.
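
A minimal sketch of the group-by-length suggestion, not a tested recipe. It assumes BSgenome.Hsapiens.UCSC.hg19 is installed and that the BSgenome method of vmatchPDict() adds an 'index' metadata column to its result. The two probes are hypothetical stand-ins cut from arbitrary chr1 loci so that each has at least one exact match; real probe sequences would come from the RLF file. Check the strand column of the result: if your BSgenome version reports only plus-strand hits, repeat the search with reverseComplement(probes).

```r
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg19)

# Hypothetical probes of unequal length, taken from arbitrary chr1 loci.
loci <- GRanges("chr1", IRanges(start = c(1000001, 2000001), width = c(35, 40)))
probes <- getSeq(Hsapiens, loci)
names(probes) <- c("probeA", "probeB")

# Group probe indices by length so every vmatchPDict() call sees equal widths.
idx_by_width <- split(seq_along(probes), width(probes))

hits <- lapply(idx_by_width, function(idx) {
  gr <- vmatchPDict(probes[idx], Hsapiens)
  # 'index' counts within the current group; translate it to the probe name.
  gr$probe <- names(probes)[idx[gr$index]]
  gr
})

# Combine the per-length results, then split them into one element per probe.
all_hits <- do.call(c, unname(hits))
by_probe <- split(all_hits, all_hits$probe)
```

Here by_probe is a GRangesList keyed by probe name. Each length group triggers its own pass over the genome, so the run time grows with the number of distinct probe lengths.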

ADD REPLY • link 7.6 years ago Martin Morgan 25k

It might also be fun to try the non-Bioconductor package AhoCorasickTrie::AhoCorasickSearch().

ADD REPLY • link 7.6 years ago Martin Morgan 25k