Question: Mapping Probes to a Genome
gravatar for Dario Strbenac
21 months ago by
Dario Strbenac1.4k
Dario Strbenac1.4k wrote:

Is there a simple way to map a list of probe sequences to a genome ? The use case is probes of a custom-designed NanoString assay. The RLF file provided by NanoString has a gene symbol and probe sequence, but not the genomic coordinates of the probe. I don't want the hassle of running bowtie for a small set of sequences (about 200) which map to the genome with no mismatches and I'd like to include the mapping procedure in an R Markdown document without using a complex aligner through system() calls.

For another similar question asked five years ago, it was recommended to map to the transcriptome with vwhichPDict but I want the genomic coordinates, so I'd like to map with BSgenome.Hsapiens.UCSC.hg19 (both strands) and obtain a GRangesList result.

ADD COMMENTlink modified 13 months ago by holgerbrandl10 • written 21 months ago by Dario Strbenac1.4k
gravatar for holgerbrandl
13 months ago by
holgerbrandl10 wrote:

If you don't have to may probes of differeing length (which does not allow for direct usage of vmatchPDict you could aksi simply iterate with purr/dplyr


mappedPrimers = primerInfo %>% 
   mutate(mapped=map(primer_seq, ~, Dmelanogaster)))) %>%


ADD COMMENTlink written 13 months ago by holgerbrandl10
gravatar for Hotz, Hans-Rudolf
21 months ago by
Hotz, Hans-Rudolf390 wrote:

Hi Dario

Have you looked into the QuasR package (or just the Rbowtie package)? This allows you to use a 'complex aligner' without using a system call.


Regards, Hans-Rudolf


ADD COMMENTlink written 21 months ago by Hotz, Hans-Rudolf390

This solution involves formatting the reads into a FASTA file, generating a genome index, and writing to disk BAM files and importing them into R. The other proposed answer solves the problem more directly and without reading and writing of results to and from the disk.

ADD REPLYlink written 21 months ago by Dario Strbenac1.4k
gravatar for Martin Morgan
21 months ago by
Martin Morgan ♦♦ 21k
United States
Martin Morgan ♦♦ 21k wrote:

vmatchPDict() returns a GRanges with mcols 'index' for the corresponding probe. I guess you'd match the probes and their reverse complement.

ADD COMMENTlink written 21 months ago by Martin Morgan ♦♦ 21k

This solution does not work because the probes do not all have the same lengths and vmatchPDict results in an error when it tries to coerce the pdict variable into a PDict object. I need to create a PDict object with tb.start and tb.end settings because of this, but if I do, vmatchPDict gives an error, because it only accepts a DNAStringSet for the pdict parameter, not PDict objects.

e.g. (artificial)

> vmatchPDict(DNAStringSet(c("GATC", "TAG")), Hsapiens)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr,  :
  element 2 in Trusted Band has a different length than first element
> vmatchPDict(PDict(DNAStringSet(c("GATC", "TAG")), tb.start=1, tb.end =3), Hsapiens)
Error in .local(pdict, subject, max.mismatch, min.mismatch, fixed, algorithm,  :
  'pdict' must be a DNAStringSet object

It's bizarre that the parameter is named pdict, but it results in an error if the user provides a PDict object.

Could an example also be added to the Examples section of the documentation of how to restrict the search to only chromosomes of interest ? There doesn't seem to be any way to only search chr1, ..., chr22, chrX, chrY, chrM for my human analysis scenario.

Actually, vmatchPDict is just a code stub in BSgenome version 1.40.1. Why is it even present in the release version of BSgenome ?

> vmatchPDict(DNAStringSet("GATC", Hsapiens))
Error in .local(pdict, subject, max.mismatch, min.mismatch, with.indels,  :
  vmatchPDict() is not ready yet, sorry

Code which doesn't work should only be in the development version (and be finished before the next Bioconductor release version).

ADD REPLYlink written 21 months ago by Dario Strbenac1.4k

group probes by length and analyze separately? 

vmatchPDict is a generic, generic arguments can be given generic names like 'x' or 'object' which lack semantic content. Often though argument names are influenced by a combination of most-common use, historical constraint (to avoid breaking user code), or developer aesthetic (e.g., symmetry across related functions).

It's not true that vmatchPDict is a code stub, it is a generic with 4 methods, including the method (for signature BSgenome) that you invoke successfully on a BSgenome subject.

ADD REPLYlink written 21 months ago by Martin Morgan ♦♦ 21k

It might also be fun to try the non-Bioconductor package AhoCorasickTrie::AhoCorasickSearch().

ADD REPLYlink written 21 months ago by Martin Morgan ♦♦ 21k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 302 users visited in the last hour