Search
Question: Mapping Probes to a Genome
0
2.3 years ago by
Dario Strbenac1.4k
Australia
Dario Strbenac1.4k wrote:

Is there a simple way to map a list of probe sequences to a genome ? The use case is probes of a custom-designed NanoString assay. The RLF file provided by NanoString has a gene symbol and probe sequence, but not the genomic coordinates of the probe. I don't want the hassle of running bowtie for a small set of sequences (about 200) which map to the genome with no mismatches and I'd like to include the mapping procedure in an R Markdown document without using a complex aligner through system() calls.

For another similar question asked five years ago, it was recommended to map to the transcriptome with vwhichPDict but I want the genomic coordinates, so I'd like to map with BSgenome.Hsapiens.UCSC.hg19 (both strands) and obtain a GRangesList result.

modified 19 months ago by holgerbrandl10 • written 2.3 years ago by Dario Strbenac1.4k
1
19 months ago by
holgerbrandl10
Germany
holgerbrandl10 wrote:

If you don't have to may probes of differeing length (which does not allow for direct usage of vmatchPDict you could aksi simply iterate with purr/dplyr

require(purrr)
require(dplyr)
require(Biostrings)
require(BSgenome.Dmelanogaster.UCSC.dm3)

mappedPrimers = primerInfo %>%
mutate(mapped=map(primer_seq, ~ as.data.frame(vmatchPDict(DNAStringSet(.x), Dmelanogaster)))) %>%
unnest(mapped)

0
2.3 years ago by
Switzerland
Hotz, Hans-Rudolf400 wrote:

Hi Dario

Have you looked into the QuasR package (or just the Rbowtie package)? This allows you to use a 'complex aligner' without using a system call.

Regards, Hans-Rudolf

This solution involves formatting the reads into a FASTA file, generating a genome index, and writing to disk BAM files and importing them into R. The other proposed answer solves the problem more directly and without reading and writing of results to and from the disk.

0
2.3 years ago by
Martin Morgan ♦♦ 22k
United States
Martin Morgan ♦♦ 22k wrote:

vmatchPDict() returns a GRanges with mcols 'index' for the corresponding probe. I guess you'd match the probes and their reverse complement.

This solution does not work because the probes do not all have the same lengths and vmatchPDict results in an error when it tries to coerce the pdict variable into a PDict object. I need to create a PDict object with tb.start and tb.end settings because of this, but if I do, vmatchPDict gives an error, because it only accepts a DNAStringSet for the pdict parameter, not PDict objects.

e.g. (artificial)

> vmatchPDict(DNAStringSet(c("GATC", "TAG")), Hsapiens)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr,  :
element 2 in Trusted Band has a different length than first element
> vmatchPDict(PDict(DNAStringSet(c("GATC", "TAG")), tb.start=1, tb.end =3), Hsapiens)
Error in .local(pdict, subject, max.mismatch, min.mismatch, fixed, algorithm,  :
'pdict' must be a DNAStringSet object

It's bizarre that the parameter is named pdict, but it results in an error if the user provides a PDict object.

Could an example also be added to the Examples section of the documentation of how to restrict the search to only chromosomes of interest ? There doesn't seem to be any way to only search chr1, ..., chr22, chrX, chrY, chrM for my human analysis scenario.

Actually, vmatchPDict is just a code stub in BSgenome version 1.40.1. Why is it even present in the release version of BSgenome ?

> vmatchPDict(DNAStringSet("GATC", Hsapiens))
Error in .local(pdict, subject, max.mismatch, min.mismatch, with.indels,  :
vmatchPDict() is not ready yet, sorry

Code which doesn't work should only be in the development version (and be finished before the next Bioconductor release version).

group probes by length and analyze separately?

vmatchPDict is a generic, generic arguments can be given generic names like 'x' or 'object' which lack semantic content. Often though argument names are influenced by a combination of most-common use, historical constraint (to avoid breaking user code), or developer aesthetic (e.g., symmetry across related functions).

It's not true that vmatchPDict is a code stub, it is a generic with 4 methods, including the method (for signature BSgenome) that you invoke successfully on a BSgenome subject.

It might also be fun to try the non-Bioconductor package AhoCorasickTrie::AhoCorasickSearch().