Search
Question: Mapping Probes to a Genome
0
gravatar for Dario Strbenac
14 months ago by
Dario Strbenac1.4k
Australia
Dario Strbenac1.4k wrote:

Is there a simple way to map a list of probe sequences to a genome ? The use case is probes of a custom-designed NanoString assay. The RLF file provided by NanoString has a gene symbol and probe sequence, but not the genomic coordinates of the probe. I don't want the hassle of running bowtie for a small set of sequences (about 200) which map to the genome with no mismatches and I'd like to include the mapping procedure in an R Markdown document without using a complex aligner through system() calls.

For another similar question asked five years ago, it was recommended to map to the transcriptome with vwhichPDict but I want the genomic coordinates, so I'd like to map with BSgenome.Hsapiens.UCSC.hg19 (both strands) and obtain a GRangesList result.

ADD COMMENTlink modified 6 months ago by holgerbrandl10 • written 14 months ago by Dario Strbenac1.4k
1
gravatar for holgerbrandl
6 months ago by
holgerbrandl10
Germany
holgerbrandl10 wrote:

If you don't have to may probes of differeing length (which does not allow for direct usage of vmatchPDict you could aksi simply iterate with purr/dplyr

require(purrr)
require(dplyr)
require(Biostrings)
require(BSgenome.Dmelanogaster.UCSC.dm3)

mappedPrimers = primerInfo %>% 
   mutate(mapped=map(primer_seq, ~ as.data.frame(vmatchPDict(DNAStringSet(.x), Dmelanogaster)))) %>%
   unnest(mapped)

 

ADD COMMENTlink written 6 months ago by holgerbrandl10
0
gravatar for Hotz, Hans-Rudolf
14 months ago by
Switzerland
Hotz, Hans-Rudolf380 wrote:

Hi Dario

Have you looked into the QuasR package (or just the Rbowtie package)? This allows you to use a 'complex aligner' without using a system call.

 

Regards, Hans-Rudolf

 

ADD COMMENTlink written 14 months ago by Hotz, Hans-Rudolf380

This solution involves formatting the reads into a FASTA file, generating a genome index, and writing to disk BAM files and importing them into R. The other proposed answer solves the problem more directly and without reading and writing of results to and from the disk.

ADD REPLYlink written 14 months ago by Dario Strbenac1.4k
0
gravatar for Martin Morgan
14 months ago by
Martin Morgan ♦♦ 20k
United States
Martin Morgan ♦♦ 20k wrote:

vmatchPDict() returns a GRanges with mcols 'index' for the corresponding probe. I guess you'd match the probes and their reverse complement.

ADD COMMENTlink written 14 months ago by Martin Morgan ♦♦ 20k

This solution does not work because the probes do not all have the same lengths and vmatchPDict results in an error when it tries to coerce the pdict variable into a PDict object. I need to create a PDict object with tb.start and tb.end settings because of this, but if I do, vmatchPDict gives an error, because it only accepts a DNAStringSet for the pdict parameter, not PDict objects.

e.g. (artificial)

> vmatchPDict(DNAStringSet(c("GATC", "TAG")), Hsapiens)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr,  :
  element 2 in Trusted Band has a different length than first element
> vmatchPDict(PDict(DNAStringSet(c("GATC", "TAG")), tb.start=1, tb.end =3), Hsapiens)
Error in .local(pdict, subject, max.mismatch, min.mismatch, fixed, algorithm,  :
  'pdict' must be a DNAStringSet object

It's bizarre that the parameter is named pdict, but it results in an error if the user provides a PDict object.

Could an example also be added to the Examples section of the documentation of how to restrict the search to only chromosomes of interest ? There doesn't seem to be any way to only search chr1, ..., chr22, chrX, chrY, chrM for my human analysis scenario.

Actually, vmatchPDict is just a code stub in BSgenome version 1.40.1. Why is it even present in the release version of BSgenome ?

> vmatchPDict(DNAStringSet("GATC", Hsapiens))
Error in .local(pdict, subject, max.mismatch, min.mismatch, with.indels,  :
  vmatchPDict() is not ready yet, sorry

Code which doesn't work should only be in the development version (and be finished before the next Bioconductor release version).

ADD REPLYlink written 14 months ago by Dario Strbenac1.4k

group probes by length and analyze separately? 

vmatchPDict is a generic, generic arguments can be given generic names like 'x' or 'object' which lack semantic content. Often though argument names are influenced by a combination of most-common use, historical constraint (to avoid breaking user code), or developer aesthetic (e.g., symmetry across related functions).

It's not true that vmatchPDict is a code stub, it is a generic with 4 methods, including the method (for signature BSgenome) that you invoke successfully on a BSgenome subject.

ADD REPLYlink written 14 months ago by Martin Morgan ♦♦ 20k

It might also be fun to try the non-Bioconductor package AhoCorasickTrie::AhoCorasickSearch().

ADD REPLYlink written 14 months ago by Martin Morgan ♦♦ 20k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 138 users visited in the last hour