Question

Map siRNA-Sequence to Gene

antje.janosch • 0


Hi there,

I am very new to Bioconductor and to bioinformatics in general. I have a set of siRNA sequences and need to find their target genes in the human genome.

I got the advice to download the genome via AnnotationHub.

I managed to download the file "Homo_sapiens.GRCh37.74.dna.toplevel.fa" via AnnotationHub, though I am not sure that is the right source to use.

I was a bit confused by the error message:

In curl::curl_fetch_disk(url, x$path, handle = handle) : progress callback must return boolean

The data object now looks like this:

> res
class: FaFile
path: /Users/niederle/.AnnotationHub/12356
index: /Users/niederle/.AnnotationHub/16142
isOpen: TRUE
yieldSize: NA

Next, I was told to convert that into a Biostrings object, but I did not manage to do that.

If I call:

> genome <- readDNAStringSet("/Users/xyz/.AnnotationHub/12356")
> genome
  A DNAStringSet instance of length 346
          width seq                                                                                                                              names               
  [1] 249250621 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 1 dna:chromosome ...
  [2] 135534747 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 10 dna:chromosome...
  [3] 135006516 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 11 dna:chromosome...
  [4] 133851895 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 12 dna:chromosome...
  [5] 115169878 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 13 dna:chromosome...
  ...       ... ...
[342]         0                                                                                                                                  /
[343]      1006 AABBCCDDGNGHHHKKMMNNRRSSTTVVWWYYYTKCMAABBBCCDD-GGHHKKMWMNSNNCRR...DDHGGHWHRGKAKMMNNRRSSTTVVWSWKYYWGC.TNYVMA.YST-VSH+D+-W-.T.KC.W
Error in nchar(snippet_name) : invalid multibyte string 1

I don't know how to continue. Can anybody give me some useful hints? In particular, I don't know how to extract the gene information.

sequence annotationhub biostrings • 2.2k views

updated 9.4 years ago by Hervé Pagès • written 9.4 years ago by antje.janosch

score 0 · Answer 1 · 2015-08-13

Hi,

If you want to align your siRNA sequences to the human genome, you can try the string matching tools from the Biostrings package. I suggest you have a look at the "Efficient genome searching with Biostrings and the BSgenome data packages" vignette in the BSgenome package. The purpose of the alignment step is to find the locations of your siRNA sequences on the reference genome.
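As a rough illustration of the matching step, here is a minimal sketch using matchPattern() from Biostrings. The two-sequence "genome", the siRNA sequences, and the helper name find_hits are all made up for the example. It also assumes you want exact matches and do not know which strand your siRNA sequences correspond to, so both strands are searched.

```r
library(Biostrings)
library(GenomicRanges)

# Toy "genome" standing in for the real reference (hypothetical sequences).
genome <- DNAStringSet(c(
  chrA = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCGATTACGCTAGCTAGGATCCAAGT",
  chrB = "TTGACCGGTTAACCGGATATCGCGCTAGCATCGATCGGCTAACGTTAGCCGGATTTAC"
))

# Hypothetical siRNA sequences, given as RNA.
sirna <- RNAStringSet(c(si1 = "GCCGAUAGCUAGGCUUAAC",
                        si2 = "AGCCGAUCGAUGCUAGCGC"))

# Convert RNA to DNA (U becomes T) so it can be matched against the genome.
query <- DNAStringSet(sirna)
names(query) <- names(sirna)

# find_hits is a hypothetical helper: exact matches of every query on
# both strands of every sequence, returned as a GRanges object.
find_hits <- function(query, genome) {
  rows <- list()
  for (j in seq_along(genome)) {
    for (i in seq_along(query)) {
      for (s in c("+", "-")) {
        # A match on the minus strand is a plus-strand match of the
        # reverse complement.
        pat <- if (s == "+") query[[i]] else reverseComplement(query[[i]])
        m <- matchPattern(pat, genome[[j]])
        if (length(m) > 0) {
          rows[[length(rows) + 1]] <- data.frame(
            seqnames = names(genome)[j],
            start = start(m), end = end(m),
            strand = s, sirna = names(query)[i]
          )
        }
      }
    }
  }
  if (length(rows) == 0) return(GRanges())
  makeGRangesFromDataFrame(do.call(rbind, rows), keep.extra.columns = TRUE)
}

hits <- find_hits(query, genome)
hits
```

With a real genome the subject would be one chromosome at a time rather than a toy DNAStringSet, and matchPattern() has a max.mismatch argument if you want to tolerate mismatches.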

Once you have these locations (typically as a GRanges object), you'll need to map them to a gene. To do this, get the appropriate gene model as a TxDb object and use findOverlaps() to find the overlaps between your siRNA locations and the gene regions. I suggest you have a look at the documentation in the GenomicFeatures and GenomicRanges packages to familiarize yourself with TxDb objects and GRanges objects.
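The overlap step could look roughly like the sketch below. To keep it self-contained, the gene regions are a hand-made GRanges object with invented coordinates and gene IDs. With a real gene model you would obtain the gene regions from the TxDb object instead (for example with genes() from GenomicFeatures). The siRNA locations are likewise invented.

```r
library(GenomicRanges)

# Hypothetical siRNA locations, e.g. the result of the matching step.
hits <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(150, 5000, 320), width = 19),
  strand = c("+", "-", "+"),
  sirna = c("si1", "si2", "si3")
)

# Hypothetical gene regions standing in for the gene model.
genes <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(start = c(100, 300), end = c(1000, 900)),
  strand = c("+", "-"),
  gene_id = c("geneA", "geneB")
)

# Strand is ignored here on the assumption that it is not known which
# strand the siRNA sequences were reported on; drop ignore.strand if it is.
ov <- findOverlaps(hits, genes, ignore.strand = TRUE)

# One row per (siRNA, gene) overlap; siRNAs outside any gene are absent.
mapping <- data.frame(
  sirna = hits$sirna[queryHits(ov)],
  gene_id = genes$gene_id[subjectHits(ov)]
)
mapping
```

One thing to watch for with real data: the sequence names must agree between the two objects. An Ensembl FASTA file names chromosomes "1", "2", ... while UCSC-based TxDb packages use "chr1", "chr2", ..., and findOverlaps() will not match across the two naming styles.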

Cheers,

H.