Ranges from a fasta file
1
1
Entering edit mode
@konstantinos-yeles-8961
Last seen 6 weeks ago
University of Salerno, Salerno, Italy

I imported a fasta file using readDNAStringSet :

fasta <- readDNAStringSet("~/piRbase.fa")
fasta@ranges
   group     start       end width           names
1      1  28720989  28721013    25 piR-hsa-1000547
2      1  42817439  42817467    29 piR-hsa-1483697
3      1  43210296  43210328    33 piR-hsa-1497078
4      1  47285542  47285569    28 piR-hsa-1635975
5      1  49067139  49067170    32 piR-hsa-1696655
6      1  50110086  50110115    30 piR-hsa-1732223

fasta
  A DNAStringSet instance of length 29
     width seq                                                                       names               
 [1]    25 TAACCAGGGACAGCAGCAAAACATG                                       piR-hsa-1000547
 [2]    29 CAGAAATTCCTAAAGAAGAGAAGGACCCT                                   piR-hsa-1483697
 [3]    33 TCTGAACTTTTGAAACCTTGTGTGGGATTGATG                               piR-hsa-1497078
 [4]    28 GTTACGGTTACTGTGGGCAAGTTTGAGA                                    piR-hsa-1635975
 [5]    32 TGAAAGGGCTGTTGTAATAGTGGATGATCGTG                                piR-hsa-1696655
  1. I'm trying to understand what are the fasta@ranges and how are created because these are not the "correct" coordinates in the genome.
  2. I have another bed file with the "correct" coordinates that I import using "rtracklayer" package to a Granges object. How is it possible to "merge" the information between the two files?

example bed:

GRanges object with 53 ranges and 2 metadata columns:
                  seqnames              ranges strand |            name     score
                     <Rle>           <IRanges>  <Rle> |     <character> <numeric>
   piR-hsa-222502     chr2 182983305-182983332      - |  piR-hsa-222502         0
   piR-hsa-470448    chr18   77016562-77016586      - |  piR-hsa-470448         0
   piR-hsa-504532    chr11   65211877-65211907      + |  piR-hsa-504532         0
   piR-hsa-612712    chr15   68290091-68290122      + |  piR-hsa-612712         0
   piR-hsa-1000547    chr17   35820399-35820423      + | piR-hsa-1000547         1
biostrings fasta genomicranges granges • 1.0k views
ADD COMMENT
1
Entering edit mode
@martin-morgan-1513
Last seen 10 weeks ago
United States

Generally, slots (accessed with @) are 'internal business' of the object, and not any of our (user) business.

I think you want to use getSeq(), like

> dna = DNAStringSet(c(a="ATGC", b="TGCA"))
> gr = GRanges(c("a:1-2", "b:3-4"))
> getSeq(dna, gr)
  A DNAStringSet instance of length 2
    width seq
[1]     2 AT
[2]     2 CA
ADD COMMENT
0
Entering edit mode
rtracklayer::getSeq(fasta,gr_obj)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'getSeq' for signature'DNAStringSet'
> Biostrings::getSeq(fasta,gr_obj)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'getSeq' for signature 'DNAStringSet'
ADD REPLY
1
Entering edit mode

The method is actually in BSgenome; sorry for not being clear.

ADD REPLY
0
Entering edit mode

Thank you! Probably naive problem but I get :

BSgenome::getSeq(fasta,gr_obj)
Error: subscript contains invalid names



 table(names(gr_obj)%in%names(fasta))

TRUE 
  53 
sum(names(fasta)%in%names(gr_obj))
[1] 29

any suggestions on where to identify the problem?

ADD REPLY
1
Entering edit mode

A GRanges has seqnames() (e.g., chromsomes) as well as names() (e.g., exon IDs). You want to ensure that

all(seqnames(gr_obj) %in% names(fasta))
ADD REPLY
0
Entering edit mode

So names in the "fasta" file should have chr names and not something like smallRNA IDs e.g.:

 >piR-hsa-1

AGAAGACATTCGTGGAGGCGTC

>piR-hsa-2

ACGCCTCCACGTAGTGTCTT

>piR-hsa-3

ACGCCTCCACGAGTGTCTT
ADD REPLY
1
Entering edit mode

Backing up, what is it that you want to do? I had not read your question carefully enough, and though that you had sequences, e.g., of chromosomes, and you wanted to extract sub-sequences, e.g., of small RNAs, whose genomic coordinates you knew.

You could for instance add a column to your GRanges object

gr_obj$seq = fasta[names(gr_obj)]

or similar

ADD REPLY
0
Entering edit mode

Thank you for the time you spent.

Well, I try to make an object from 3 different files (the fasta with the sequences, the gr_obj with coordinates in the genome and another table with read sequences from smallRNAseq) in order to plot something like coverage but for each different sequence of individual small RNA. I found this example and I'll recreate the "Plotting genomic ranges". Thus showing differences in read length and 3' or 5' ends of identified smallRNAs. Currently, I'm using a combination of stringr, dplyr and granges.

ADD REPLY

Login before adding your answer.

Traffic: 337 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6