Question: Ranges from a fasta file
1
gravatar for Konstantinos Yeles
7 months ago by
University of Salerno, Salerno, Italy
Konstantinos Yeles20 wrote:

I imported a fasta file using readDNAStringSet :

fasta <- readDNAStringSet("~/piRbase.fa")
fasta@ranges
   group     start       end width           names
1      1  28720989  28721013    25 piR-hsa-1000547
2      1  42817439  42817467    29 piR-hsa-1483697
3      1  43210296  43210328    33 piR-hsa-1497078
4      1  47285542  47285569    28 piR-hsa-1635975
5      1  49067139  49067170    32 piR-hsa-1696655
6      1  50110086  50110115    30 piR-hsa-1732223

fasta
  A DNAStringSet instance of length 29
     width seq                                                                       names               
 [1]    25 TAACCAGGGACAGCAGCAAAACATG                                       piR-hsa-1000547
 [2]    29 CAGAAATTCCTAAAGAAGAGAAGGACCCT                                   piR-hsa-1483697
 [3]    33 TCTGAACTTTTGAAACCTTGTGTGGGATTGATG                               piR-hsa-1497078
 [4]    28 GTTACGGTTACTGTGGGCAAGTTTGAGA                                    piR-hsa-1635975
 [5]    32 TGAAAGGGCTGTTGTAATAGTGGATGATCGTG                                piR-hsa-1696655
  1. I'm trying to understand what are the fasta@ranges and how are created because these are not the "correct" coordinates in the genome.
  2. I have another bed file with the "correct" coordinates that I import using "rtracklayer" package to a Granges object. How is it possible to "merge" the information between the two files?

example bed:

GRanges object with 53 ranges and 2 metadata columns:
                  seqnames              ranges strand |            name     score
                     <Rle>           <IRanges>  <Rle> |     <character> <numeric>
   piR-hsa-222502     chr2 182983305-182983332      - |  piR-hsa-222502         0
   piR-hsa-470448    chr18   77016562-77016586      - |  piR-hsa-470448         0
   piR-hsa-504532    chr11   65211877-65211907      + |  piR-hsa-504532         0
   piR-hsa-612712    chr15   68290091-68290122      + |  piR-hsa-612712         0
   piR-hsa-1000547    chr17   35820399-35820423      + | piR-hsa-1000547         1
ADD COMMENTlink modified 7 months ago by Martin Morgan ♦♦ 23k • written 7 months ago by Konstantinos Yeles20
Answer: Ranges from a fasta file
1
gravatar for Martin Morgan
7 months ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

Generally, slots (accessed with @) are 'internal business' of the object, and not any of our (user) business.

I think you want to use getSeq(), like

> dna = DNAStringSet(c(a="ATGC", b="TGCA"))
> gr = GRanges(c("a:1-2", "b:3-4"))
> getSeq(dna, gr)
  A DNAStringSet instance of length 2
    width seq
[1]     2 AT
[2]     2 CA
ADD COMMENTlink written 7 months ago by Martin Morgan ♦♦ 23k
rtracklayer::getSeq(fasta,gr_obj)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'getSeq' for signature'DNAStringSet'
> Biostrings::getSeq(fasta,gr_obj)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'getSeq' for signature 'DNAStringSet'
ADD REPLYlink written 7 months ago by Konstantinos Yeles20
1

The method is actually in BSgenome; sorry for not being clear.

ADD REPLYlink written 7 months ago by Martin Morgan ♦♦ 23k

Thank you! Probably naive problem but I get :

BSgenome::getSeq(fasta,gr_obj)
Error: subscript contains invalid names



 table(names(gr_obj)%in%names(fasta))

TRUE 
  53 
sum(names(fasta)%in%names(gr_obj))
[1] 29

any suggestions on where to identify the problem?

ADD REPLYlink modified 7 months ago • written 7 months ago by Konstantinos Yeles20
1

A GRanges has seqnames() (e.g., chromsomes) as well as names() (e.g., exon IDs). You want to ensure that

all(seqnames(gr_obj) %in% names(fasta))
ADD REPLYlink written 7 months ago by Martin Morgan ♦♦ 23k

So names in the "fasta" file should have chr names and not something like smallRNA IDs e.g.:

 >piR-hsa-1

AGAAGACATTCGTGGAGGCGTC

>piR-hsa-2

ACGCCTCCACGTAGTGTCTT

>piR-hsa-3

ACGCCTCCACGAGTGTCTT
ADD REPLYlink written 7 months ago by Konstantinos Yeles20
1

Backing up, what is it that you want to do? I had not read your question carefully enough, and though that you had sequences, e.g., of chromosomes, and you wanted to extract sub-sequences, e.g., of small RNAs, whose genomic coordinates you knew.

You could for instance add a column to your GRanges object

gr_obj$seq = fasta[names(gr_obj)]

or similar

ADD REPLYlink written 7 months ago by Martin Morgan ♦♦ 23k

Thank you for the time you spent.

Well, I try to make an object from 3 different files (the fasta with the sequences, the gr_obj with coordinates in the genome and another table with read sequences from smallRNAseq) in order to plot something like coverage but for each different sequence of individual small RNA. I found this example and I'll recreate the "Plotting genomic ranges". Thus showing differences in read length and 3' or 5' ends of identified smallRNAs. Currently, I'm using a combination of stringr, dplyr and granges.

ADD REPLYlink written 7 months ago by Konstantinos Yeles20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 226 users visited in the last hour