Question

Ranges from a fasta file

1

Entering edit mode

Konstantinos Yeles ▴ 80

@konstantinos-yeles-8961

Last seen 4 months ago

Italy

I imported a fasta file using readDNAStringSet :

fasta <- readDNAStringSet("~/piRbase.fa")
fasta@ranges
   group     start       end width           names
1      1  28720989  28721013    25 piR-hsa-1000547
2      1  42817439  42817467    29 piR-hsa-1483697
3      1  43210296  43210328    33 piR-hsa-1497078
4      1  47285542  47285569    28 piR-hsa-1635975
5      1  49067139  49067170    32 piR-hsa-1696655
6      1  50110086  50110115    30 piR-hsa-1732223

fasta
  A DNAStringSet instance of length 29
     width seq                                                                       names               
 [1]    25 TAACCAGGGACAGCAGCAAAACATG                                       piR-hsa-1000547
 [2]    29 CAGAAATTCCTAAAGAAGAGAAGGACCCT                                   piR-hsa-1483697
 [3]    33 TCTGAACTTTTGAAACCTTGTGTGGGATTGATG                               piR-hsa-1497078
 [4]    28 GTTACGGTTACTGTGGGCAAGTTTGAGA                                    piR-hsa-1635975
 [5]    32 TGAAAGGGCTGTTGTAATAGTGGATGATCGTG                                piR-hsa-1696655

I'm trying to understand what are the fasta@ranges and how are created because these are not the "correct" coordinates in the genome.
I have another bed file with the "correct" coordinates that I import using "rtracklayer" package to a Granges object. How is it possible to "merge" the information between the two files?

example bed:

GRanges object with 53 ranges and 2 metadata columns:
                  seqnames              ranges strand |            name     score
                     <Rle>           <IRanges>  <Rle> |     <character> <numeric>
   piR-hsa-222502     chr2 182983305-182983332      - |  piR-hsa-222502         0
   piR-hsa-470448    chr18   77016562-77016586      - |  piR-hsa-470448         0
   piR-hsa-504532    chr11   65211877-65211907      + |  piR-hsa-504532         0
   piR-hsa-612712    chr15   68290091-68290122      + |  piR-hsa-612712         0
   piR-hsa-1000547    chr17   35820399-35820423      + | piR-hsa-1000547         1

biostrings fasta genomicranges granges • 2.7k views

ADD COMMENT • link updated 5.2 years ago by Martin Morgan 25k • written 5.2 years ago by Konstantinos Yeles ▴ 80

score 2 · Accepted Answer · 2019-02-14

2

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 6 days ago

United States

Generally, slots (accessed with @) are 'internal business' of the object, and not any of our (user) business.

I think you want to use getSeq(), like

> dna = DNAStringSet(c(a="ATGC", b="TGCA"))
> gr = GRanges(c("a:1-2", "b:3-4"))
> getSeq(dna, gr)
  A DNAStringSet instance of length 2
    width seq
[1]     2 AT
[2]     2 CA

ADD COMMENT • link 5.2 years ago Martin Morgan 25k

0

Entering edit mode

rtracklayer::getSeq(fasta,gr_obj)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'getSeq' for signature'DNAStringSet'
> Biostrings::getSeq(fasta,gr_obj)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'getSeq' for signature 'DNAStringSet'

ADD REPLY • link 5.2 years ago Konstantinos Yeles ▴ 80

1

Entering edit mode

The method is actually in BSgenome; sorry for not being clear.

ADD REPLY • link 5.2 years ago Martin Morgan 25k

0

Entering edit mode

Thank you! Probably naive problem but I get :

BSgenome::getSeq(fasta,gr_obj)
Error: subscript contains invalid names



 table(names(gr_obj)%in%names(fasta))

TRUE 
  53 
sum(names(fasta)%in%names(gr_obj))
[1] 29

any suggestions on where to identify the problem?

ADD REPLY • link 5.2 years ago Konstantinos Yeles ▴ 80

1

Entering edit mode

A GRanges has seqnames() (e.g., chromsomes) as well as names() (e.g., exon IDs). You want to ensure that

all(seqnames(gr_obj) %in% names(fasta))

ADD REPLY • link 5.2 years ago Martin Morgan 25k

0

Entering edit mode

So names in the "fasta" file should have chr names and not something like smallRNA IDs e.g.:

 >piR-hsa-1

AGAAGACATTCGTGGAGGCGTC

>piR-hsa-2

ACGCCTCCACGTAGTGTCTT

>piR-hsa-3

ACGCCTCCACGAGTGTCTT

ADD REPLY • link 5.2 years ago Konstantinos Yeles ▴ 80

1

Entering edit mode

Backing up, what is it that you want to do? I had not read your question carefully enough, and though that you had sequences, e.g., of chromosomes, and you wanted to extract sub-sequences, e.g., of small RNAs, whose genomic coordinates you knew.

You could for instance add a column to your GRanges object

gr_obj$seq = fasta[names(gr_obj)]

or similar

ADD REPLY • link 5.2 years ago Martin Morgan 25k

0

Entering edit mode

Thank you for the time you spent.

Well, I try to make an object from 3 different files (the fasta with the sequences, the gr_obj with coordinates in the genome and another table with read sequences from smallRNAseq) in order to plot something like coverage but for each different sequence of individual small RNA. I found this example and I'll recreate the "Plotting genomic ranges". Thus showing differences in read length and 3' or 5' ends of identified smallRNAs. Currently, I'm using a combination of stringr, dplyr and granges.

ADD REPLY • link 5.2 years ago Konstantinos Yeles ▴ 80