Indexing FASTA file using RSamtools
1
0
Entering edit mode
rbenel ▴ 40
@rbenel-13642
Last seen 2.3 years ago
Israel

Hi! I am trying to get the promoter sequences of a less well known organism and I am using a simple script I have used many times for model organisms.

I am using a GFF file downloaded from NCBI, and I have successfully made the txdb object

But since I don't have a BSgenome for this organism, the vignette states that I can use a FaFile instead, but here is where I am having trouble with Rsamtools. Which function indexes the FASTA file? scanFaIndex?


#use gff from NCBI 

gffFile <- read.gff("MINUs1PgV_16T.gff3", na.strings = c(".", "?"), GFF3 = TRUE)


#Save the GFF in a TxDb object
TxDb_PgV <- makeTxDbFromGFF(gffFile, format = c("auto"))

#Save the TxDb object just in case
saveDb(TxDb_PgV, file="TxDb_PgV_16T.sqlite")

#rename for convenience
txdb <- TxDb_PgV


#From here not working
#Create the Fafile from a fasta I downloaded from ncbi
genome <- FaFile("16TTT.fa")

Index2FASTA <- scanFaIndex(genome))

#get the promoter seq for all of the genes in the 
PromoterSeq <- GenomicFeatures::getPromoterSeq(PromoterRegions,
                                              genome, use.names = TRUE)


#output the file 
Biostrings::writeXStringSet(PromoterSeq, file = "promoterSeq.txt")

Thanks Martin!

Rsamtools GenomicFeatures • 4.3k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

It's a good practice to read the help page for functions that you are using. If you had done so for scanFaIndex, you would have seen this:

Value:

     'indexFa' visits the path in 'file' and create an index file at
     the same location but with extension '.fai').

Which I believe answers your question?

ADD COMMENT
0
Entering edit mode

Hi James, I did read the help page for the functions, but it is not clear to me if the scanFAIndex function creates and index or does it just recognizes an index input in the same location...

This is what I am asking.

ADD REPLY
0
Entering edit mode

It says

'scanFaIndex' reads the sequence names and and widths of recorded
     in an indexed fasta file, returning the information as a 'GRanges'
     object.

Which seems pretty clear to me that the file has to be indexed first. Is there something specific in that description that you can point to that is unclear? Maybe it should be updated.

ADD REPLY
0
Entering edit mode

OK James.

I am sorry, I am just trying to figure out which line in the script is not working, because I am still having trouble running the code.

Thank you for your help

ADD REPLY
0
Entering edit mode

One other thing you should always include is the output you get from R when something doesn't work. It's difficult to diagnose an error if we can't see what it was.

ADD REPLY
0
Entering edit mode

Yes, you are right. I didn't think the error was so informative.

Error in value[[3L]](cond) :
   record 1 (NC_021312.1:-73-2126) was truncated
  file: file/location/16T.fasta
ADD REPLY
0
Entering edit mode

I should have been clearer. You are using data that is mysterious in origin (saying you got it from NCBI doesn't tell us much), and you are not showing the code you use and the error. If people can't see exactly where the error crops up, and can't get the data to reproduce, then all we are left with is making guesses.

The error message is actually very informative. It's telling you that your data is truncated, which means exactly what it says - R was expecting the data to be of a particular length, is giving an error when that expectation isn't met.

ADD REPLY
0
Entering edit mode

Hi James,

Data from NCBI

Line of code causing error message:

#get the promoter seq for all of the genes in the 
PromoterSeq <- GenomicFeatures::getPromoterSeq(PromoterRegions,
                                              genome, use.names = TRUE)

Error produced:

Error in value[[3L]](cond) :
   record 1 (NC_021312.1:-73-2126) was truncated
  file: file/location/16T.fasta

I would be interested to know how to fix the problem. I understand the expectation isn't met, so what can I do to resolve it. As this is a support forum I was hoping maybe someone ran into the same issue and can help fix it.

Good day

ADD REPLY
0
Entering edit mode

It could be that the file is incomplete (is it the same length, e.g., as the version at NCBI?) or that the file is actually incorrect (you'd have to look at the file, maybe using a plain text editor or readLines() plus other manipulations, to see if the record is somehow 'truncated', different in some way from other records in the file).

It could also be that you're on Windows, and that the file is very large?

But when I go to your 'Data from NCBI' link, I don't really see which file you are downloading?

ADD REPLY
0
Entering edit mode

Hi Martin!

Using the link I sent, I selected "Send to" --> "Complete record" --> "File" --> gff3.

I did not make any changes to the file, but yes I am using Windows, but the file is only 151 KB.

Screen shot of file in a text editor

ADD REPLY
0
Entering edit mode

But how do you get your fasta file?

ADD REPLY
0
Entering edit mode

The same way. "Send to" --> "Complete record" --> "File" --> FASTA

ADD REPLY
0
Entering edit mode

Jumping out a bit to get some more space...

I downloaded the fasta file and it seems to be intact and with a single sequence

> options(Biostrings.coloring=FALSE)
> fa = rtracklayer::import("~/Downloads/sequence.fasta")
> fa
DNAStringSet object of length 1:
     width seq                                              names
[1] 459984 AGGTCGCCAGAGGCTAAGAAGAC...GAATCCAATTATTTTTTAATTA KC662249.1 Phaeoc...

Likewise for the gff3 file

> gff3 = rtracklayer::import("~/Downloads/sequence.gff3")
> length(gff3)
[1] 893
> head(gff3, 2)
GRanges object with 2 ranges and 17 metadata columns:
        seqnames    ranges strand |   source     type     score     phase
           <Rle> <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  [1] KC662249.1  1-459984      + |  Genbank   region        NA      <NA>
  [2] KC662249.1  546-1208      - |  Genbank   gene          NA      <NA>
                        ID          Dbxref       gbkey      genome    mol_type
               <character> <CharacterList> <character> <character> <character>
  [1] KC662249.1:1..459984    taxon:251749         Src     genomic genomic DNA
  [2]      gene-PGCG_00001                        Gene        <NA>        <NA>
                 nat-host      strain        Name   gene_biotype   locus_tag
              <character> <character> <character>    <character> <character>
  [1] Phaeocystis globosa         16T        <NA>           <NA>        <NA>
  [2]                <NA>        <NA>  PGCG_00001 protein_coding  PGCG_00001
               Parent     product  protein_id
      <CharacterList> <character> <character>
  [1]                        <NA>        <NA>
  [2]                        <NA>        <NA>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Do you see the same?

I made the TxDb object and created the index for the fasta file

> txdb = GenomicFeatures::makeTxDbFromGFF("~/Downloads/sequence.gff3", format = "auto")
> indexFa("~/Downloads/sequence.fasta")

So now I'm ready to run your problematic code, I guess you did

> GenomicFeatures::getPromoterSeq(promoters(txdb), fa)
Error in value[[3L]](cond) :
   record 1 (KC662249.1:-2605--406) was truncated
  file: /Users/ma38727/Downloads/sequence.fasta

? But this is different in detail from the command and error message you report, so maybe you are doing something else?

ADD REPLY
0
Entering edit mode

Hi Martin,

Yes that is the sequence of events that I performed. You are correct that the detail of the error is different i.e. the length of the truncated sequence. I will look into that to see why that is, but regardless how can we fix the error that you/me are receiving?

Is the issue with the fasta file? the gff3? or perhaps the building of the txdb object from the gff3 file?

Thank You!!!

Rina

ADD REPLY
0
Entering edit mode

The problem is that the sequence is circular, so that some 'promotors' overlap the origin, resulting in a request for 'negative' ranges, e.g., from 2605 to 406 before the origin.

Unfortunately, this seems to be a limitation of the DNAString infrastructure. This post Does there exists a getSeq like function for ranges that wrap around end to start ? suggests that one can use a BSgenome object for this, which would require making a BSgenome package following https://www.bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf. This seems like a lot of work, but maybe it is 'just' bookkeeping. You would need to make sure the circular sequence was identified as such during the process of making the package.

I'm not sure if this is the right advice/ Hervé Pagès might have something more correct to say.

ADD REPLY
0
Entering edit mode

This does seem to be a lot of work, but I will start looking into it!

If Herve Pages has any advice, I would appreciate.

ADD REPLY
0
Entering edit mode

And jumping out again... Probably you've moved on, but I wrote

restrictRanges <-
    function(gr, start = NA, end = NA)
{
    ranges(gr) <- restrict(ranges(gr), start, end, keep.all.ranges = TRUE)
    gr
}

getSeq2 <-
    function(x, gr)
{        
    getSeq <- BSgenome::getSeq
    lengths <- seqlengths(gr)[as.character(seqnames(gr))]

    before <- shift(restrictRanges(gr, end = 0L), lengths)
    within <- restrictRanges(gr, start = 1L, end = lengths)
    after <- shift(restrictRanges(gr, start = lengths + 1L), -lengths)

    DNAStringSet(
        paste0(getSeq(dna, before), getSeq(dna, within), getSeq(dna, after))
    )
}

as a 'helper' function to get sequences spanning the origin of a circular sequence. It is not very efficient, but then we're not talking about a huge amount of data.

Importing the gff3 and fasta files, and making sure they had the necessary information about sequence circularity, turned out to be quite a it of work

## import gff & fasta
gff <- rtracklayer::import("sequence.gff3")
dna <- rtracklayer::import("sequence.fasta")

## make seqinfo of gff & dna the same
si = seqinfo(gff)
seqlengths(si) = width(dna)
isCircular(si) = TRUE
seqinfo(gff) <- si
names(dna) <- seqnames(si)
seqinfo(dna) <- si

but then the promoter sequences are easily obtained with

> getSeq2(dna, promoters(gff))
## DNAStringSet object of length 893:
##       width seq
##   [1]  2200 CTTCGTCCGCAGCAAGTGAAGTTATATCTGTTG...GAGAAGGACAACGCCAAGCTACCAGATGCCGA
##   [2]  2200 TTTAAATATATTTAATTAATTATTAAAATATAT...TTCTGTGGTAATATAAATCCAAGATGTTGTAA
##   [3]  2200 TTTAAATATATTTAATTAATTATTAAAATATAT...TTCTGTGGTAATATAAATCCAAGATGTTGTAA
##   [4]  2200 GTTCTTGGTTGTGATAGTAGAGAGAACCTTATT...AATCCGCCCATCAATACAATCAGCAGTGAAGA
##   [5]  2200 GTTCTTGGTTGTGATAGTAGAGAGAACCTTATT...AATCCGCCCATCAATACAATCAGCAGTGAAGA
##   ...   ... ...
## [889]  2200 TAATTAAAAAATAATTGGATTCTTTAGTAAAAA...TTCTTAGCGGTCTTCTTAGCCTCTGGCGACCT
## [890]  2200 AAGAATAGAAAATTTAAAGGTAGGTAATAATAG...ATTACCACACATTCTTTGGGGGGTGGTGCTTG
## [891]  2200 AAGAATAGAAAATTTAAAGGTAGGTAATAATAG...ATTACCACACATTCTTTGGGGGGTGGTGCTTG
## [892]  2200 TAATTAAAAAATAATTGGATTCTTTAGTAAAAA...TTCTTAGCGGTCTTCTTAGCCTCTGGCGACCT
## [893]  2200 TAATTAAAAAATAATTGGATTCTTTAGTAAAAA...TTCTTAGCGGTCTTCTTAGCCTCTGGCGACCT
ADD REPLY
0
Entering edit mode

Hi Martin,

Thank you so much for continuing to work on this! I did try to move on to something else, but I will try this very soon!

Thank you again for all of your help!! It is very appreciated :)

ADD REPLY

Login before adding your answer.

Traffic: 375 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6