Hi! I am trying to get the promoter sequences of a less well known organism and I am using a simple script I have used many times for model organisms.
I am using a GFF file downloaded from NCBI, and I have successfully made the txdb
object
But since I don't have a BSgenome for this organism, the vignette states that I can use a FaFile
instead, but here is where I am having trouble with Rsamtools
. Which function indexes the FASTA
file? scanFaIndex
?
#use gff from NCBI
gffFile <- read.gff("MINUs1PgV_16T.gff3", na.strings = c(".", "?"), GFF3 = TRUE)
#Save the GFF in a TxDb object
TxDb_PgV <- makeTxDbFromGFF(gffFile, format = c("auto"))
#Save the TxDb object just in case
saveDb(TxDb_PgV, file="TxDb_PgV_16T.sqlite")
#rename for convenience
txdb <- TxDb_PgV
#From here not working
#Create the Fafile from a fasta I downloaded from ncbi
genome <- FaFile("16TTT.fa")
Index2FASTA <- scanFaIndex(genome))
#get the promoter seq for all of the genes in the
PromoterSeq <- GenomicFeatures::getPromoterSeq(PromoterRegions,
genome, use.names = TRUE)
#output the file
Biostrings::writeXStringSet(PromoterSeq, file = "promoterSeq.txt")
Thanks Martin!
Hi James, I did read the help page for the functions, but it is not clear to me if the
scanFAIndex
function creates and index or does it just recognizes an index input in the same location...This is what I am asking.
It says
Which seems pretty clear to me that the file has to be indexed first. Is there something specific in that description that you can point to that is unclear? Maybe it should be updated.
OK James.
I am sorry, I am just trying to figure out which line in the script is not working, because I am still having trouble running the code.
Thank you for your help
One other thing you should always include is the output you get from R when something doesn't work. It's difficult to diagnose an error if we can't see what it was.
Yes, you are right. I didn't think the error was so informative.
I should have been clearer. You are using data that is mysterious in origin (saying you got it from NCBI doesn't tell us much), and you are not showing the code you use and the error. If people can't see exactly where the error crops up, and can't get the data to reproduce, then all we are left with is making guesses.
The error message is actually very informative. It's telling you that your data is truncated, which means exactly what it says - R was expecting the data to be of a particular length, is giving an error when that expectation isn't met.
Hi James,
Data from NCBI
Line of code causing error message:
Error produced:
I would be interested to know how to fix the problem. I understand the expectation isn't met, so what can I do to resolve it. As this is a support forum I was hoping maybe someone ran into the same issue and can help fix it.
Good day
It could be that the file is incomplete (is it the same length, e.g., as the version at NCBI?) or that the file is actually incorrect (you'd have to look at the file, maybe using a plain text editor or
readLines()
plus other manipulations, to see if the record is somehow 'truncated', different in some way from other records in the file).It could also be that you're on Windows, and that the file is very large?
But when I go to your 'Data from NCBI' link, I don't really see which file you are downloading?
Hi Martin!
Using the link I sent, I selected "Send to" --> "Complete record" --> "File" --> gff3.
I did not make any changes to the file, but yes I am using Windows, but the file is only 151 KB.
But how do you get your fasta file?
The same way. "Send to" --> "Complete record" --> "File" --> FASTA
Jumping out a bit to get some more space...
I downloaded the fasta file and it seems to be intact and with a single sequence
Likewise for the gff3 file
Do you see the same?
I made the TxDb object and created the index for the fasta file
So now I'm ready to run your problematic code, I guess you did
? But this is different in detail from the command and error message you report, so maybe you are doing something else?
Hi Martin,
Yes that is the sequence of events that I performed. You are correct that the detail of the error is different i.e. the length of the truncated sequence. I will look into that to see why that is, but regardless how can we fix the error that you/me are receiving?
Is the issue with the fasta file? the gff3? or perhaps the building of the
txdb
object from the gff3 file?Thank You!!!
Rina
The problem is that the sequence is circular, so that some 'promotors' overlap the origin, resulting in a request for 'negative' ranges, e.g., from 2605 to 406 before the origin.
Unfortunately, this seems to be a limitation of the DNAString infrastructure. This post Does there exists a getSeq like function for ranges that wrap around end to start ? suggests that one can use a BSgenome object for this, which would require making a BSgenome package following https://www.bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf. This seems like a lot of work, but maybe it is 'just' bookkeeping. You would need to make sure the circular sequence was identified as such during the process of making the package.
I'm not sure if this is the right advice/ Hervé Pagès might have something more correct to say.
This does seem to be a lot of work, but I will start looking into it!
If Herve Pages has any advice, I would appreciate.
And jumping out again... Probably you've moved on, but I wrote
as a 'helper' function to get sequences spanning the origin of a circular sequence. It is not very efficient, but then we're not talking about a huge amount of data.
Importing the gff3 and fasta files, and making sure they had the necessary information about sequence circularity, turned out to be quite a it of work
but then the promoter sequences are easily obtained with
Hi Martin,
Thank you so much for continuing to work on this! I did try to move on to something else, but I will try this very soon!
Thank you again for all of your help!! It is very appreciated :)