Question

GenomicFeatures and custom genomes?

1

Entering edit mode

david.rinker ▴ 10

@davidrinker-13538

Last seen 6.5 years ago

Hi I want to use GenomicFeatures to extract upstream sequences from a genome using gene_IDs

My organism (Anopheles melas) has an assembled genome (fasta) and a gtf file.

It seems like these are all the files I should need, however I am not finding it easy to figure out on how to load these two data types for genomicFeatures to use? So far all I have figured out that I need to make a TxDb from gff; I have no idea what to do with the fasta file.

Could someone (BRIEFLY) tell me which packages and objects I need to use and how they all mesh together?

I have read over the manual as well as the Rsamtools manual but am more confused than anything.

Thank you

genomicfeatures • 1.5k views

ADD COMMENT • link updated 6.5 years ago by Hervé Pagès 16k • written 6.5 years ago by david.rinker ▴ 10

score 2 · Answer 1 · 2017-10-18

Depending on your OS, i'd suggest creating a '2bit' (rtracklayer, TwoBitFile) or indexed fasta (FaFile, Rsamtools) file.

For the gff, I think you can rtracklayer::import() it and get something lighter weight but still useful, or create the TxDb for greater flexibility.

The glue is Biostrings::getSeq(), which will take the 2bit or fasta file as first argument and genomic ranges as second returning the sequences corresponding to the ranges.

score 1 · Answer 2 · 2017-10-20

1

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 11 hours ago

Seattle, WA, United States

Hi,

Using extractUpstreamSeqs() from the GenomicFeatures package:

library(Biostrings)
genome <- readDNAStringSet(fasta_file)

library(GenomicFeatures)
txdb <- makeTxDbFromGFF(gtf_file)
extractUpstreamSeqs(genome, txdb)

If you have a 2-bit file instead of FASTA, replace genome <- readDNAStringSet(fasta_file) with:

library(rtracklayer)
genome <- TwoBitFile(twobit_file)

The 2-bit format supports very efficient random access so this solution avoids loading the full genome sequences in memory. Note however that this won't work if the sequences contain IUAPAC ambiguity letters other than N (the 2-bit format does not support them).

More information and examples in ?extractUpstreamSeqs.

Cheers,

H.

ADD COMMENT • link 6.5 years ago Hervé Pagès 16k

1

Entering edit mode

Just realizing now that extractUpstreamSeqs() doesn't take a DNAStringSet object, sorry. We should add this at some point. So for now you could either use:

library(Rsamtools)
genome <- FaFile(fasta_file)

as Martin suggested, or convert your FASTA file to 2-bit format with something like:

export(readDNAStringSet(fasta_file), "mygenome.2bit")

Note that using FaFile on a compressed FASTA file has been reported to be unreliable on Windows in the past.

Hope this helps,

H.

ADD REPLY • link 6.5 years ago Hervé Pagès 16k

score 0 · Answer 3 · 2017-10-18

Ok, guess this question is a bust.

If the developers are reading, it might be a useful addition to the manual and/or vignette to outline this process. More and more "non-model" organisms are being sequenced and analyzed--those users are inevitably going to want to do what I'm currently trying to do.

If I can figure this out I'll post my summary here to hopefully enlighten others.