Question

Using GenomicFeatures to extract the upstream and downstream region of a transcript

0

Entering edit mode

courtney.stairs • 0

@courtneystairs-13043

Last seen 6.9 years ago

Hi there,

I'm trying to use GenomicFeatures to extract the sequences upstream and downstream of my transcripts. I have loaded my GTF file, however, I'm not sure how to proceed.

I'm trying to extract 150 bp upstream and downstream of each gene as well as the CDS itself. If this isn't possible, I can extract the upstream and downstream regions and stitch them with the CDS. I see there is a extractUpstreamSeqs() and promoters() option, but I'm not quite sure how to use this (I'm terrible at R).

I'm working on a non-model organism that does not have UTRs encoded in the GTF file or else I could likely use a simpler method.

Any help would be greatly appreciated :)

Courtney

genomicfeatures • 7.2k views

ADD COMMENT • link updated 3.7 years ago by Hervé Pagès 16k • written 6.9 years ago by courtney.stairs • 0

1

Entering edit mode

You can try this, but probably won't work for cds.

library(RMariaDB)
library(GenomicFeatures)

TX<-makeTxDbFromUCSC("name.of.reference.genome.on.UCSC","ensGene")
TX
xscript<-transcripts(TX)      # GRanges for transcripts
genes<-genes(TX)      # GRanges for genes
ex<-exonsBy(TX,by="gene") # GRanges for each exon of each gene
cd<-cds(TX) #GRanges for each CDS

promoters(genes,upstream=150,downstream=150)->p1   # up/down 150bp from each gene

promoters(xscript,upstream=150,downstream=150)->p2 # up/down 150bp from each transcript

ADD REPLY • link 6.9 years ago mat149 ▴ 80

score 5 · Answer 1 · 2019-03-07

Hi,

Please be aware that some familiarity with R is highly recommended before you can efficiently use Bioconductor.

It's not clear to me whether you want to extract the upstream and downstream regions (i.e. their coordinates with respect to the reference genome) or the genomic sequences (DNA) corresponding to these regions. If you want the latter you'll need to have access to the genomic sequences of your organism. I'm assuming you want the latter. The genomic sequences are typically made available by the annotation provider as FASTA or 2-bit files. They need to match exactly the assembly that the annotations in your GTF file are based on.

First, use makeTxDbFromGFF() to import your GTF file as a TxDb object:

txdb <- makeTxDbFromGFF("path/to/GTF/file")

Note that makeTxDbFromGFF() only imports gene, transcript, exon, and CDS locations. Strictly speaking that's all what is needed to represent gene models. This means that even if your GTF file had entries for UTRs, makeTxDbFromGFF() would ignore them (UTRs can easily be inferred from exons and CDS, which is exactly what fiveUTRsByTranscript(txdb) and threeUTRsByTranscript(txdb) do for you).

Then to extract the 150 bp upstream sequences (with respect to the genes):

genes <- genes(txdb)
gene_upstream150_seqs <- extractUpstreamSeqs(x, genes, width=150)

where x is typically a BSgenome object containing the genomic sequences of your assembly, or a TwoBitFile or FaFile object (see ?extractUpstreamSeqs for more information). Making a BSgenome data package for your organism is a whole topic per se so I won't describe it here. See the "How to forge a BSgenome data package" vignette in the BSgenome software package if you are interested.

gene_upstream150_seqs will be a named DNAStringSet object parallel to genes i.e. it will contain one upstream sequence per gene in genes. Also the names on it will be the same as the names on genes. In addition it will carry some metadata of interest that you can get with mcols(gene_upstream150_seqs).

Unfortunately we don't have an extractDownstreamSeqs() function (was never requested, should be easy to add though), so a little bit more work is required in order to extract the 150 bp downstream sequences:

gene_downstream150 <- trim(suppressWarnings(flank(genes, width=150, start=FALSE)))
gene_downstream150_seqs <- getSeq(x, gene_downstream150)
mcols(gene_downstream150_seqs) <- mcols(gene_upstream150_seqs)

gene_downstream150_seqs will also be a named DNAStringSet object parallel to genes.

Finally to extract the CDS sequences:

cds <- cdsBy(txdb, by="tx")
cds_seqs <- extractTranscriptSeqs(x, cds)

cds_seqs will again be a named DNAStringSet object but this time it won't necessarily be parallel to genes. That's because a given gene can have zero or more than one CDS sequence. The names on cds_seqs will be the internal ids of the corresponding transcripts. The corresponding transcript names and gene ids can be put on the object as metadata columns with:

tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
mapping <- mcols(tx)
mapping$gene_id <- as.character(mapping$gene_id)
rownames(mapping) <- mapping$tx_id
mcols(cds_seqs) <- mapping[names(cds_seqs), ]

Hope this will help get you started.

Cheers,

H.