> edb <- ensDbFromGtf(gtf = "/home/stian/Desktop/Homo_sapiens.GRCh37.74.gtf")
Importing GTF file...OK
Processing metadata...OK
Processing genes...
Attribute availability:
o gene_id... OK
o gene_name... OK
o entrezid... Nope
o gene_biotype... OK
OK
Processing transcripts...
Attribute availability:
o transcript_id... OK
o gene_id... OK
o source... OK
OK
Processing exons...OK
Processing chromosomes...Fetch seqlengths from ensembl, dataset hsapiens_gene_ensembl version 74...OK
OK
Generating index...OK
-------------
Verifying validity of the information in the database:
Checking transcripts...OK
Checking exons...OK
Warning message:
In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism, :
I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
to me that looks like it should. Are you concerned by the warning? The point is that in the GTF files from Ensembl the NCBI Entrezgene ID is not provided, and that's why you get the warning. It just means that the corresponding database column will be empty. All the other data is however available. As long as you don't want to use the `EntrezidFilter` to retrieve specific data from the database all will be fine.
Note however that the function you've used just created the SQLite database with the data. You can now load that data using the EnsDb constructor:
edb
[1] "./Homo_sapiens.GRCh37.74.sqlite"
## Interface the database, i.e. create an EnsDb object
edb <- EnsDb(edb)
> edb
EnsDb for Ensembl:
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.0.1
|Creation time: Wed May 18 12:39:34 2016
|ensembl_version: 74
|ensembl_host: unknown
|Organism: Homo_sapiens
|genome_build: GRCh37
|DBSCHEMAVERSION: 1.0
|source_file: Homo_sapiens.GRCh37.74.gtf.gz
| No. of genes: 63677.
| No. of transcripts: 215170.
Or alternatively you can generate an EnsDb package:
## That's the SQLite database you've just generated
DB <- "./Homo_sapiens.GRCh37.74.sqlite"
## Generate the package
makeEnsembldbPackage(author="stianlagstad", ensdb=DB, version="1.0.0", maintainer="stianlagstad@somewhere.org")
## You'll see that you have now a folder "EnsDb.Hsapiens.v74" in the present directory, which you can build and install (R CMD build EnsDb.* and R CMD INSTALL)
Thanks! I assumed that something had gone wrong since edb was a character vector after the call to ensDbFromGtf. My bad for not reading the instructions thoroughly.
I do have a little problem there might already be a solution for:
I use exonsBy to get the exons in a specific transcript. I wish to plot this transcript structure, and I wish to show which parts of the transcripts are the UTRs. To do this I fetch both tx_cds_seq_start and tx_cds_seq_end. These two coordinates are often within one of the exons (/ranges) I've fetched with exonsBy. Example:
library(ensembldb)
# edb <- ensDbFromGtf(gtf = "/home/stian/Desktop/Homo_sapiens.GRCh37.74.gtf")
edb <- "/home/stian/Desktop/Homo_sapiens.GRCh37.74.sqlite"
edb <- ensembldb::EnsDb(edb)
transcriptID <- "ENST00000398958"
# Get GRangesList of exons
transcriptExons <- ensembldb::exonsBy(
edb,
filter = list(TxidFilter(transcriptID)),
columns = list("gene_id", "gene_name", "tx_cds_seq_start", "tx_cds_seq_end"))
# Unlist to get GRanges
transcriptExons <- unlist(transcriptExons)
# Sort
transcriptExons <- sort(transcriptExons)
# Get cdc_start and end
cds_start <- mcols(transcriptExons)$tx_cds_seq_start[[1]]
cds_end <- mcols(transcriptExons)$tx_cds_seq_end[[1]]
# Both of these are within ranges in transcriptExons:
cds_start > start(transcriptExons) & cds_start < end(transcriptExons)
# [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
cds_end > start(transcriptExons) & cds_end < end(transcriptExons)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
I want the CDC and UTR ranges separate: To split range 4 and 12 so that I'll get 14 ranges in total instead of the 12 I get by default. Is this possible, or do I have to manually do this?
Right now I do it manually with this:
splitOnUtr <- function(gr, cds_start, cds_end) {
# takes a granges with exons in ranges. if the cdc_start / cdc_end positions are within exons, the exons are split into two ranges
# find exon containing start
cds_start_exon <- cds_start > start(gr) & cds_start < end(gr)
if (any(cds_start_exon)) {
# Create two copies of the exon in which the cds starts
first <- gr[cds_start_exon]
second <- gr[cds_start_exon]
# Update ranges for the new objects
end(first) <- cds_start-1
start(second) <- cds_start
# Remove the original range
gr <- gr[!cds_start_exon]
# Add new ranges
gr <- append(gr, first)
gr <- append(gr, second)
# Sort again
gr <- sort(gr)
}
# find exon containing end
cds_end_exon <- cds_end > start(gr) & cds_end < end(gr)
if (any(cds_end_exon)) {
# Create two copies of the exon in which the cds ends
first <- gr[cds_end_exon]
second <- gr[cds_end_exon]
# Update ranges for the new objects
end(first) <- cds_end-1
start(second) <- cds_end
# Remove the original range
gr <- gr[!cds_end_exon]
# Add new ranges
gr <- append(gr, first)
gr <- append(gr, second)
# Sort again
gr <- sort(gr)
}
gr
}
On one hand you could use the getGeneRegionTrackForGviz method and use Gviz to plot the transcript. Have also a look at the resulting data.frame, eventually that's all you want to have anyway, as it separates 3' UTR, 5' UTR and CDS.
Hi again, I think I've stumbled across a little bug. When using exonsBy, even with "tx_id" in the list given to the columns parameter, I do not see a column in the metadata that says tx_id. Example:
library(ensembldb)
edb <- ensDbFromGtf(gtf = "/home/stian/Desktop/Homo_sapiens.GRCh37.74.gtf")
edb <- ensembldb::EnsDb(edb)
transcripts <- ensembldb::exonsBy(
edb,
filter = list(ensembldb::GeneidFilter("ENSG00000180198")),
columns = list("gene_id", "gene_name", "tx_id", "tx_cds_seq_start", "tx_cds_seq_end", "exon_id", "exon_rank"))
# Warning messages:
# 1: In cleanColumns(x, columns) :
# Columns 'exon_rank' are not valid and have been removed
# 2: In cleanColumns(x, columns) :
# Columns 'exon_rank' are not valid and have been removed
# 3: In cleanColumns(x, columns) :
# Columns 'exon_rank' are not valid and have been removed
# But mcols(transcripts[[1]])$exon_rank exists, and in addition I've gotten mcols(transcripts[[1]])$exon_rank.1
mcols(transcripts[[1]])$exon_rank
# [1] "1" "10" "11" "2" "3" "4" "5" "6" "7" "8" "9"
mcols(transcripts[[1]])$exon_rank.1
# [1] "1" "10" "11" "2" "3" "4" "5" "6" "7" "8" "9"
# My problem however, is that mcols(transcripts[[1]])$tx_id is not there, even though "tx_id" was provided to columns in the call to exonsBy.
mcols(transcripts[[1]])$tx_id
# NULL
I don't get the warnings if I leave "exon_rank" out of the list passed to columns, bit I found it weird. I would like the tx_id to appear though.
thanks, that's indeed a bug. tx_id should be there if you specify that. I'll fix that. Regarding the issue with exon_rank, the problem there is that while in the result a column called exon_rank is displayed, the actual database column is exon_idx. I'll think of a fix there. For now you should specify "exon_idx" instead of "exon_rank" in the columns attribute, but I think of a fix for that too.
Thank you for the quick response. If I include "exon_idx" in the list given to columns, I don't end up getting a column in the metadata named "exon_idx". I do still get the "exon_rank" metadata column though, without asking for it.
Should I create issues in the github repository if I find more bugs?
It's fixed in version 1.4.3 (should be available in release-3.3 of Bioconductor). I still have to fix it in the current devel.
I fixed also another bug that became apparent above: exon_rank is stored as a character if an EnsDb is created from a GTF. I fixed that too, thus, with the new version of the package the EnsDb database stores exon_idx (exon_rank) as integer, and the exons are really returned in the correct ordering (not 1, 10, 11, 2 like above).
Yes, posting issues in my github repo would be nice.
Thanks! I assumed that something had gone wrong since edb was a character vector after the call to ensDbFromGtf. My bad for not reading the instructions thoroughly.
No problem ;)
Let me know if you have other problems, or, even better, suggestions to improve.
cheers, jo
I do have a little problem there might already be a solution for:
I use exonsBy to get the exons in a specific transcript. I wish to plot this transcript structure, and I wish to show which parts of the transcripts are the UTRs. To do this I fetch both tx_cds_seq_start and tx_cds_seq_end. These two coordinates are often within one of the exons (/ranges) I've fetched with exonsBy. Example:
I want the CDC and UTR ranges separate: To split range 4 and 12 so that I'll get 14 ranges in total instead of the 12 I get by default. Is this possible, or do I have to manually do this?
Right now I do it manually with this:
On one hand you could use the
getGeneRegionTrackForGviz
method and use Gviz to plot the transcript. Have also a look at the resulting data.frame, eventually that's all you want to have anyway, as it separates 3' UTR, 5' UTR and CDS.Alternatively, you could also get the CDS and UTR separately from the db:
Hi again, I think I've stumbled across a little bug. When using exonsBy, even with "tx_id" in the list given to the columns parameter, I do not see a column in the metadata that says tx_id. Example:
I don't get the warnings if I leave "exon_rank" out of the list passed to columns, bit I found it weird. I would like the tx_id to appear though.
Hi Stian,
thanks, that's indeed a bug. tx_id should be there if you specify that. I'll fix that. Regarding the issue with exon_rank, the problem there is that while in the result a column called exon_rank is displayed, the actual database column is exon_idx. I'll think of a fix there. For now you should specify "exon_idx" instead of "exon_rank" in the columns attribute, but I think of a fix for that too.
Thanks, jo
Thank you for the quick response. If I include "exon_idx" in the list given to columns, I don't end up getting a column in the metadata named "exon_idx". I do still get the "exon_rank" metadata column though, without asking for it.
Should I create issues in the github repository if I find more bugs?
It's fixed in version 1.4.3 (should be available in release-3.3 of Bioconductor). I still have to fix it in the current devel.
I fixed also another bug that became apparent above: exon_rank is stored as a character if an EnsDb is created from a GTF. I fixed that too, thus, with the new version of the package the EnsDb database stores exon_idx (exon_rank) as integer, and the exons are really returned in the correct ordering (not 1, 10, 11, 2 like above).
Yes, posting issues in my github repo would be nice.
Thanks again!
Thanks again, that's very useful!