Question

ensembldb: ensDbFromGtf error with ensembl version 74

0

Entering edit mode

stianlagstad ▴ 90

@stianlagstad-9723

Last seen 5.0 years ago

In reference to C: GenomicFeatures: Select exons by transcript. Any faster option than exonsBy?. I've tried to create a ensdb object using ensDbFromGtf() with the gtf file from here, but I'm unsuccessful:

> edb <- ensDbFromGtf(gtf = "/home/stian/Desktop/Homo_sapiens.GRCh37.74.gtf")
Importing GTF file...OK
Processing metadata...OK
Processing genes...
Attribute availability:
  o gene_id... OK
  o gene_name... OK
  o entrezid... Nope
  o gene_biotype... OK
OK
Processing transcripts...
Attribute availability:
  o transcript_id... OK
  o gene_id... OK
  o source... OK
OK
Processing exons...OK
Processing chromosomes...Fetch seqlengths from ensembl, dataset hsapiens_gene_ensembl version 74...OK
OK
Generating index...OK
  -------------
Verifying validity of the information in the database:
Checking transcripts...OK
Checking exons...OK
Warning message:
In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :
   I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!

Can anyone assist me?

ensembldb • 2.9k views

ADD COMMENT • link updated 8.8 years ago by Johannes Rainer ★ 2.1k • written 8.8 years ago by stianlagstad ▴ 90

score 1 · Accepted Answer · 2016-05-18

1

Entering edit mode

Johannes Rainer ★ 2.1k

@johannes-rainer-6987

Last seen 4 months ago

Italy

Hi,

to me that looks like it should. Are you concerned by the warning? The point is that in the GTF files from Ensembl the NCBI Entrezgene ID is not provided, and that's why you get the warning. It just means that the corresponding database column will be empty. All the other data is however available. As long as you don't want to use the `EntrezidFilter` to retrieve specific data from the database all will be fine.

Note however that the function you've used just created the SQLite database with the data. You can now load that data using the EnsDb constructor:

 edb
[1] "./Homo_sapiens.GRCh37.74.sqlite"

## Interface the database, i.e. create an EnsDb object
edb <- EnsDb(edb)

> edb
EnsDb for Ensembl:
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.0.1
|Creation time: Wed May 18 12:39:34 2016
|ensembl_version: 74
|ensembl_host: unknown
|Organism: Homo_sapiens
|genome_build: GRCh37
|DBSCHEMAVERSION: 1.0
|source_file: Homo_sapiens.GRCh37.74.gtf.gz
| No. of genes: 63677.
| No. of transcripts: 215170.

Or alternatively you can generate an EnsDb package:

## That's the SQLite database you've just generated
DB <- "./Homo_sapiens.GRCh37.74.sqlite"
## Generate the package
makeEnsembldbPackage(author="stianlagstad", ensdb=DB, version="1.0.0", maintainer="stianlagstad@somewhere.org")
## You'll see that you have now a folder "EnsDb.Hsapiens.v74" in the present directory, which you can build and install (R CMD build EnsDb.* and R CMD INSTALL)

hope that helps,

cheers, jo

ADD COMMENT • link 8.8 years ago Johannes Rainer ★ 2.1k

0

Entering edit mode

Thanks! I assumed that something had gone wrong since edb was a character vector after the call to ensDbFromGtf. My bad for not reading the instructions thoroughly.

ADD REPLY • link 8.8 years ago stianlagstad ▴ 90

1

Entering edit mode

No problem ;)

Let me know if you have other problems, or, even better, suggestions to improve.

cheers, jo

ADD REPLY • link 8.8 years ago Johannes Rainer ★ 2.1k

0

Entering edit mode

I do have a little problem there might already be a solution for:

I use exonsBy to get the exons in a specific transcript. I wish to plot this transcript structure, and I wish to show which parts of the transcripts are the UTRs. To do this I fetch both tx_cds_seq_start and tx_cds_seq_end. These two coordinates are often within one of the exons (/ranges) I've fetched with exonsBy. Example:

library(ensembldb)

# edb <- ensDbFromGtf(gtf = "/home/stian/Desktop/Homo_sapiens.GRCh37.74.gtf")
edb <- "/home/stian/Desktop/Homo_sapiens.GRCh37.74.sqlite"
edb <- ensembldb::EnsDb(edb)

transcriptID <- "ENST00000398958"

# Get GRangesList of exons
transcriptExons <- ensembldb::exonsBy(
  edb,
  filter = list(TxidFilter(transcriptID)),
  columns = list("gene_id", "gene_name", "tx_cds_seq_start", "tx_cds_seq_end"))
# Unlist to get GRanges
transcriptExons <- unlist(transcriptExons)
# Sort
transcriptExons <- sort(transcriptExons)

# Get cdc_start and end
cds_start <- mcols(transcriptExons)$tx_cds_seq_start[[1]]
cds_end <- mcols(transcriptExons)$tx_cds_seq_end[[1]]

# Both of these are within ranges in transcriptExons:
cds_start > start(transcriptExons) & cds_start < end(transcriptExons)
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
cds_end > start(transcriptExons) & cds_end < end(transcriptExons)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

I want the CDC and UTR ranges separate: To split range 4 and 12 so that I'll get 14 ranges in total instead of the 12 I get by default. Is this possible, or do I have to manually do this?

Right now I do it manually with this:

splitOnUtr <- function(gr, cds_start, cds_end) {
  # takes a granges with exons in ranges. if the cdc_start / cdc_end positions are within exons, the exons are split into two ranges

  # find exon containing start
  cds_start_exon <- cds_start > start(gr) & cds_start < end(gr)

  if (any(cds_start_exon)) {
    # Create two copies of the exon in which the cds starts
    first <- gr[cds_start_exon]
    second <- gr[cds_start_exon]
    # Update ranges for the new objects
    end(first) <- cds_start-1
    start(second) <- cds_start
    # Remove the original range
    gr <- gr[!cds_start_exon]
    # Add new ranges
    gr <- append(gr, first)
    gr <- append(gr, second)
    # Sort again
    gr <- sort(gr)
  }

  # find exon containing end
  cds_end_exon <- cds_end > start(gr) & cds_end < end(gr)

  if (any(cds_end_exon)) {
    # Create two copies of the exon in which the cds ends
    first <- gr[cds_end_exon]
    second <- gr[cds_end_exon]
    # Update ranges for the new objects
    end(first) <- cds_end-1
    start(second) <- cds_end
    # Remove the original range
    gr <- gr[!cds_end_exon]
    # Add new ranges
    gr <- append(gr, first)
    gr <- append(gr, second)
    # Sort again
    gr <- sort(gr)
  }

  gr
}

ADD REPLY • link 8.8 years ago stianlagstad ▴ 90

1

Entering edit mode

On one hand you could use the getGeneRegionTrackForGviz method and use Gviz to plot the transcript. Have also a look at the resulting data.frame, eventually that's all you want to have anyway, as it separates 3' UTR, 5' UTR and CDS.

> gTrack <- getGeneRegionTrackForGviz(edb, filter=TxidFilter(transcriptID))
> gTrack
GRanges object with 14 ranges and 6 metadata columns:
       seqnames               ranges strand |        feature            gene
          <Rle>            <IRanges>  <Rle> |    <character>     <character>
   [1]        1 [28832455, 28832596]      + |           utr5 ENSG00000180198
   [2]        1 [28834640, 28834672]      + |           utr5 ENSG00000180198
   [3]        1 [28835342, 28835417]      + |           utr5 ENSG00000180198
   [4]        1 [28856370, 28856378]      + |           utr5 ENSG00000180198
   [5]        1 [28864520, 28865708]      + |           utr3 ENSG00000180198
   ...      ...                  ...    ... .            ...             ...
  [10]        1 [28861770, 28861892]      + | protein_coding ENSG00000180198
  [11]        1 [28862383, 28862538]      + | protein_coding ENSG00000180198
  [12]        1 [28862774, 28862893]      + | protein_coding ENSG00000180198
  [13]        1 [28863259, 28863411]      + | protein_coding ENSG00000180198
  [14]        1 [28864344, 28864519]      + | protein_coding ENSG00000180198
                  exon   exon_rank      transcript      symbol
           <character> <character>     <character> <character>
   [1] ENSE00001461695           1 ENST00000398958        RCC1
   [2] ENSE00003478538           2 ENST00000398958        RCC1
   [3] ENSE00001461694           3 ENST00000398958        RCC1
   [4] ENSE00003621347           4 ENST00000398958        RCC1
   [5] ENSE00001378938          12 ENST00000398958        RCC1
   ...             ...         ...             ...         ...
  [10] ENSE00003700481           8 ENST00000398958        RCC1
  [11] ENSE00003514701           9 ENST00000398958        RCC1
  [12] ENSE00001273554          10 ENST00000398958        RCC1
  [13] ENSE00003467600          11 ENST00000398958        RCC1
  [14] ENSE00001378938          12 ENST00000398958        RCC1
  -------
  seqinfo: 1 sequence from GRCh37 genome

> library(Gviz)
Loading required package: grid
> plotTracks(GeneRegionTrack(gTrack))

Alternatively, you could also get the CDS and UTR separately from the db:

> txf <- TxidFilter(transcriptID)
> txCds <- cdsBy(edb, by="tx", filter=txf)
> txCds
GRangesList object of length 1:
$ENST00000398958
GRanges object with 9 ranges and 2 metadata columns:
      seqnames               ranges strand |         exon_id   exon_rank
         <Rle>            <IRanges>  <Rle> |     <character> <character>
  [1]        1 [28856379, 28856451]      + | ENSE00003621347           4
  [2]        1 [28858315, 28858502]      + | ENSE00001273582           5
  [3]        1 [28858683, 28858862]      + | ENSE00003695605           6
  [4]        1 [28861562, 28861658]      + | ENSE00003696436           7
  [5]        1 [28861770, 28861892]      + | ENSE00003700481           8
  [6]        1 [28862383, 28862538]      + | ENSE00003514701           9
  [7]        1 [28862774, 28862893]      + | ENSE00001273554          10
  [8]        1 [28863259, 28863411]      + | ENSE00003467600          11
  [9]        1 [28864344, 28864519]      + | ENSE00001378938          12

-------
seqinfo: 1 sequence from GRCh37 genome
> tx3Utr <- threeUTRsByTranscript
threeUTRsByTranscript
> tx3Utr <- threeUTRsByTranscript(edb, filter=txf)
> tx3Utr
GRangesList object of length 1:
$ENST00000398958
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand |         exon_id   exon_rank
         <Rle>            <IRanges>  <Rle> |     <character> <character>
  [1]        1 [28864520, 28865708]      + | ENSE00001378938          12

-------
seqinfo: 1 sequence from GRCh37 genome

> tx5Utr <- fiveUTRsByTranscript(edb, filter=txf)
> tx5Utr
GRangesList object of length 1:
$ENST00000398958
GRanges object with 4 ranges and 2 metadata columns:
      seqnames               ranges strand |         exon_id   exon_rank
         <Rle>            <IRanges>  <Rle> |     <character> <character>
  [1]        1 [28832455, 28832596]      + | ENSE00001461695           1
  [2]        1 [28834640, 28834672]      + | ENSE00003478538           2
  [3]        1 [28835342, 28835417]      + | ENSE00001461694           3
  [4]        1 [28856370, 28856378]      + | ENSE00003621347           4

-------
seqinfo: 1 sequence from GRCh37 genome

ADD REPLY • link 8.8 years ago Johannes Rainer ★ 2.1k

1

Entering edit mode

Hi again, I think I've stumbled across a little bug. When using exonsBy, even with "tx_id" in the list given to the columns parameter, I do not see a column in the metadata that says tx_id. Example:

library(ensembldb)

edb <- ensDbFromGtf(gtf = "/home/stian/Desktop/Homo_sapiens.GRCh37.74.gtf")
edb <- ensembldb::EnsDb(edb)

transcripts <- ensembldb::exonsBy(
  edb,
  filter = list(ensembldb::GeneidFilter("ENSG00000180198")),
  columns = list("gene_id", "gene_name", "tx_id", "tx_cds_seq_start", "tx_cds_seq_end", "exon_id", "exon_rank"))
# Warning messages:
# 1: In cleanColumns(x, columns) :
#   Columns 'exon_rank' are not valid and have been removed
# 2: In cleanColumns(x, columns) :
#   Columns 'exon_rank' are not valid and have been removed
# 3: In cleanColumns(x, columns) :
#   Columns 'exon_rank' are not valid and have been removed

# But mcols(transcripts[[1]])$exon_rank exists, and in addition I've gotten mcols(transcripts[[1]])$exon_rank.1
mcols(transcripts[[1]])$exon_rank
# [1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9" 
mcols(transcripts[[1]])$exon_rank.1
# [1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9" 

# My problem however, is that mcols(transcripts[[1]])$tx_id is not there, even though "tx_id" was provided to columns in the call to exonsBy.
mcols(transcripts[[1]])$tx_id
# NULL

I don't get the warnings if I leave "exon_rank" out of the list passed to columns, bit I found it weird. I would like the tx_id to appear though.

ADD REPLY • link 8.8 years ago stianlagstad ▴ 90

1

Entering edit mode

Hi Stian,

thanks, that's indeed a bug. tx_id should be there if you specify that. I'll fix that. Regarding the issue with exon_rank, the problem there is that while in the result a column called exon_rank is displayed, the actual database column is exon_idx. I'll think of a fix there. For now you should specify "exon_idx" instead of "exon_rank" in the columns attribute, but I think of a fix for that too.

Thanks, jo

ADD REPLY • link 8.8 years ago Johannes Rainer ★ 2.1k

0

Entering edit mode

Thank you for the quick response. If I include "exon_idx" in the list given to columns, I don't end up getting a column in the metadata named "exon_idx". I do still get the "exon_rank" metadata column though, without asking for it.

Should I create issues in the github repository if I find more bugs?

ADD REPLY • link 8.8 years ago stianlagstad ▴ 90

1

Entering edit mode

It's fixed in version 1.4.3 (should be available in release-3.3 of Bioconductor). I still have to fix it in the current devel.

I fixed also another bug that became apparent above: exon_rank is stored as a character if an EnsDb is created from a GTF. I fixed that too, thus, with the new version of the package the EnsDb database stores exon_idx (exon_rank) as integer, and the exons are really returned in the correct ordering (not 1, 10, 11, 2 like above).

Yes, posting issues in my github repo would be nice.

Thanks again!

ADD REPLY • link 8.8 years ago Johannes Rainer ★ 2.1k

0

Entering edit mode

Thanks again, that's very useful!

ADD REPLY • link 8.8 years ago stianlagstad ▴ 90