Question: Issue creating a genomic state object in derfinder from an Ensembl GTF file
0
2.4 years ago by
United States

Hi,

I received the following question by Andrew Lamb which I'm posting here since it might be of interest to other users. I'll answer it shortly. He is using the following GTF file ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz.

Best,

Leonardo

-----

The only issue is that we're still having trouble annotating it using GRCH38 gtf file. Our RNASeq data was aligned using this version so we would like to be able to annotate with it as well. The issue comes when trying the makeGenomicState function, although it may be the issue was with the makeTxDbFromGFF function. Please see below:

library(GenomicFeatures)
library(derfinder)

gtf_file <-'/udd/reala/reference_files/DE/Homo_sapiens.GRCh38.81.gtf'
TXDB <- makeTxDbFromGFF(gtf_file, format = 'gtf')

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: RSQLite::dbGetPreparedQuery() is deprecated, please switch to DBI::dbGetQuery(params = bind.data).
2: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular
3: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end
4: Named parameters not used in query: internal_id, name, chrom, strand, start, end
5: Named parameters not used in query: internal_id, name, chrom, strand, start, end
6: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id
7: Named parameters not used in query: gene_id, internal_tx_id

genomic_state <- makeGenomicState(TXDB)

extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'TXID'. Please use the keys method to see a listing of valid arguments.

I also tried:

genomic_state <- makeGenomicState(TXDB, style = "chrsStyle")

extendedMapSeqlevels: the 'style' chrsStyle is currently not supported for the 'species' homo_sapiens in GenomeInfoDb. Check valid naming styles by running GenomeInfoDb::genomeStyles(species). If it's not present, consider adding your genome by following the information at http://www.bioconductor.org/packages/release/bioc/vignettes/GenomeInfoDb/inst/doc/Accept-organism-for-GenomeInfoDb.pdf
'select()' returned 1:1 mapping between keys and columns
Error: logical subscript contains NAs

Any thought on what is happening here?

genomicfeatures derfinder • 1.4k views
modified 2.4 years ago • written 2.4 years ago by Leonardo Collado Torres640
Answer: Issue creating a genomic state object in derfinder from an Ensembl GTF file
1
2.4 years ago by
United States

Hi Andrew,

It took a bit of effort, but I found what was the problem. The initial error message was not informative and in the end, the problem is that the chromosome lengths are missing when using GenomicFeatures::makeTxDbFromGFF(). This other thread A: Setting genome information on a TxDb object helped me figure out how to specify the chromosome lengths since you currently can't do so for a txdb object. In your particular case since you are using Ensembl's notation for chromosomes, you are better off using style = 'Ensembl', currentStyle = 'Ensembl' so that the chromosome names won't be modified (by default derfinder tries to use UCSC names). You also have many chromosomes in your GTF file and will need to figure out the correct lengths for them, or drop the non-canonical ones in case you don't need the extra chromosomes/contigs.

## Un-evaluated code

library('GenomicFeatures')
library('derfinder')
library('devtools')

txdb <- makeTxDbFromGFF('ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz')

## Check chr names
seqlevels(txdb)

## Keep just chrs 1 to 22, X, Y, and MT
txdb <- keepSeqlevels(txdb, c(1:22, 'X', 'Y', 'MT'))
seqlevels(txdb)

## For speed testing: just use a portion of the txdb data
test_txdb <- keepSeqlevels(txdb, c('Y', 'MT'))

## Attempt to create genomic state object
gs <- makeGenomicState(test_txdb, style = 'Ensembl', currentStyle = 'Ensembl')
traceback()

## Issue is with the seqlengths being missing
seqlengths(test_txdb)

## Adding the sequence lengths doesn't work
hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38")
new_info <- hg38_chrominfo$UCSC_seqlength[match(seqlevels(test_txdb), mapSeqlevels(hg38_chrominfo$UCSC_seqlevel, 'Ensembl'))]
names(new_info) <- names(seqlengths(test_txdb))
seqlengths(test_txdb) <- new_info

## Using the approach described in https://support.bioconductor.org/p/67680/#67766
## That is, loading the GFF as a GRanges object and setting the lengths there
## before using makeTxDbFromGRanges()
library('rtracklayer')
gr <- import('ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz')

## Make a small GRanges object for testing -- pruning.mode is not used on Bioc-release, just Bioc-devel
gr_small <- keepSeqlevels(gr, c('Y', 'MT'), pruning.mode = 'tidy')

## Set the chromosome lengths, genome
seqlengths(gr_small) <- new_info
genome(gr_small) <- 'hg38'
txdb2 <- makeTxDbFromGRanges(gr_small)

## Now create the genomic state object
gs <- makeGenomicState(txdb2, style = 'Ensembl', currentStyle = 'Ensembl')
gs

## Reproducibility information
print('Reproducibility information:')
Sys.time()
options(width = 120)
session_info()

## Evaluated code

In derfinder 1.8.1 (release) and 1.9.7 (devel), the user will now get an error message if the chromosome lengths are missing. The help for the 'txdb' argument also points users to this thread. The two new versions will take 1-2 days to be available via biocLite(), but if you fix your txdb object you can continue right away since the new derfinder versions just show a more informative error and have more info on the docs.

Best,

Leonardo

Thanks for all your help Leo.

library(GenomicFeatures)
library(derfinder)
library(devtools)
library(rtracklayer)
gr <- import('ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz')
gr_small <- keepSeqlevels(gr, c('Y', 'MT'), pruning.mode = 'tidy')

Error in keepSeqlevels(gr, c("Y", "MT"), pruning.mode = "tidy") :
unused argument (pruning.mode = "tidy")

We don't have the devel version of bioC on our cluster(we're at 3.4), is the pruning mode available without using 3.5?

gr_small <- keepSeqlevels(gr, c('Y', 'MT'))
genome(gr_small) <- 'hg38'
txdb2 <- makeTxDbFromGRanges(gr_small)
gs <- makeGenomicState(txdb2, style = 'Ensembl', currentStyle = 'Ensembl')

> gs <- makeGenomicState(txdb2, style = 'Ensembl', currentStyle = 'Ensembl')
'select()' returned 1:1 mapping between keys and columns
Error: logical subscript contains NAs

I'm going to guess that not using the pruning.mode is why I'm getting an error here?

Hi Andrew,

"pruning.mode" is a recent argument and has to be specified in Bioc-devel, but I meant to say previously that everything would work in Bioc-release if you dropped that part. You were close in your last attempt, but you forgot the crucial step of adding the chromosome lengths. I assume that you were using derfinder 1.8.0 because 1.8.1 would have printed an error related to the missing chromosome lengths. Like I said before, derfinder 1.8.1 will be available via biocLite() in like a day or two. In any case, even with derfinder 1.8.0, you can get everything to run if you specify the chromosome lengths.

Here's the code and output using Bioc-release (3.4).

Best,

Leo

Oh I see, somehow I missed that line.

Thanks!

No problem ^^

Hi Leo, everything above worked, but I'm having another issue now. Using the script that wa sworking before but with the new GRCH38 txdb, the out put of matchGenes isn't quite right. The, columns, name, annotation and geneL, are all NA's.  Any Idea what this might be?

annotation_obj <- annotateTranscripts(txdb)
nearest_annotation_df <- matchGenes(der_grange_obj, subject = annotation_obj)

write_tsv(nearest_annotation_df, str_c(output_dir, '/nearest_annotation_df.tsv'))

name    annotation    description    region    distance    subregion    insideDistance    exonnumber    nexons    UTR    strand    geneL    codingL    Entrez    subjectHits
NA    NA    inside intron    inside    14741.0    inside intron    175.0    10.0    11    inside transcription region    -    15166.0    NA    ENSG00000227232    13715

Hi Andrew,

You should post this as a new question instead of a comment.

I'm not sure, but this would be a bumphunter issue, not a derfinder one. Most likely those functions assume that the TxDb object has a specific structure. For example, that the gene names are Entrez IDs instead of Gencode ids.

Best,

Leo

Thanks! I'll see if the Bumphunter documentation has anything to say.

Yeah, it's definitely a bumphunter issue as you can see at https://github.com/Bioconductor-mirror/bumphunter/blob/515c9c45a6a1b1fa03d1936fb946d8fbea184710/R/matchGenes.R#L182-L202. The current code assumes that you are working with Entrez gene ids. It should be possible to make the code more flexible to work with other types of gene ids.