Hi Arjen,
Hmm.. I'm surprised you can import this GTF file at all. Are we talking about the GTF file GCF_000317375.1_MicOch1.0_genomic.gtf
located here? I get the following error when I try to import it with makeTxDbFromGFF()
:
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("GCF_000317375.1_MicOch1.0_genomic.gtf")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... Error in .make_splicings(exons, cds, stop_codons) :
# some CDS cannot be mapped to an exon
# In addition: Warning message:
# In .get_cds_IDX(mcols0$type, mcols0$phase) :
# The "phase" metadata column contains non-NA values for features of type
# stop_codon. This information was ignored.
But if I use the GFF3 file (from the same location), everything is fine:
txdb <- makeTxDbFromGFF("GCF_000317375.1_MicOch1.0_genomic.gff")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning messages:
# 1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
# some transcripts have no "transcript_id" attribute ==> their name
# ("tx_name" column in the TxDb object) was set to NA
# 2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
# the transcript names ("tx_name" column in the TxDb object) imported
# from the "transcript_id" attribute are not unique
# 3: In .find_exon_cds(exons, cds) :
# The following transcripts have exons that contain more than one CDS
# (only the first CDS was kept for each exon): rna-NM_001289870.1,
# rna-NM_001290102.1, rna-NM_001290499.1, rna-NM_001291243.1
See my sessionInfo()
below.
About the unnamed transcripts: I do see an unnamed transcript at coordinates 29621640-29622008 like you did:
transcripts(txdb)[4994:4998]
# GRanges object with 5 ranges and 2 metadata columns:
# seqnames ranges strand | tx_id tx_name
# <Rle> <IRanges> <Rle> | <integer> <character>
# [1] NC_022011.1 29591627-29617240 - | 4994 XM_005345783.2
# [2] NC_022011.1 29591875-29614709 - | 4995 XM_013354377.2
# [3] NC_022011.1 29621640-29622008 - | 4996 <NA>
# [4] NC_022011.1 29655176-29690024 - | 4997 XM_005345785.3
# [5] NC_022011.1 29698713-29726003 - | 4998 XM_005345787.3
# -------
# seqinfo: 571 sequences from an unspecified genome; no seqlengths
This is actually what I would expect based on what I see in the GFF3 file:
NC_022011.1 Gnomon pseudogene 29621640 29622008 . - . ID=gene-LOC101999479;Dbxref=GeneID:101999479;Name=LOC101999479;gbkey=Gene;gene=LOC101999479;...
NC_022011.1 Gnomon exon 29621640 29622008 . - . ID=id-LOC101999479;Parent=gene-LOC101999479;Dbxref=GeneID:101999479;gbkey=exon;gene=LOC101999479;...
This is a gene (pseudogene to be precise) with no reported transcript but one reported exon. What happened here is that makeTxDbFromGFF()
treated it as if it had one transcript that spans the entire gene/exon. This is a feature. However it seems that makeTxDbFromGFF()
didn't assign a name to this inferred transcript, which may sound unfortunate, but the name of the associated gene is not far away:
transcripts(txdb, columns=c("gene_id", "tx_id", "tx_name"))[4994:4998]
# GRanges object with 5 ranges and 3 metadata columns:
# seqnames ranges strand | gene_id tx_id
# <Rle> <IRanges> <Rle> | <CharacterList> <integer>
# [1] NC_022011.1 29591627-29617240 - | Klhdc4 4994
# [2] NC_022011.1 29591875-29614709 - | Klhdc4 4995
# [3] NC_022011.1 29621640-29622008 - | LOC101999479 4996
# [4] NC_022011.1 29655176-29690024 - | Slc7a5 4997
# [5] NC_022011.1 29698713-29726003 - | Ca5a 4998
# tx_name
# <character>
# [1] XM_005345783.2
# [2] XM_013354377.2
# [3] <NA>
# [4] XM_005345785.3
# [5] XM_005345787.3
# -------
# seqinfo: 571 sequences from an unspecified genome; no seqlengths
So it's easy to assign the gene name to all the unnamed transcripts with something like:
tx <- transcripts(txdb, columns=c("gene_id", "tx_id", "tx_name"))
unnamed_tx_idx <- which(is.na(mcols(tx)$tx_name))
mcols(tx)$tx_name[unnamed_tx_idx] <- as.character(mcols(tx)$gene_id[unnamed_tx_idx])
tx[4994:4998]
# GRanges object with 5 ranges and 3 metadata columns:
# seqnames ranges strand | gene_id tx_id
# <Rle> <IRanges> <Rle> | <CharacterList> <integer>
# [1] NC_022011.1 29591627-29617240 - | Klhdc4 4994
# [2] NC_022011.1 29591875-29614709 - | Klhdc4 4995
# [3] NC_022011.1 29621640-29622008 - | LOC101999479 4996
# [4] NC_022011.1 29655176-29690024 - | Slc7a5 4997
# [5] NC_022011.1 29698713-29726003 - | Ca5a 4998
# tx_name
# <character>
# [1] XM_005345783.2
# [2] XM_013354377.2
# [3] LOC101999479
# [4] XM_005345785.3
# [5] XM_005345787.3
# -------
# seqinfo: 571 sequences from an unspecified genome; no seqlengths
As for why makeTxDbFromGFF()
choked on the GTF file, I'm not sure, I didn't investigate. Maybe the file is too broken for makeTxDbFromGFF()
's taste? Even if NCBI, UCSC, and Ensembl are still using it, GTF is a deprecated format and GFF3 should be used whenever possible. See http://gmod.org/wiki/GFF2.
Hope this helps,
H.
sessionInfo():
R version 4.1.0 beta (2021-05-06 r80268)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.04
Matrix products: default
BLAS: /home/hpages/R/R-4.1.r80268/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r80268/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicFeatures_1.44.2 AnnotationDbi_1.54.1 Biobase_2.52.0
[4] GenomicRanges_1.44.0 GenomeInfoDb_1.28.4 IRanges_2.26.0
[7] S4Vectors_0.31.0 BiocGenerics_0.38.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 lattice_0.20-45
[3] prettyunits_1.1.1 png_0.1-7
[5] Rsamtools_2.8.0 Biostrings_2.60.2
[7] assertthat_0.2.1 digest_0.6.28
[9] utf8_1.2.2 BiocFileCache_2.0.0
[11] R6_2.5.1 RSQLite_2.2.8
[13] httr_1.4.2 pillar_1.6.3
[15] zlibbioc_1.38.0 rlang_0.4.11
[17] progress_1.2.2 curl_4.3.2
[19] rstudioapi_0.13 blob_1.2.2
[21] Matrix_1.3-4 BiocParallel_1.26.2
[23] stringr_1.4.0 RCurl_1.98-1.5
[25] bit_4.0.4 biomaRt_2.48.3
[27] DelayedArray_0.18.0 compiler_4.1.0
[29] rtracklayer_1.52.1 pkgconfig_2.0.3
[31] tidyselect_1.1.1 KEGGREST_1.32.0
[33] SummarizedExperiment_1.22.0 tibble_3.1.5
[35] GenomeInfoDbData_1.2.6 matrixStats_0.61.0
[37] XML_3.99-0.8 fansi_0.5.0
[39] crayon_1.4.1 dplyr_1.0.7
[41] dbplyr_2.1.1 GenomicAlignments_1.28.0
[43] bitops_1.0-7 rappdirs_0.3.3
[45] grid_4.1.0 lifecycle_1.0.1
[47] DBI_1.1.1 magrittr_2.0.1
[49] stringi_1.7.5 cachem_1.0.6
[51] XVector_0.32.0 xml2_1.3.2
[53] ellipsis_0.3.2 filelock_1.0.2
[55] generics_0.1.0 vctrs_0.3.8
[57] rjson_0.2.20 restfulr_0.0.13
[59] tools_4.1.0 bit64_4.0.5
[61] glue_1.4.2 purrr_0.3.4
[63] MatrixGenerics_1.4.3 hms_1.1.1
[65] fastmap_1.1.0 yaml_2.2.1
[67] memoise_2.0.0 BiocIO_1.2.0
Yes! Wow, with this information I understood how to replace NA values in a GRanges object (that I made from my gff) and use that Granges object to make a txdb that I can use in the GeneRegionTrack of GVIZ. I am so happy...!
Thanks a lot, Arjen
P.S. I must have put the wrong the link for the GTF, as I was also using the GFF file. Sorry about that....!
Hi I'm having the same issue with NAs in my txdb: some entries do not have transcript_id= atributes. This won't work bc I need the txdb object itself to get cds using cdsBy().
Is there a way I can create transcript_id=gene_id for these names = <NA> entries or even just drop them in any step of this? Thank you
Hi Thank you for the reply! I need to preserve the txdb structure in order to pass it in
is there a way to then retrieve it based on the modified
tx
?Rather than tacking this on to the end of an existing thread, please post a new question. And tag it with the packages you are using, and show how you get the 'dna' object.