Genomic Features error in TxDbfromGFF
1
0
Entering edit mode
sieminsk • 0
@3fc834aa
Last seen 3 months ago
Canada

I am attempting to generate a taxonomic database with Genomic Features package. I have attached the code and the error output below. The gff file was obtained from NCBI RefSeq Genomes database. Do I have to parse the gff file and if yes how to be able to use it with the package.

gffmodel <- file.path(dataDir, "GCF_000686985.2_Bra_napus_v2.0_genomic.gff")
(txdb <- makeTxDbFromGFF(gffmodel, format="gff"))

Output: 
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'table' in selecting a method for function '%in%': subscript contains NAs

sessionInfo( )

R version 4.0.2 (2020-06-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS/LAPACK: /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/imkl/2020.1.217/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] BiocParallel_1.24.1
[2] GenomicAlignments_1.26.0
[3] Rsamtools_2.6.0
[4] SummarizedExperiment_1.20.0
[5] MatrixGenerics_1.2.0
[6] matrixStats_0.57.0
[7] GenomicFeatures_1.42.1
[8] AnnotationDbi_1.52.0
[9] Biobase_2.50.0
[10] BSgenome.Athaliana.TAIR.TAIR9_1.3.1000 [11] BiocManager_1.30.10
[12] BSgenome_1.58.0
[13] rtracklayer_1.50.0
[14] GenomicRanges_1.42.0
[15] GenomeInfoDb_1.26.2
[16] stringr_1.4.0
[17] Biostrings_2.58.0
[18] XVector_0.30.0
[19] IRanges_2.24.1
[20] S4Vectors_0.28.1
[21] BiocGenerics_0.36.0

loaded via a namespace (and not attached): [1] progress_1.2.2 tidyselect_1.1.0 purrr_0.3.4
[4] lattice_0.20-41 vctrs_0.3.5 generics_0.1.0
[7] BiocFileCache_1.14.0 blob_1.2.1 XML_3.99-0.5
[10] rlang_0.4.9 pillar_1.4.7 glue_1.4.2
[13] DBI_1.1.0 rappdirs_0.3.1 bit64_4.0.5
[16] dbplyr_2.0.0 GenomeInfoDbData_1.2.4 lifecycle_0.2.0
[19] zlibbioc_1.36.0 memoise_1.1.0 biomaRt_2.46.0
[22] curl_4.3 Rcpp_1.0.5 openssl_1.4.3
[25] DelayedArray_0.16.0 bit_4.0.4 hms_0.5.3
[28] askpass_1.1 digest_0.6.27 stringi_1.5.3
[31] dplyr_1.0.2 grid_4.0.2 tools_4.0.2
[34] bitops_1.0-6 magrittr_2.0.1 RCurl_1.98-1.2
[37] RSQLite_2.2.1 tibble_3.0.4 crayon_1.3.4
[40] pkgconfig_2.0.3 ellipsis_0.3.1 Matrix_1.2-18
[43] xml2_1.3.2 prettyunits_1.1.1 assertthat_0.2.1
[46] httr_1.4.2 R6_2.5.0 compiler_4.0.2

GenomicFeatures TxDb • 464 views
ADD COMMENT
0
Entering edit mode

Well, so it looks like this GFF file contains an exon with no Parent attribute (ID=id-NC_008285.1:25367..25761-4 on line 1796721). First time ever.

FWIW this is what the GFF specs say about this:

"Orphan" exons CDSs, and other features. Ab initio gene prediction programs call hypothetical exons and CDS's that are attached to the genomic sequence and not necessarily to a known transcript. To handle these features, you may either (1) create a placeholder mRNA and use it as the parent for the exon and CDS subfeatures; or (2) attach the exons and CDSs directly to the gene. This is allowed by SO because of the transitive nature of the part_of relationship.

But in this case they've attached the exon to... nothing! This breaks makeTxDbFromGRanges() which is used internally by makeTxDbFromGFF(). A fix is on its way.

H.

ADD REPLY
0
Entering edit mode

The file also contains some unusual trans-spliced genes (e.g. ID=gene-BRNAC_p045 at lines 1797501-1797505) which also break makeTxDbFromGRanges() so the fix will take a little bit longer.

ADD REPLY
0
Entering edit mode

Thank you.

ADD REPLY
0
Entering edit mode
@herve-pages-1542
Last seen 23 hours ago
Seattle, WA, United States

This is fixed in GenomicFeatures 1.42.3 (release) and 1.43.8 (devel). Both versions should become available in the next 48h or so thru BiocManager::install().

With GenomicFeatures 1.42.3:

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("GCF_000686985.2_Bra_napus_v2.0_genomic.gff")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning messages:
# 1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#   some transcripts have no "transcript_id" attribute ==> their name
#   ("tx_name" column in the TxDb object) was set to NA
# 2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#   the transcript names ("tx_name" column in the TxDb object) imported
#   from the "transcript_id" attribute are not unique
# 3: In .merge_transcript_parts(transcripts) :
#   The following transcripts were dropped because they have multiple parts
#   that could not be merged due to incompatible strands: gene-BRNAC_p045,
#   gene-BrnapMp015, gene-BrnapMp029, gene-BrnapMp070
# 4: In makeTxDbFromGRanges(gr, metadata = metadata) :
#   The following transcripts were dropped because their exon ranks could
#   not be inferred (either because the exons are not on the same
#   chromosome/strand or because they are not separated by introns):
#   gene-BRNAC_p045, gene-BrnapMp015, gene-BrnapMp029, gene-BrnapMp062,
#   gene-BrnapMp070

Please pay attention to the warnings. They give important clues about the problems found in the file.

A small number of transcripts couldn't be imported (see warnings 3 and 4) because they have exons on both strands (trans-splicing). TxDb objects don't support transcripts with trans-splicing.

Then:

txlens <- transcriptLengths(txdb, with.cds_len=TRUE)
head(txlens)
#   tx_id tx_name    gene_id nexon tx_len cds_len
# 1     1    <NA> BrnaMpl_p1     1   2973    2973
# 2     2    <NA> BrnaMpl_p5     1    411     411
# 3     3    <NA> BrnaMpl_p2     1   1077    1077
# 4     4    <NA> BrnaMpl_p3     1    954     954
# 5     5    <NA> BrnaMpl_p4     1   1089    1089
# 6     6    <NA> BrnaMpl_p6     1   3237    3237

tail(txlens)
#         tx_id        tx_name      gene_id nexon tx_len cds_len
# 157024 157024 XM_022715858.1 LOC111214013     4    864     744
# 157025 157025 XM_022715859.1 LOC111214014     1    432     432
# 157026 157026           <NA> LOC111214015     1   1186       0
# 157027 157027 XM_022715860.1 LOC111214016     1    998     906
# 157028 157028 XM_022715861.1 LOC111214017     5    605     420
# 157029 157029 XM_022715862.1 LOC111214018     5    590     480

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Thank you! I look forward to trying the new release.

ADD REPLY

Login before adding your answer.

Traffic: 652 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6