Hello,
I am trying to make a TxDb from a .gff of sncRNA obtained from the DASHR database (DASHR v2.0 hg38 sncRNA annotation [GFF]). I had to do a little formatting to remove the 10th column from some lines, but once that was done I tried importing the file and making a TxDb with makeTxDbFromGFF, and received the following error:
> TxDb <- makeTxDbFromGFF(file = "/data2/csijcs/hg38/dashr.v2.sncRNA.annotation.hg38.edited.gff", format="auto")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .merge_transcript_parts(transcripts) :
The following transcripts have multiple parts that cannot be merged
because of incompatible type: U13, U3, U8
I tried removing those lines, but got even more errors:
> TxDb <- makeTxDbFromGFF(file = "/data2/csijcs/hg38/dashr.v2.sncRNA.annotation.hg38.edited.noU13U3U6U8.gff", format="auto")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .merge_transcript_parts(transcripts) :
The following transcripts have multiple parts that cannot be merged
because of incompatible seqnames: 5S, LSU-rRNA_Hsa, SSU-rRNA_Hsa, U1,
U14, U17, U2, U4, U5, U6, U7
Is it possible to make a TxDb for this annotation file? I am trying to perform differential expression with DESeq.
Here is my sessionInfo:
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
Matrix products: default
BLAS: /home/csijcs/anaconda2/lib/R/lib/libRblas.so
LAPACK: /home/csijcs/anaconda2/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] rtracklayer_1.40.6
[2] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0
[3] apeglm_1.2.1
[4] tximportData_1.8.0
[5] readr_1.1.1
[6] tximport_1.8.0
[7] RColorBrewer_1.1-2
[8] ggplot2_3.1.0
[9] DESeq2_1.20.0
[10] SummarizedExperiment_1.10.1
[11] DelayedArray_0.6.6
[12] BiocParallel_1.14.2
[13] matrixStats_0.54.0
[14] GenomicFeatures_1.32.3
[15] AnnotationDbi_1.42.1
[16] Biobase_2.40.0
[17] GenomicRanges_1.32.7
[18] GenomeInfoDb_1.16.0
[19] IRanges_2.14.12
[20] S4Vectors_0.18.3
[21] BiocGenerics_0.26.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 mirbase.db_1.2.0 bit64_0.9-7
[4] progress_1.2.0 httr_1.3.1 numDeriv_2016.8-1
[7] tools_3.5.0 backports_1.1.2 R6_2.3.0
[10] rpart_4.1-13 Hmisc_4.1-1 DBI_1.0.0
[13] lazyeval_0.2.1 colorspace_1.3-2 nnet_7.3-12
[16] withr_2.1.2 tidyselect_0.2.5 gridExtra_2.3
[19] prettyunits_1.0.2 bit_1.1-14 compiler_3.5.0
[22] htmlTable_1.12 scales_1.0.0 checkmate_1.8.5
[25] genefilter_1.62.0 stringr_1.3.1 digest_0.6.18
[28] Rsamtools_1.32.3 foreign_0.8-71 XVector_0.20.0
[31] base64enc_0.1-3 pkgconfig_2.0.2 htmltools_0.3.6
[34] bbmle_1.0.20 htmlwidgets_1.3 rlang_0.3.0.1
[37] rstudioapi_0.8 RSQLite_2.1.1 bindr_0.1.1
[40] acepack_1.4.1 dplyr_0.7.8 RCurl_1.95-4.11
[43] magrittr_1.5 GenomeInfoDbData_1.1.0 Formula_1.2-3
[46] Matrix_1.2-15 Rcpp_1.0.0 munsell_0.5.0
[49] stringi_1.2.4 MASS_7.3-51.1 zlibbioc_1.26.0
[52] plyr_1.8.4 grid_3.5.0 blob_1.1.1
[55] crayon_1.3.4 lattice_0.20-38 Biostrings_2.48.0
[58] splines_3.5.0 annotate_1.58.0 hms_0.4.2
[61] locfit_1.5-9.1 knitr_1.20 pillar_1.3.0
[64] geneplotter_1.58.0 biomaRt_2.36.1 XML_3.98-1.16
[67] glue_1.3.0 latticeExtra_0.6-28 data.table_1.11.8
[70] BiocManager_1.30.4 gtable_0.2.0 purrr_0.2.5
[73] assertthat_0.2.0 emdbook_1.3.10 xtable_1.8-3
[76] coda_0.19-2 survival_2.43-1 tibble_1.4.2
[79] GenomicAlignments_1.16.0 memoise_1.1.0 bindrcpp_0.2.2
[82] cluster_2.0.7-1
I've tested this with a modified file (10th column removed, since GFF files have 9 columns) and got the same error as you. The error is thrown by the .merge_transcript_parts() function in GenomicFeatures. In essence, it seems the reason you are getting the error is that the ID=<something> attribute in the 9th column of your file is not unique (e.g. ID=U4 appears on more than one line). Lines that share an ID appear to be treated as parts of the same transcript, and those parts cannot be merged when they differ in type or in seqnames, which matches the two error messages you got. To get makeTxDbFromGFF() to work, it seems that you need to make all of the ID values in the 9th column of your GFF file unique, or remove the lines with non-unique IDs.
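
If you would rather keep every line than delete the duplicated ones, here is a minimal sketch of how the IDs could be made unique before calling makeTxDbFromGFF(). It is written in Python only because it is plain text processing, and I have not run it on the actual DASHR file. The function names (parse_id, replace_id, make_ids_unique) are my own, not part of any package. It assumes tab-separated lines with 9 columns, attributes of the form ID=<something> separated by ";", and no Parent= attributes pointing at the renamed IDs.

```python
from collections import Counter


def parse_id(line):
    """Return the ID= value from the 9th column, or None for comment/short lines."""
    if line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 9:
        return None
    for attr in fields[8].split(";"):
        attr = attr.strip()
        if attr.startswith("ID="):
            return attr[3:]
    return None


def replace_id(line, new_id):
    """Return the line with its first ID= attribute replaced by new_id."""
    fields = line.split("\t")
    attrs = fields[8].split(";")
    for i, attr in enumerate(attrs):
        if attr.strip().startswith("ID="):
            attrs[i] = "ID=" + new_id
            break
    fields[8] = ";".join(attrs)
    return "\t".join(fields)


def make_ids_unique(lines):
    """Append _1, _2, ... to every ID that occurs on more than one line.

    IDs that are already unique, comment lines and malformed lines are
    returned unchanged. Assumption: nothing else in the file (for example
    a Parent= attribute) refers to the IDs being renamed.
    """
    lines = [line.rstrip("\n") for line in lines]
    counts = Counter(i for i in map(parse_id, lines) if i is not None)
    used = set(counts)          # every ID already present in the file
    next_suffix = Counter()     # last suffix tried for each duplicated ID
    out = []
    for line in lines:
        feature_id = parse_id(line)
        if feature_id is None or counts[feature_id] == 1:
            out.append(line)
            continue
        # Skip suffixes that would collide with an ID already in use.
        while True:
            next_suffix[feature_id] += 1
            candidate = f"{feature_id}_{next_suffix[feature_id]}"
            if candidate not in used:
                break
        used.add(candidate)
        out.append(replace_id(line, candidate))
    return out


# Example usage (the file names here are placeholders):
# with open("annotation.gff") as src:
#     fixed = make_ids_unique(src)
# with open("annotation.unique_ids.gff", "w") as dst:
#     dst.write("\n".join(fixed) + "\n")
```

Keep in mind that after this renaming each locus becomes its own transcript (U4_1, U4_2, ...), so reads from multi-copy sncRNAs would be counted per locus rather than per RNA name. Whether that is what you want for your differential expression analysis is a separate question.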