Warning in makeTxDbFromGFF
@weir-21040

Hi, I'm getting some warnings from makeTxDbFromGFF(). Here is the full output:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: /home/weir/RNAedit/human_test/reference/GCF_000001405.38_GRCh38.p12_genomic.gff
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 178581
# exon_nrow: 1945509
# cds_nrow: 1460272
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-06-17 22:31:22 +0800 (Mon, 17 Jun 2019)
# GenomicFeatures version at creation time: 1.34.8
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2
Warning messages:
1: In .extract_exons_from_GRanges(exon_IDX, gr, ID, Name, Parent, feature = "exon",  :
  The following orphan exon were dropped (showing only the 6 first):
         seqid     start       end strand                     ID
1 NC_000001.11  15542166  15542304      +     exon-NR_135613.1-1
2 NC_000001.11  27834401  27834566      +     exon-NR_002997.1-1
3 NC_000001.11 109100193 109100612      +     exon-NR_003023.1-1
4 NC_000001.11 144875032 144875095      - exon-id-LOC107985528-1
5 NC_000001.11 144874355 144874907      - exon-id-LOC107985528-2
6 NC_000001.11 155679108 155679255      -     exon-NR_132762.1-1
           Parent                   Name
1 rna-NR_135613.1     exon-NR_135613.1-1
2 rna-NR_002997.1     exon-NR_002997.1-1
3 rna-NR_003023.1     exon-NR_003023.1-1
4 id-LOC107985528 exon-id-LOC107985528-1
5 id-LOC107985528 exon-id-LOC107985528-2
6 rna-NR_132762.1     exon-NR_132762.1-1
2: In .extract_exons_from_GRanges(cds_IDX, gr, ID, Name, Parent, feature = "cds",  :
  The following orphan CDS were dropped (showing only the 6 first):
         seqid     start       end strand               ID          Parent Name
1 NC_000001.11 144875032 144875080      - cds-LOC107985528 id-LOC107985528 <NA>
2 NC_000001.11 144874585 144874907      - cds-LOC107985528 id-LOC107985528 <NA>
3 NC_000002.12  88857361  88857683      -         cds-IGKC         id-IGKC <NA>
4 NC_000002.12  88860568  88860605      -        cds-IGKJ5        id-IGKJ5 <NA>
5 NC_000002.12  88860886  88860923      -        cds-IGKJ4        id-IGKJ4 <NA>
6 NC_000002.12  88861221  88861258      -        cds-IGKJ3        id-IGKJ3 <NA>
3: In .find_exon_cds(exons, cds) :
  The following transcripts have exons that contain more than one CDS
  (only the first CDS was kept for each exon): rna-NM_001134939.1,
  rna-NM_001172437.2, rna-NM_001184961.1, rna-NM_001301020.1,
  rna-NM_001301302.1, rna-NM_001301371.1, rna-NM_002537.3,
  rna-NM_004152.3, rna-NM_015068.3, rna-NM_016178.2

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

Matrix products: default
BLAS/LAPACK: /home/weir/anaconda3/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] GenomicFeatures_1.34.8 AnnotationDbi_1.44.0   Biobase_2.40.0
[4] GenomicRanges_1.34.0   GenomeInfoDb_1.16.0    IRanges_2.16.0
[7] S4Vectors_0.20.1       AnnotationHub_2.12.1   BiocGenerics_0.28.0

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.12.0   progress_1.2.0
 [3] lattice_0.20-38               htmltools_0.3.6
 [5] rtracklayer_1.42.0            yaml_2.2.0
 [7] interactiveDisplayBase_1.18.0 blob_1.1.1
 [9] XML_3.98-1.12                 rlang_0.3.4
[11] later_0.8.0                   DBI_1.0.0
[13] BiocParallel_1.16.0           bit64_0.9-7
[15] matrixStats_0.54.0            GenomeInfoDbData_1.1.0
[17] stringr_1.4.0                 zlibbioc_1.26.0
[19] Biostrings_2.48.0             memoise_1.1.0
[21] biomaRt_2.38.0                httpuv_1.5.1
[23] BiocInstaller_1.30.0          curl_3.3
[25] Rcpp_1.0.1                    xtable_1.8-3
[27] promises_1.0.1                DelayedArray_0.8.0
[29] XVector_0.22.0                mime_0.6
[31] bit_1.1-12                    Rsamtools_1.34.0
[33] hms_0.4.2                     digest_0.6.18
[35] stringi_1.4.3                 shiny_1.2.0
[37] grid_3.5.1                    tools_3.5.1
[39] bitops_1.0-6                  magrittr_1.5
[41] RCurl_1.95-4.12               RSQLite_2.1.1
[43] crayon_1.3.4                  pkgconfig_2.0.2
[45] Matrix_1.2-17                 prettyunits_1.0.2
[47] assertthat_0.2.1              httr_1.4.0
[49] R6_2.4.0                      GenomicAlignments_1.18.0
[51] compiler_3.5.1

The GFF file was downloaded from https://www.ncbi.nlm.nih.gov/genome/?term=human

Can someone help me?

Best wishes,
weir

@herve-pages-1542
Seattle, WA, United States

Hi,

The first 2 warnings indicate that the file contains exons and CDS that were dropped because they couldn't be linked to a transcript. I just improved the warning message (the change is in GenomicFeatures 1.36.3) so it displays the number of exons or CDS that get dropped:

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("ref_GRCh38.p12_top_level.gff3")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning messages:
# 1: In .extract_exons_from_GRanges(exon_IDX, gr, mcols0, tx_IDX, feature="exon",:
#   1558 exons couldn't be linked to a transcript so were dropped
#   (showing only the first 6):
#          seqid     start       end strand       ID     Name ...
# 1 NC_000001.11 144875032 144875095      - id105387 id105387 ...
# 2 NC_000001.11 144874355 144874907      - id105388 id105388 ...
# 3 NC_000002.12  88857361  88857683      - id241515 id241515 ...
# 4 NC_000002.12  88860568  88860605      - id241517 id241517 ...
# 5 NC_000002.12  88860886  88860923      - id241519 id241519 ...
# 6 NC_000002.12  88861221  88861258      - id241521 id241521 ...
# 2: In .extract_exons_from_GRanges(cds_IDX, gr, mcols0, tx_IDX, feature="cds",:
#   1553 CDS couldn't be linked to a transcript so were dropped
#   (showing only the first 6):
#          seqid     start       end strand       ID Name ...
# 1 NC_000001.11 144875032 144875080      -  cds6180 <NA> ...
# 2 NC_000001.11 144874585 144874907      -  cds6180 <NA> ...
# 3 NC_000002.12  88857361  88857683      - cds14156 <NA> ...
# 4 NC_000002.12  88860568  88860605      - cds14157 <NA> ...
# 5 NC_000002.12  88860886  88860923      - cds14158 <NA> ...
# 6 NC_000002.12  88861221  88861258      - cds14159 <NA> ...
# 3: In .find_exon_cds(exons, cds) :
#   The following transcripts have exons that contain more than
#   one CDS (only the first CDS was kept for each exon): rna116402,
#   rna116403, rna137565, rna137566, rna63759, rna63761,
#   rna63764, rna9689, rna9690, rna9691

Note that the file contains some rare transcript types (scRNA, guide_RNA, telomerase_RNA, vault_RNA, Y_RNA; these are valid Sequence Ontology terms) that makeTxDbFromGFF() didn't recognize as transcripts, which is why the exons and CDS linked to these transcripts were getting dropped. In GenomicFeatures 1.36.3 I added these types to the list of types that should be treated as transcripts, so makeTxDbFromGFF() now drops slightly fewer exons. As a consequence, the TxDb object I get contains a few (44) more transcripts and exons than the one you got with your version of GenomicFeatures:

> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: ref_GRCh38.p12_top_level.gff3
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 178625
# exon_nrow: 1945553
# cds_nrow: 1460272
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-06-22 17:05:56 -0700 (Sat, 22 Jun 2019)
# GenomicFeatures version at creation time: 1.37.3
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2

The exons and CDS that still get dropped with GenomicFeatures 1.36.3 are linked to features of type C_gene_segment, D_gene_segment, J_gene_segment, and V_gene_segment. However, these Sequence Ontology terms are not offspring of the transcript term, so I'm reluctant to add them to the list of types that makeTxDbFromGFF() should treat as transcripts. But if someone wants to make the case for adding these terms, I'm open to it.
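To see which parent types are responsible for dropped exons and CDS in a given file, one can tabulate the type of each orphan's parent. The following is only a minimal base-R sketch on toy data: the `features` data frame, its values, and the short `tx_types` vector are illustrative assumptions, not the list that GenomicFeatures uses internally.

```r
# Toy stand-in for the ID/type/Parent columns of a GFF3 file (illustrative values).
features <- data.frame(
  ID     = c("gene1", "rna1", "exon1", "id-IGKC", "exon2", "cds-IGKC"),
  type   = c("gene", "mRNA", "exon", "C_gene_segment", "exon", "CDS"),
  Parent = c(NA, "gene1", "rna1", "gene-IGKC", "id-IGKC", "id-IGKC"),
  stringsAsFactors = FALSE
)

# Types assumed to count as transcripts in this sketch (an assumption;
# the real list lives inside GenomicFeatures and is longer).
tx_types <- c("mRNA", "transcript", "lnc_RNA")

# Look up the type of each exon/CDS parent, then keep the features whose
# parent is not a transcript type: these are the "orphans".
children <- features[features$type %in% c("exon", "CDS"), ]
children$parent_type <- features$type[match(children$Parent, features$ID)]
orphans <- children[!(children$parent_type %in% tx_types), ]

# Count the orphans by the type of their parent.
orphan_parent_types <- table(orphans$parent_type, useNA = "ifany")
print(orphan_parent_types)
```

With a real file, the same three columns would first have to be extracted from the imported GFF; there the Parent column can hold several parents per feature, which this sketch does not handle.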

Finally, the 3rd warning should be self-explanatory: on rare occasions a GFF3 file can contain a few exons with more than one CDS. makeTxDbFromGFF() does not know how to import more than one CDS per exon at the moment, so the warning just says that only the first CDS was kept for each such exon.
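The situation behind the 3rd warning can be illustrated with a small base-R sketch that counts, for each exon, how many CDS of the same transcript fall entirely inside it. The toy `exons` and `cds` data frames and their coordinates are made up for illustration; this is not how makeTxDbFromGFF() detects the case internally.

```r
# Toy exon and CDS ranges, keyed by transcript (illustrative coordinates).
exons <- data.frame(
  tx    = c("rna1", "rna1", "rna2"),
  start = c(100, 500, 100),
  end   = c(300, 700, 300),
  stringsAsFactors = FALSE
)
cds <- data.frame(
  tx    = c("rna1", "rna1", "rna1", "rna2"),
  start = c(120, 250, 520, 150),
  end   = c(200, 290, 600, 300),
  stringsAsFactors = FALSE
)

# For each exon, count the CDS of the same transcript lying entirely within it.
exons$n_cds <- vapply(seq_len(nrow(exons)), function(i) {
  sum(cds$tx == exons$tx[i] &
      cds$start >= exons$start[i] &
      cds$end <= exons$end[i])
}, integer(1))

# Transcripts with at least one exon that contains more than one CDS.
multi_cds_tx <- unique(exons$tx[exons$n_cds > 1])
print(multi_cds_tx)
```

In this toy example only the first exon of rna1 contains two CDS, so rna1 is the only transcript that would be reported.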

GenomicFeatures 1.36.3 should become available to Bioconductor 3.9 users in about 24-48 hours via BiocManager::install(). Note that you're using Bioconductor 3.8 which is not the current release and is no longer supported.
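A quick way to tell whether an installation already has the fix is to compare version numbers. The sketch below uses only base R; `needs_update` is a hypothetical helper name, and the install call is left commented out because it needs a network connection and a Bioconductor 3.9 setup.

```r
# Hypothetical helper: TRUE when the installed version is older than the required one.
needs_update <- function(installed, required = "1.36.3") {
  package_version(installed) < package_version(required)
}

needs_update("1.34.8")  # the version shown in the sessionInfo() of the question
needs_update("1.36.3")

# In a live session the installed version could be read and the package updated with:
# needs_update(as.character(packageVersion("GenomicFeatures")))
# BiocManager::install("GenomicFeatures")
```

Since 1.36.3 belongs to Bioconductor 3.9, an installation still on Bioconductor 3.8 would need to be moved to the newer release first.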

Cheers,

H.


Did that help?


Hi, I ran into the same problem. Following your explanation above, I checked the warnings using the command grep -E "pre_miRNA" $HOME/datax/Genomes/IRGSP-1.0-release/gff/IRGSP-1.0.50.chrs.gff3 | grep -vE "ncRNA_gene" > items, which returned all the items listed in the warnings. So the warnings seem to be caused by features of type "ncRNA_gene" or "pre_miRNA", but I don't know how to deal with that. Could you give me some suggestions?

1   Ensembl_Plants  pre_miRNA   1215030     1215219     .   +   .   ID=transcript:ENSRNA049471381-T1;Parent=gene:ENSRNA049471381;biotype=pre_miRNA;transcript_id=ENSRNA049471381-T1
1   Ensembl_Plants  pre_miRNA   3439717     3439781     .   -   .   ID=transcript:ENSRNA049471191-T1;Parent=gene:ENSRNA049471191;biotype=pre_miRNA;transcript_id=ENSRNA049471191-T1
1   Ensembl_Plants  pre_miRNA   6556046     6556244     .   -   .   ID=transcript:ENSRNA049471356-T1;Parent=gene:ENSRNA049471356;biotype=pre_miRNA;transcript_id=ENSRNA049471356-T1
1   Ensembl_Plants  pre_miRNA   6563754     6563944     .   -   .   ID=transcript:ENSRNA049471305-T1;Parent=gene:ENSRNA049471305;biotype=pre_miRNA;transcript_id=ENSRNA049471305-T1
1   Ensembl_Plants  pre_miRNA   6677568     6677765     .   -   .   ID=transcript:ENSRNA049471314-T1;Parent=gene:ENSRNA049471314;biotype=pre_miRNA;transcript_id=ENSRNA049471314-T1
1   Ensembl_Plants  pre_miRNA   6693112     6693301     .   +   .   ID=transcript:ENSRNA049471366-T1;Parent=gene:ENSRNA049471366;biotype=pre_miRNA;transcript_id=ENSRNA049471366-T1



I added "pre_miRNA" to the end of the ".TX_TYPES" vector defined in "makeTxDbFromGRanges.R" and then re-installed the package. After that it worked well, with no warnings reported.


I'm only seeing your post now, sorry.

Unfortunately according to the Sequence Ontology, pre_miRNA is not an offspring of transcript via the _is_a_ relationship, only via the _part_of_ relationship, which means that features of type pre_miRNA are not considered transcripts. So I'm not sure that it was a good idea for the author of this GFF3 file to use the pre_miRNA term for the purpose of describing the exon/transcript structure of the associated genes. Maybe they should have used miRNA_primary_transcript instead?

Anyways, because pre_miRNA are not transcripts, I'm reluctant to add the term to the list of terms that makeTxDbFromGRanges() and makeTxDbFromGFF() should treat as transcripts.
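For anyone who would rather not patch the package, one conceivable workaround (an assumption on my part, not something tested or endorsed in this thread) is to rewrite the type column in a copy of the GFF3 file before importing it. The base-R sketch below changes only column 3 of tab-separated feature lines; `retype_gff_lines` is a hypothetical helper, and whether the replacement term is treated as a transcript depends on the GenomicFeatures version in use.

```r
# Hypothetical helper: replace one feature type (column 3) in GFF3 lines.
# Comment/directive lines (starting with "#") and empty lines are left untouched.
retype_gff_lines <- function(lines, from = "pre_miRNA",
                             to = "miRNA_primary_transcript") {
  is_feature <- !grepl("^#", lines) & nzchar(lines)
  fields <- strsplit(lines[is_feature], "\t", fixed = TRUE)
  lines[is_feature] <- vapply(fields, function(f) {
    if (length(f) >= 3 && f[3] == from) f[3] <- to
    paste(f, collapse = "\t")
  }, character(1))
  lines
}

# Example on one of the records quoted above (tabs assumed as separators).
gff <- c(
  "##gff-version 3",
  paste("1", "Ensembl_Plants", "pre_miRNA", "1215030", "1215219", ".", "+", ".",
        "ID=transcript:ENSRNA049471381-T1;Parent=gene:ENSRNA049471381;biotype=pre_miRNA",
        sep = "\t")
)
fixed <- retype_gff_lines(gff)
cat(fixed, sep = "\n")
```

Only the type column changes; the biotype attribute in column 9 is kept as-is. Note that this alters the annotation relative to what the provider published, so the modified file should be kept separate from the original.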

Best,

H.
