Question: makeTxDbFromGFF error "stop codons that cannot be mapped to an exon"
gravatar for marisa.e.miller
5 weeks ago by
marisa.e.miller0 wrote:


I am trying to use makeTxDbFromGFF to import my GFF3 file as a TxDb. Below is the code used and the links to the relevant files:

chromlen <- read.delim("Puccinia_coronata_avenae_12SD80.primary.chromlen", header = FALSE)
colnames(chromlen) <- c("chrom", "length")
chromlen$is_circular <- FALSE

txdb <- makeTxDbFromGFF(file="haus_expr_secreted_effectors_sd80_p.full.gff3",
                        dataSource="gff3 file of haustorial effectors from 12SD80",
                        organism="Puccinia coronata")

Most transcripts are successfully imported, but there are a number that aren't. The warning make TxDbFromGFF throws is "Warning message: In .reject_transcripts(bad_tx, because) : The following transcripts were rejected because they have stop codons that cannot be mapped to an exon." 


Here's an example of what one of these offending transcripts looks like:

000335F EVM     gene    37334   38135   .       -       .       ID=PCA_SD_18895

000335F EVM     mRNA    37334   38135   .       -       .       ID=PCA_SD_18895-T1;Parent=PCA_SD_18895;note=SECRETED:SignalP(1-28)

000335F EVM     exon    37334   37402   .       -       .       ID=evm.model.000335F.12.exon5;Parent=PCA_SD_18895-T1

000335F EVM     exon    37486   37561   .       -       .       ID=evm.model.000335F.12.exon4;Parent=PCA_SD_18895-T1

000335F EVM     exon    37661   37692   .       -       .       ID=evm.model.000335F.12.exon3;Parent=PCA_SD_18895-T1

000335F EVM     exon    37790   37874   .       -       .       ID=evm.model.000335F.12.exon2;Parent=PCA_SD_18895-T1

000335F EVM     exon    37966   38135   .       -       .       ID=evm.model.000335F.12.exon1;Parent=PCA_SD_18895-T1

000335F EVM     CDS     37334   37402   .       -       0       ID=cds.evm.model.000335F.12;Parent=PCA_SD_18895-T1

000335F EVM     CDS     37486   37561   .       -       1       ID=cds.evm.model.000335F.12;Parent=PCA_SD_18895-T1

000335F EVM     CDS     37661   37692   .       -       0       ID=cds.evm.model.000335F.12;Parent=PCA_SD_18895-T1

000335F EVM     CDS     37790   37874   .       -       1       ID=cds.evm.model.000335F.12;Parent=PCA_SD_18895-T1

000335F EVM     CDS     37966   38135   .       -       0       ID=cds.evm.model.000335F.12;Parent=PCA_SD_18895-T1

000335F EVM     start_codon     37400   37402   .       +       .       ID=PCA_SD_18895-T1:start;Parent=PCA_SD_18895-T1

000335F EVM     stop_codon      37966   37968   .       +       .       ID=PCA_SD_18895-T1:stop;Parent=PCA_SD_18895-T1


As you can see, the stop codon is within the last exon, rather than the last exon ending on the stop codon. Is there anything I can do to get these transcripts successfully imported?

Thank you,



> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.13.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicFeatures_1.28.5 AnnotationDbi_1.38.2   Biobase_2.36.2         GenomicRanges_1.28.6   GenomeInfoDb_1.12.3    IRanges_2.10.5        
[7] S4Vectors_0.14.7       BiocGenerics_0.22.1   

loaded via a namespace (and not attached):
 [1] bit64_0.9-7                splines_3.4.0              Formula_1.2-3              assertthat_0.2.0           latticeExtra_0.6-28       
 [6] blob_1.1.1                 GenomeInfoDbData_0.99.0    Rsamtools_1.28.0           pillar_1.2.3               RSQLite_2.1.1             
[11] backports_1.1.2            lattice_0.20-35            glue_1.2.0                 digest_0.6.15              RColorBrewer_1.1-2        
[16] XVector_0.16.0             checkmate_1.8.5            colorspace_1.3-2           htmltools_0.3.6            Matrix_1.2-14             
[21] plyr_1.8.4                 XML_3.98-1.11              pkgconfig_2.0.1            biomaRt_2.32.1             zlibbioc_1.22.0           
[26] purrr_0.2.5                scales_0.5.0               BiocParallel_1.10.1        htmlTable_1.12             tibble_1.4.2              
[31] ggplot2_2.2.1              SummarizedExperiment_1.6.5 nnet_7.3-12                lazyeval_0.2.1             survival_2.42-3           
[36] magrittr_1.5               memoise_1.1.0              GGally_1.4.0               foreign_0.8-70             tools_3.4.0               
[41] data.table_1.11.4          matrixStats_0.53.1         stringr_1.3.1              munsell_0.4.3              DelayedArray_0.2.7        
[46] cluster_2.0.7-1            bindrcpp_0.2.2             Biostrings_2.44.2          compiler_3.4.0             rlang_0.2.1               
[51] grid_3.4.0                 RCurl_1.95-4.10            rstudioapi_0.7             htmlwidgets_1.2            bitops_1.0-6              
[56] base64enc_0.1-3            gtable_0.2.0               DBI_1.0.0                  reshape_0.8.7              reshape2_1.4.3            
[61] R6_2.2.2                   GenomicAlignments_1.12.2   gridExtra_2.3              knitr_1.20                 dplyr_0.7.5               
[66] rtracklayer_1.36.6         bit_1.1-14                 bindr_0.1.1                Hmisc_4.1-1                stringi_1.2.2             
[71] Rcpp_0.12.17               rpart_4.1-13               acepack_1.4.1              tidyselect_0.2.4
ADD COMMENTlink modified 29 days ago by daniel.vantwisk30 • written 5 weeks ago by marisa.e.miller0
gravatar for daniel.vantwisk
29 days ago by
daniel.vantwisk30 wrote:

I've been looking through this case, more particularly at the function .find_exon_cds() that is a level higher than .reject_transcripts(), the function that is throwing the warning.

In this function, it undergoes a process of finding overlaps between the exons and stop_codon sequences, in particular, checking whether the stop_codon serves as a stop_codon for any of the exons, i.e. whether the end of the stop codon corresponds to the end of an exon.

For example, the transcript with ID=PCA_SD_23676, is not excluded since exon6:

000673F    EVM    exon    11139    11183    .    +    .    ID=evm.model.000673F.3.exon6;Parent=PCA_SD_23676-T1

Corresponds to the end of the stop_codon:

000673F    EVM    stop_codon    11181    11183    .    +    .    ID=PCA_SD_23676-T1:stop;Parent=PCA_SD_23676-T1

The entry that you listed fails since the stop_codon's end location does not equal the end position of any of the exons.

It seems that if you want more transcripts to be included, the stop_codon would have to correspond in the way just described to an exon.

Please message me back if you need further elaboration.


ADD COMMENTlink modified 29 days ago • written 29 days ago by daniel.vantwisk30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 146 users visited in the last hour