makeTxDbFromGFF error "stop codons that cannot be mapped to an exon"
1
0
Entering edit mode
@marisaemiller-13344
Last seen 5.5 years ago

Hello,

I am trying to use makeTxDbFromGFF to import my GFF3 file as a TxDb. Below is the code used and the links to the relevant files:

library(GenomicFeatures)
chromlen <- read.delim("Puccinia_coronata_avenae_12SD80.primary.chromlen", header = FALSE)
colnames(chromlen) <- c("chrom", "length")
chromlen$is_circular <- FALSE

txdb <- makeTxDbFromGFF(file="haus_expr_secreted_effectors_sd80_p.full.gff3",
                        chrominfo=chromlen,
                        dataSource="gff3 file of haustorial effectors from 12SD80",
                        organism="Puccinia coronata")

https://s3.msi.umn.edu/txdb_troubleshooting/haus_expr_secreted_effectors_sd80_p.full.gff3

https://s3.msi.umn.edu/txdb_troubleshooting/Puccinia_coronata_avenae_12SD80.primary.chromlen

Most transcripts are successfully imported, but there are a number that aren't. The warning make TxDbFromGFF throws is "Warning message: In .reject_transcripts(bad_tx, because) : The following transcripts were rejected because they have stop codons that cannot be mapped to an exon." 

 

Here's an example of what one of these offending transcripts looks like:

000335F EVM     gene    37334   38135   .       -       .       ID=PCA_SD_18895

000335F EVM     mRNA    37334   38135   .       -       .       ID=PCA_SD_18895-T1;Parent=PCA_SD_18895;note=SECRETED:SignalP(1-28)

000335F EVM     exon    37334   37402   .       -       .       ID=evm.model.000335F.12.exon5;Parent=PCA_SD_18895-T1

000335F EVM     exon    37486   37561   .       -       .       ID=evm.model.000335F.12.exon4;Parent=PCA_SD_18895-T1

000335F EVM     exon    37661   37692   .       -       .       ID=evm.model.000335F.12.exon3;Parent=PCA_SD_18895-T1

000335F EVM     exon    37790   37874   .       -       .       ID=evm.model.000335F.12.exon2;Parent=PCA_SD_18895-T1

000335F EVM     exon    37966   38135   .       -       .       ID=evm.model.000335F.12.exon1;Parent=PCA_SD_18895-T1

000335F EVM     CDS     37334   37402   .       -       0       ID=cds.evm.model.000335F.12;Parent=PCA_SD_18895-T1

000335F EVM     CDS     37486   37561   .       -       1       ID=cds.evm.model.000335F.12;Parent=PCA_SD_18895-T1

000335F EVM     CDS     37661   37692   .       -       0       ID=cds.evm.model.000335F.12;Parent=PCA_SD_18895-T1

000335F EVM     CDS     37790   37874   .       -       1       ID=cds.evm.model.000335F.12;Parent=PCA_SD_18895-T1

000335F EVM     CDS     37966   38135   .       -       0       ID=cds.evm.model.000335F.12;Parent=PCA_SD_18895-T1

000335F EVM     start_codon     37400   37402   .       +       .       ID=PCA_SD_18895-T1:start;Parent=PCA_SD_18895-T1

000335F EVM     stop_codon      37966   37968   .       +       .       ID=PCA_SD_18895-T1:stop;Parent=PCA_SD_18895-T1

 

As you can see, the stop codon is within the last exon, rather than the last exon ending on the stop codon. Is there anything I can do to get these transcripts successfully imported?

Thank you,

Marisa

 

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.13.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicFeatures_1.28.5 AnnotationDbi_1.38.2   Biobase_2.36.2         GenomicRanges_1.28.6   GenomeInfoDb_1.12.3    IRanges_2.10.5        
[7] S4Vectors_0.14.7       BiocGenerics_0.22.1   

loaded via a namespace (and not attached):
 [1] bit64_0.9-7                splines_3.4.0              Formula_1.2-3              assertthat_0.2.0           latticeExtra_0.6-28       
 [6] blob_1.1.1                 GenomeInfoDbData_0.99.0    Rsamtools_1.28.0           pillar_1.2.3               RSQLite_2.1.1             
[11] backports_1.1.2            lattice_0.20-35            glue_1.2.0                 digest_0.6.15              RColorBrewer_1.1-2        
[16] XVector_0.16.0             checkmate_1.8.5            colorspace_1.3-2           htmltools_0.3.6            Matrix_1.2-14             
[21] plyr_1.8.4                 XML_3.98-1.11              pkgconfig_2.0.1            biomaRt_2.32.1             zlibbioc_1.22.0           
[26] purrr_0.2.5                scales_0.5.0               BiocParallel_1.10.1        htmlTable_1.12             tibble_1.4.2              
[31] ggplot2_2.2.1              SummarizedExperiment_1.6.5 nnet_7.3-12                lazyeval_0.2.1             survival_2.42-3           
[36] magrittr_1.5               memoise_1.1.0              GGally_1.4.0               foreign_0.8-70             tools_3.4.0               
[41] data.table_1.11.4          matrixStats_0.53.1         stringr_1.3.1              munsell_0.4.3              DelayedArray_0.2.7        
[46] cluster_2.0.7-1            bindrcpp_0.2.2             Biostrings_2.44.2          compiler_3.4.0             rlang_0.2.1               
[51] grid_3.4.0                 RCurl_1.95-4.10            rstudioapi_0.7             htmlwidgets_1.2            bitops_1.0-6              
[56] base64enc_0.1-3            gtable_0.2.0               DBI_1.0.0                  reshape_0.8.7              reshape2_1.4.3            
[61] R6_2.2.2                   GenomicAlignments_1.12.2   gridExtra_2.3              knitr_1.20                 dplyr_0.7.5               
[66] rtracklayer_1.36.6         bit_1.1-14                 bindr_0.1.1                Hmisc_4.1-1                stringi_1.2.2             
[71] Rcpp_0.12.17               rpart_4.1-13               acepack_1.4.1              tidyselect_0.2.4
genomicfeatures txdb maketxdbfromgff • 1.3k views
ADD COMMENT
0
Entering edit mode
@danielvantwisk-13028
Last seen 3.8 years ago

I've been looking through this case, more particularly at the function .find_exon_cds() that is a level higher than .reject_transcripts(), the function that is throwing the warning.

In this function, it undergoes a process of finding overlaps between the exons and stop_codon sequences, in particular, checking whether the stop_codon serves as a stop_codon for any of the exons, i.e. whether the end of the stop codon corresponds to the end of an exon.

For example, the transcript with ID=PCA_SD_23676, is not excluded since exon6:

000673F    EVM    exon    11139    11183    .    +    .    ID=evm.model.000673F.3.exon6;Parent=PCA_SD_23676-T1

Corresponds to the end of the stop_codon:

000673F    EVM    stop_codon    11181    11183    .    +    .    ID=PCA_SD_23676-T1:stop;Parent=PCA_SD_23676-T1

The entry that you listed fails since the stop_codon's end location does not equal the end position of any of the exons.

It seems that if you want more transcripts to be included, the stop_codon would have to correspond in the way just described to an exon.

Please message me back if you need further elaboration.

 

ADD COMMENT

Login before adding your answer.

Traffic: 770 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6