Entering edit mode
I am trying to use genomicFeatures as part of a epi2me Labs Differential Gene Expression pipeline using a bacterial genome and GTF annotation files downloaded from GenBank. When run, the following error message is generated:
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .make_splicings(exons, cds, stop_codons) :
some CDS cannot be mapped to an exon
Calls: makeTxDbFromGFF -> makeTxDbFromGRanges -> .make_splicings
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.
Execution halted
[Sun May 22 22:23:50 2022]
Error in rule de_analysis:
jobid: 0
output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
shell:
epi2melabs/differential-expression/pipeline-transcriptome-de/scripts/de_analysis.R
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
A few lines of the gtf annotation file are included here:
#gtf-version 2.2
#!genome-build ASM222426v1
#!genome-build-accession NCBI_Assembly:GCA_002224265.1
#!annotation-date 10/06/2015 12:48:27
#!annotation-source NCBI
CP012881.1 Genbank gene 479 742 . - . gene_id "AOT11_07205"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "AOT11_07205";
CP012881.1 GeneMarkS+ CDS 482 742 . - 0 gene_id "AOT11_07205"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07205"; product "hypothetical protein"; protein_id "ASM95055.1"; transl_table "11"; exon_number "1";
CP012881.1 GeneMarkS+ start_codon 740 742 . - 0 gene_id "AOT11_07205"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07205"; product "hypothetical protein"; protein_id "ASM95055.1"; transl_table "11"; exon_number "1";
CP012881.1 GeneMarkS+ stop_codon 479 481 . - 0 gene_id "AOT11_07205"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07205"; product "hypothetical protein"; protein_id "ASM95055.1"; transl_table "11"; exon_number "1";
CP012881.1 Genbank gene 938 1585 . - . gene_id "AOT11_07210"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "AOT11_07210";
CP012881.1 GeneMarkS+ CDS 941 1585 . - 0 gene_id "AOT11_07210"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07210"; product "hypothetical protein"; protein_id "ASM95056.1"; transl_table "11"; exon_number "1";
CP012881.1 GeneMarkS+ start_codon 1583 1585 . - 0 gene_id "AOT11_07210"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07210"; product "hypothetical protein"; protein_id "ASM95056.1"; transl_table "11"; exon_number "1";
CP012881.1 GeneMarkS+ stop_codon 938 940 . - 0 gene_id "AOT11_07210"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07210"; product "hypothetical protein"; protein_id "ASM95056.1"; transl_table "11"; exon_number "1";
CP012881.1 Genbank gene 1554 1868 . - . gene_id "AOT11_07215"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "AOT11_07215";
CP012881.1 GeneMarkS+ CDS 1557 1868 . - 0 gene_id "AOT11_07215"; transcript_id "unassigned_transcript_3"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07215"; product "hypothetical protein"; protein_id "ASM95057.1"; transl_table "11"; exon_number "1";
Any suggestions on how to clear this error will be greatly appreciated!
Thanks for the advice! I need to use this particular bacterial strain, but I did locate a more recent annotation for it and that cleared the previous errors in my pipeline. However, I am now getting another error that I think is more R related.
Now when I run the following command:
The following error message is obtained:
Warning message: package ‘DRIMSeq’ was built under R version 4.1.1 Warning messages:
1: package ‘GenomicFeatures’ was built under R version 4.1.1
2: package ‘BiocGenerics’ was built under R version 4.1.1
3: package ‘S4Vectors’ was built under R version 4.1.2
4: package ‘IRanges’ was built under R version 4.1.2
5: package ‘GenomeInfoDb’ was built under R version 4.1.1
6: package ‘GenomicRanges’ was built under R version 4.1.2
7: package ‘AnnotationDbi’ was built under R version 4.1.1
8: package ‘Biobase’ was built under R version 4.1.2
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, : the transcript names ("tx_name" column in the TxDb object) imported from the "Name" attribute are not unique
2: In makeTxDbFromGRanges(gr, metadata = metadata) : The following transcripts were dropped because their exon ranks could not be inferred (either because the exons are not on the same chromosome/strand or because they are not separated by introns): gene-AOT11_RS02065, gene-AOT11_RS21865
'select()' returned 1:1 mapping between keys and columns
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 3, 2
Calls: strip_version ... as.data.frame -> as.data.frame.list -> do.call -> <Anonymous>
Execution halted
[Wed May 25 12:02:51 2022]
Error in rule de_analysis:
jobid: 0
output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
shell:
/epi2melabs/differential-expression/pipeline-transcriptome-de/scripts/de_analysis.R (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message
It's not clear from this what the problem is. Rather than running under
snakemake
you might run the R code directly. I mean it's obvious that at some point you are callingas.data.frame
with alist
object as input, and the list object has differing lengths, but you might be doing that directly, or it might be part of some function you are calling. Without knowing that, there's no way to figure out what the problem is.And also please don't ignore these warnings:
1: package ‘GenomicFeatures’ was built under R version 4.1.1
2: package ‘BiocGenerics’ was built under R version 4.1.1
3: package ‘S4Vectors’ was built under R version 4.1.2
etc...
They suggest that you're using a version of Bioconductor that doesn't match your version of R, which can only be a source of trouble.
As Jim suggested, it's going to be easier to troubleshoot this in an interactive R session (make sure you start the same R that is used by snakemake). When you are at the R command prompt, run
BiocManager::valid()
before anything else.H.