error with: .make_splicings <- function(exons, cds, stop_codons=NULL)
1
0
Entering edit mode
rsnorman • 0
@e7235ccc
Last seen 22 months ago
United States

I am trying to use genomicFeatures as part of a epi2me Labs Differential Gene Expression pipeline using a bacterial genome and GTF annotation files downloaded from GenBank. When run, the following error message is generated:


Import genomic features from the file as a GRanges object ... OK

Prepare the 'metadata' data frame ... OK

Make the TxDb object ... Error in .make_splicings(exons, cds, stop_codons) : 
  some CDS cannot be mapped to an exon

Calls: makeTxDbFromGFF -> makeTxDbFromGRanges -> .make_splicings

In addition: Warning message:

In .get_cds_IDX(mcols0$type, mcols0$phase) :

  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.

Execution halted

[Sun May 22 22:23:50 2022]

Error in rule de_analysis:
    jobid: 0
    output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
    shell:

epi2melabs/differential-expression/pipeline-transcriptome-de/scripts/de_analysis.R

(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

A few lines of the gtf annotation file are included here:

#gtf-version 2.2
#!genome-build ASM222426v1
#!genome-build-accession NCBI_Assembly:GCA_002224265.1
#!annotation-date 10/06/2015 12:48:27
#!annotation-source NCBI

CP012881.1  Genbank gene    479 742 .   -   .   gene_id "AOT11_07205"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "AOT11_07205"; 
CP012881.1  GeneMarkS+  CDS 482 742 .   -   0   gene_id "AOT11_07205"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07205"; product "hypothetical protein"; protein_id "ASM95055.1"; transl_table "11"; exon_number "1"; 
CP012881.1  GeneMarkS+  start_codon 740 742 .   -   0   gene_id "AOT11_07205"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07205"; product "hypothetical protein"; protein_id "ASM95055.1"; transl_table "11"; exon_number "1"; 
CP012881.1  GeneMarkS+  stop_codon  479 481 .   -   0   gene_id "AOT11_07205"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07205"; product "hypothetical protein"; protein_id "ASM95055.1"; transl_table "11"; exon_number "1"; 
CP012881.1  Genbank gene    938 1585    .   -   .   gene_id "AOT11_07210"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "AOT11_07210"; 
CP012881.1  GeneMarkS+  CDS 941 1585    .   -   0   gene_id "AOT11_07210"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07210"; product "hypothetical protein"; protein_id "ASM95056.1"; transl_table "11"; exon_number "1"; 
CP012881.1  GeneMarkS+  start_codon 1583    1585    .   -   0   gene_id "AOT11_07210"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07210"; product "hypothetical protein"; protein_id "ASM95056.1"; transl_table "11"; exon_number "1"; 
CP012881.1  GeneMarkS+  stop_codon  938 940 .   -   0   gene_id "AOT11_07210"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07210"; product "hypothetical protein"; protein_id "ASM95056.1"; transl_table "11"; exon_number "1"; 
CP012881.1  Genbank gene    1554    1868    .   -   .   gene_id "AOT11_07215"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "AOT11_07215"; 
CP012881.1  GeneMarkS+  CDS 1557    1868    .   -   0   gene_id "AOT11_07215"; transcript_id "unassigned_transcript_3"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07215"; product "hypothetical protein"; protein_id "ASM95057.1"; transl_table "11"; exon_number "1";

Any suggestions on how to clear this error will be greatly appreciated!

GenomicFeatures • 1.9k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 hours ago
United States

That's a very old build, and an old version of the GFF standard. You would be better off getting something more recent.

> txdb <- makeTxDbFromGFF("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/204/915/GCF_002204915.1_ASM220491v1/GCF_002204915.1_ASM220491v1_genomic.gff.gz")
Import genomic features from the file as a GRanges object ... trying URL 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/204/915/GCF_002204915.1_ASM220491v1/GCF_002204915.1_ASM220491v1_genomic.gff.gz'
Content type 'application/x-gzip' length 407136 bytes (397 KB)
downloaded 397 KB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  the transcript names ("tx_name" column in the TxDb object) imported
  from the "Name" attribute are not unique
2: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could
  not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns):
  gene-FORC37_RS07565

> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/204/915/GCF_002204915.1_ASM220491v1/GCF_002204915.1_ASM220491v1_genomic.gff.gz
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 4709
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-05-24 12:48:11 -0400 (Tue, 24 May 2022)
# GenomicFeatures version at creation time: 1.48.1
# RSQLite version at creation time: 2.2.14
# DBSCHEMAVERSION: 1.2
> transcripts(txdb)
GRanges object with 4709 ranges and 2 metadata columns:
              seqnames      ranges strand |     tx_id        tx_name
                 <Rle>   <IRanges>  <Rle> | <integer>    <character>
     [1] NZ_CP016321.1   1399-2499      + |         1           dnaN
     [2] NZ_CP016321.1   2509-3588      + |         2           recF
     [3] NZ_CP016321.1   3607-6024      + |         3           gyrB
     [4] NZ_CP016321.1   6447-6881      + |         4 FORC37_RS00025
     [5] NZ_CP016321.1 10698-11060      + |         5 FORC37_RS00040
     ...           ...         ...    ... .       ...            ...
  [4705] NZ_CP016323.1 12430-12579      - |      4705 FORC37_RS24350
  [4706] NZ_CP016323.1 21223-22191      - |      4706 FORC37_RS23960
  [4707] NZ_CP016323.1 29490-30458      - |      4707 FORC37_RS24060
  [4708] NZ_CP016323.1 32558-33526      - |      4708 FORC37_RS24075
  [4709] NZ_CP016323.1 38662-39629      - |      4709 FORC37_RS24145
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

OR if you like EBI/EMBL better

> txdb2 <- makeTxDbFromGFF("http://ftp.ensemblgenomes.org/pub/bacteria/release-53/gff3/bacteria_73_collection/vibrio_vulnificus_gca_004214765/Vibrio_vulnificus_gca_004214765.ASM421476v1.49.gff3.gz")
Import genomic features from the file as a GRanges object ... trying URL 'http://ftp.ensemblgenomes.org/pub/bacteria/release-53/gff3/bacteria_73_collection/vibrio_vulnificus_gca_004214765/Vibrio_vulnificus_gca_004214765.ASM421476v1.49.gff3.gz'
downloaded 352 KB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
> transcripts(txdb2)
GRanges object with 4985 ranges and 2 metadata columns:
                       seqnames        ranges strand |     tx_id     tx_name
                          <Rle>     <IRanges>  <Rle> | <integer> <character>
     [1] NODE_10_length_16019..     1983-2492      + |         1    RZQ00879
     [2] NODE_10_length_16019..     2812-5349      + |         2    RZQ00880
     [3] NODE_10_length_16019..     7162-8211      + |         3    RZQ00883
     [4] NODE_10_length_16019..    8631-10067      + |         4    RZQ00884
     [5] NODE_10_length_16019..   10752-11360      + |         5    RZQ00885
     ...                    ...           ...    ... .       ...         ...
  [4981] NODE_9_length_176042.. 166011-166829      - |      4981    RZQ01380
  [4982] NODE_9_length_176042.. 166950-167969      - |      4982    RZQ01381
  [4983] NODE_9_length_176042.. 174534-175073      - |      4983    RZQ01387
  [4984] NODE_9_length_176042.. 175076-175396      - |      4984    RZQ01388
  [4985] NODE_9_length_176042.. 175398-175982      - |      4985    RZQ01389
  -------
  seqinfo: 82 sequences from an unspecified genome; no seqlengths

There are like 8 or so different versions at Ensembl bacteria that you could choose from.

ADD COMMENT
0
Entering edit mode

Thanks for the advice! I need to use this particular bacterial strain, but I did locate a more recent annotation for it and that cleared the previous errors in my pipeline. However, I am now getting another error that I think is more R related.

Now when I run the following command:

!snakemake --snakefile pipeline-transcriptome-de/Snakefile \
    --configfile config.yaml -p -j 4 \
    --until "de_analysis" > /dev/null

The following error message is obtained:

Warning message: package ‘DRIMSeq’ was built under R version 4.1.1 Warning messages:

1: package ‘GenomicFeatures’ was built under R version 4.1.1

2: package ‘BiocGenerics’ was built under R version 4.1.1

3: package ‘S4Vectors’ was built under R version 4.1.2

4: package ‘IRanges’ was built under R version 4.1.2

5: package ‘GenomeInfoDb’ was built under R version 4.1.1

6: package ‘GenomicRanges’ was built under R version 4.1.2

7: package ‘AnnotationDbi’ was built under R version 4.1.1

8: package ‘Biobase’ was built under R version 4.1.2

Import genomic features from the file as a GRanges object ... OK

Prepare the 'metadata' data frame ... OK

Make the TxDb object ... OK

Warning messages:

1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, : the transcript names ("tx_name" column in the TxDb object) imported from the "Name" attribute are not unique

2: In makeTxDbFromGRanges(gr, metadata = metadata) : The following transcripts were dropped because their exon ranks could not be inferred (either because the exons are not on the same chromosome/strand or because they are not separated by introns): gene-AOT11_RS02065, gene-AOT11_RS21865

'select()' returned 1:1 mapping between keys and columns

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 3, 2

Calls: strip_version ... as.data.frame -> as.data.frame.list -> do.call -> <Anonymous>

Execution halted

[Wed May 25 12:02:51 2022]

Error in rule de_analysis:

jobid: 0

output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv

shell:

/epi2melabs/differential-expression/pipeline-transcriptome-de/scripts/de_analysis.R (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message

ADD REPLY
0
Entering edit mode

It's not clear from this what the problem is. Rather than running under snakemake you might run the R code directly. I mean it's obvious that at some point you are calling as.data.frame with a list object as input, and the list object has differing lengths, but you might be doing that directly, or it might be part of some function you are calling. Without knowing that, there's no way to figure out what the problem is.

ADD REPLY
0
Entering edit mode

And also please don't ignore these warnings:

1: package ‘GenomicFeatures’ was built under R version 4.1.1

2: package ‘BiocGenerics’ was built under R version 4.1.1

3: package ‘S4Vectors’ was built under R version 4.1.2

etc...

They suggest that you're using a version of Bioconductor that doesn't match your version of R, which can only be a source of trouble.

As Jim suggested, it's going to be easier to troubleshoot this in an interactive R session (make sure you start the same R that is used by snakemake). When you are at the R command prompt, run BiocManager::valid() before anything else.

H.

ADD REPLY

Login before adding your answer.

Traffic: 886 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6