Question

bambu output

0

Entering edit mode

camilia.savitri • 0

@6db15c42

Last seen 10 months ago

Japan

I am trying to use bambu with a nanopore cDNA sequencing dataset. I tried to run with bambu and have some questions :

Does the annotation file has to be in gtf? I have an annotation gff3 file. When I make into txDb, it can generate txdb, but, when I run the annotations command it output this way :


# create annotation
annotations <- prepareAnnotations("./refgenome/tn2.gtf")

#output :
transcript names are not unique,
            only one transcript per ID will be kept
Error in .local(x, ..., value = value) : 
  1587 rows in value to replace 1574rows

Hence, I tried to change into gtf format using gffread

When using the command se.Multisample, it showed :

Detected 6 warnings across the samples during read class construction. Access warnings with metadata(bambuOutput)$warnings
--- Start extending annotations ---
WARNING - Less than 50 TRUE or FALSE read classes for NDR precision stabilization.
NDR will be approximated as: (1 - Transcript Model Prediction Score)
A high NDR threshold is being recommended by Bambu indicating high levels of novel transcripts, limiting the performance of the trained model
We recommend training a new model on similiar but well annotated dataset if available (https://github.com/GoekeLab/bambu/tree/master#Training-a-model-on-another-speciesdataset-and-applying-it), or alternatively running Bambu with opt.discovery=list(fitReadClassModel=FALSE)
Using a novel discovery rate (NDR) of: 1
--- Start isoform quantification ---
--- Finished running Bambu ---

However, when I checked the output, it only has the RNA annotation (no CDS or else).

> head(se.multiSample)
class: RangedSummarizedExperiment 
dim: 6 3 
metadata(2): incompatibleCounts warnings
assays(4): counts CPM fullLengthCounts uniqueCounts
rownames(6): BambuTx1 rna-DO80_r001 ... rna-DO80_r004
  rna-DO80_r005
rowData names(11): TXNAME GENEID ... txid eqClassById
colnames(3): b04_sorted b05_sorted b06_sorted
colData names(1): name

I don't know where the problem is, is it because of my annotation file? What should I run differently? Thanks!

txdb bambu • 695 views

ADD COMMENT • link updated 10 months ago by andre_sim ▴ 20 • written 11 months ago by camilia.savitri • 0

score 0 · Answer 1 · 2023-06-23

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 4 days ago

United States

When you ran se.Multisample, it output this statement (copy/pasted from your post):

NDR will be approximated as: (1 - Transcript Model Prediction Score)
A high NDR threshold is being recommended by Bambu indicating high levels of novel transcripts, limiting the performance of the trained model
We recommend training a new model on similiar but well annotated dataset if available 
(https://github.com/GoekeLab/bambu/tree/master#Training-a-model-on-another-speciesdataset-and-applying-it), 
or alternatively running Bambu with opt.discovery=list(fitReadClassModel=FALSE)

Which appears to diagnose the problem and suggest some alternatives. Did you notice that? If so, what have you tried?

ADD COMMENT • link 11 months ago James W. MacDonald 65k

0

Entering edit mode

Even if I input for counting only (no discovery) :

    se.multiSample <- bambu(reads = c("a.bam", "b.bam"), 
                            annotations = annotations, 
                            genome = "./refgenome/sequence.fasta",
                            discovery = FALSE)

the output is still the same :

--- Start generating read class files ---
'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.rstudio.com/
  |============================================================================| 100%

Detected 6 warnings across the samples during read class construction. Access warnings with metadata(bambuOutput)$warnings
--- Start isoform quantification ---
  |============================================================================| 100%

--- Finished running Bambu ---

> assays(se.multiSample)$counts
              b04_sorted b05_sorted b06_sorted
rna-DO80_r001         18          1         26
rna-DO80_r002     317499      50659     581984
rna-DO80_r003     124983      20761     235190

it wont output other transcript counts (only rna coded ones)

Here is my txdb command :

> txdb <- makeTxDbFromGFF("./refgenome/tn2_new.gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  the transcript names ("tx_name" column in the TxDb object)
  imported from the "Name" attribute are not unique

I checked but it looks OK, I don't know what's wrong.. and when I do the prepareannotations from bambu :

> annotations <- prepareAnnotations(txdb)
transcript names are not unique,
            only one transcript per ID will be kept
Error in .local(x, ..., value = value) : 
  1552 rows in value to replace 1546rows

any idea??

ADD REPLY • link 10 months ago camilia.savitri • 0

1

Entering edit mode

Hi,

Bambu doesn't output or use CDS because it does not do ORF prediction and focuses on outputing the exon content of transcripts. If you need the CDS sequences you could use a downstream ORF prediction tool on the output.

How many transcript are you trying to quantify? (Or better said, how many transcripts are in your gff3 file before you convert it)? Are you transcripts you want to quantify in your coverted gtf file? It could be that converting your gff3 to gtf has not provided appropriate exon lines that bambu expects in the gtf file. Could you provide an example of some of the lines from the gff3 and converted gtf file for me to have a look at?

ADD REPLY • link 10 months ago andre_sim ▴ 20

0

Entering edit mode

There are about 1500 annotated transcripts. Here I will show the gff3 :

##gff-version 3
##sequence-region AP019730.1 1 1604952
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=210
AP019730.1  DDBJ    region  1   1604952 .   +   .   ID=AP019730.1:1..1604952;Dbxref=taxon:210;Is_circular=true;collection-date=2010-04-01;country=Japan:Oita;gbkey=Src;mol_type=genomic DNA;nat-host=Homo sapiens;strain=TN2wt
AP019730.1  DDBJ    gene    11  427 .   -   .   ID=gene-DO80_00010;Name=nusB;gbkey=Gene;gene=nusB;gene_biotype=protein_coding;locus_tag=DO80_00010
AP019730.1  DDBJ    CDS 11  427 .   -   0   ID=cds-BBK68945.1;Dbxref=NCBI_GP:BBK68945.1;Name=BBK68945.1;Parent=gene-DO80_00010;gbkey=CDS;gene=nusB;locus_tag=DO80_00010;product=transcription antitermination protein NusB;protein_id=BBK68945.1;transl_table=11
AP019730.1  DDBJ    gene    429 899 .   -   .   ID=gene-DO80_00020;Name=ribH;gbkey=Gene;gene=ribH;gene_biotype=protein_coding;locus_tag=DO80_00020
AP019730.1  DDBJ    CDS 429 899 .   -   0   ID=cds-BBK68946.1;Dbxref=NCBI_GP:BBK68946.1;Name=BBK68946.1;Parent=gene-DO80_00020;gbkey=CDS;gene=ribH;locus_tag=DO80_00020;product=6%2C7-dimethyl-8-ribityllumazine synthase;protein_id=BBK68946.1;transl_table=11
...
AP019730.1  DDBJ    gene    1602475 1603263 .   +   .   ID=gene-DO80_15430;Name=flgG;gbkey=Gene;gene=flgG;gene_biotype=protein_coding;locus_tag=DO80_15430
AP019730.1  DDBJ    CDS 1602475 1603263 .   +   0   ID=cds-BBK70487.1;Dbxref=NCBI_GP:BBK70487.1;Name=BBK70487.1;Parent=gene-DO80_15430;gbkey=CDS;gene=flgG;locus_tag=DO80_15430;product=flagellar basal body rod protein FlgG;protein_id=BBK70487.1;transl_table=11
AP019730.1  DDBJ    gene    1603722 1603862 .   -   .   ID=gene-DO80_15440;Name=DO80_15440;gbkey=Gene;gene_biotype=protein_coding;locus_tag=DO80_15440
AP019730.1  DDBJ    CDS 1603722 1603862 .   -   0   ID=cds-BBK70488.1;Dbxref=NCBI_GP:BBK70488.1;Name=BBK70488.1;Parent=gene-DO80_15440;gbkey=CDS;locus_tag=DO80_15440;product=flagellar basal body rod protein FlgG;protein_id=BBK70488.1;transl_table=11
AP019730.1  DDBJ    gene    1603946 1604707 .   -   .   ID=gene-DO80_15450;Name=DO80_15450;gbkey=Gene;gene_biotype=protein_coding;locus_tag=DO80_15450
AP019730.1  DDBJ    CDS 1603946 1604707 .   -   0   ID=cds-BBK70489.1;Dbxref=NCBI_GP:BBK70489.1;Name=BBK70489.1;Parent=gene-DO80_15450;gbkey=CDS;locus_tag=DO80_15450;product=hypothetical protein;protein_id=BBK70489.1;transl_table=11

and here is the gtf file :

AP019730.1  DDBJ    transcript  11  427 .   -   .   transcript_id "gene-DO80_00010"; gene_id "gene-DO80_00010"; gene_name "nusB"
AP019730.1  DDBJ    CDS 11  427 .   -   0   transcript_id "gene-DO80_00010"; gene_name "nusB";
AP019730.1  DDBJ    transcript  429 899 .   -   .   transcript_id "gene-DO80_00020"; gene_id "gene-DO80_00020"; gene_name "ribH"
AP019730.1  DDBJ    CDS 429 899 .   -   0   transcript_id "gene-DO80_00020"; gene_name "ribH";
...
AP019730.1  DDBJ    transcript  1602037 1602363 .   +   .   transcript_id "gene-DO80_15420"; gene_id "gene-DO80_15420"
AP019730.1  DDBJ    CDS 1602037 1602363 .   +   0   transcript_id "gene-DO80_15420";
AP019730.1  DDBJ    transcript  1602475 1603263 .   +   .   transcript_id "gene-DO80_15430"; gene_id "gene-DO80_15430"; gene_name "flgG"
AP019730.1  DDBJ    CDS 1602475 1603263 .   +   0   transcript_id "gene-DO80_15430"; gene_name "flgG";
AP019730.1  DDBJ    transcript  1603722 1603862 .   -   .   transcript_id "gene-DO80_15440"; gene_id "gene-DO80_15440"
AP019730.1  DDBJ    CDS 1603722 1603862 .   -   0   transcript_id "gene-DO80_15440";
AP019730.1  DDBJ    transcript  1603946 1604707 .   -   .   transcript_id "gene-DO80_15450"; gene_id "gene-DO80_15450"
AP019730.1  DDBJ    CDS 1603946 1604707 .   -   0   transcript_id "gene-DO80_15450";

Even if I tried to make txdb from GFF (as the said above) also outputs only rna

ADD REPLY • link 10 months ago camilia.savitri • 0

1

Entering edit mode

Hi,

Ah I think the issue is the gtf file is missing the exon line. Here is an excerpt from the toy gtf file that comes with bambu.

9   havana  gene    12134   13783   .   +   .   gene_id "ENSG00000236875"; gene_version "3"; gene_name "DDX11L5"; gene_source "havana"; gene_biotype "unprocessed_pseudogene";
9   havana  transcript  12134   13783   .   +   .   gene_id "ENSG00000236875"; gene_version "3"; transcript_id "ENST00000421620"; transcript_version "2"; gene_name "DDX11L5"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L5-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; tag "basic"; transcript_support_level "NA";
9   havana  exon    12134   12190   .   +   .   gene_id "ENSG00000236875"; gene_version "3"; transcript_id "ENST00000421620"; transcript_version "2"; exon_number "1"; gene_name "DDX11L5"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L5-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; exon_id "ENSE00001680583"; exon_version "2"; tag "basic"; transcript_support_level "NA";
9   havana  exon    12291   12340   .   +   .   gene_id "ENSG00000236875"; gene_version "3"; transcript_id "ENST00000421620"; transcript_version "2"; exon_number "2"; gene_name "DDX11L5"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L5-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; exon_id "ENSE00001759901"; exon_version "2"; tag "basic"; transcript_support_level "NA";
9   havana  exon    12726   12834   .   +   .   gene_id "ENSG00000236875"; gene_version "3"; transcript_id "ENST00000421620"; transcript_version "2"; exon_number "3"; gene_name "DDX11L5"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L5-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; exon_id "ENSE00002213115"; exon_version "2"; tag "basic"; transcript_support_level "NA";
9   havana  exon    13088   13157   .   +   .   gene_id "ENSG00000236875"; gene_version "3"; transcript_id "ENST00000421620"; transcript_version "2"; exon_number "4"; gene_name "DDX11L5"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L5-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; exon_id "ENSE00001667024"; exon_version "2"; tag "basic"; transcript_support_level "NA";
9   havana  exon    13338   13487   .   +   .   gene_id "ENSG00000236875"; gene_version "3"; transcript_id "ENST00000421620"; transcript_version "2"; exon_number "5"; gene_name "DDX11L5"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L5-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; exon_id "ENSE00001719076"; exon_version "2"; tag "basic"; transcript_support_level "NA";
9   havana  exon    13566   13783   .   +   .   gene_id "ENSG00000236875"; gene_version "3"; transcript_id "ENST00000421620"; transcript_version "2"; exon_number "6"; gene_name "DDX11L5"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L5-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; exon_id "ENSE00003729664"; exon_version "1"; tag "basic"; transcript_support_level "NA";
9   ensembl_havana  gene    14521   29739   .   -   .   gene_id "ENSG00000181404"; gene_version "15"; gene_name "WASHC1"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
9   ensembl_havana  transcript  14521   29739   .   -   .   gene_id "ENSG00000181404"; gene_version "15"; transcript_id "ENST00000442898"; transcript_version "4"; gene_name "WASHC1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "WASHC1-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS78375"; tag "basic"; transcript_support_level "2";

Without the exon line bambu does not know where the intron junctions are and therefore cannot quantify the transcripts. From the scaffold id you have in the gtf I notice this is for a bacterial species which would not have much splicing and likely why there is no exon lines in the gtf. Bambu has been evaluated and tested on mainly complexly spliced transcriptomes and there are a few limitations with bacteria genomes.

As a work around for your gtf problem you could try replacing "transcript" in the 3rd column of the GTF with "exon", I am not sure if that might introduce other issues though... If you get this to work and you want to do transcript discovery you need to set opt.discovery = list(min.txScore.singleExon=0) when running bambu, which turns on discovery for unspliced transcripts. But just to reiterate, Bambu hasn't been evaluated in a bacteria/microbial context.

Hope this helps.

ADD REPLY • link 10 months ago andre_sim ▴ 20