Error using DRIMSeq with Salmon count data
1
0
Entering edit mode
Charlotte • 0
@charlotte-23646
Last seen 4.3 years ago

Hello,

I am using my data to follow the tutorial - Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification.

When I run:

all(rownames(cts) %in% txdf$TXNAME)

It gives FALSE rather than TRUE.

I have checked that all row names are unique in my salmon quantification files by running:

dups <- which(duplicated(Dem1rep1.quant)) unique(Dem1rep1.quant[dups]) length(dups)

This gave zero duplicates for all of the quantification files.

What could be causing my error? If I ignore the error and continue with the workflow I get this error when running:

d <- dmDSdata(counts=counts, samples=samps) Error in dsDSdata(counts=counts, samples=samps) : sum(duplicated(counts$feature_id)) == 0 is not TRUE

Any help would be greatly appreciated

DRIMSeq tximport txdb Salmon • 1.7k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 21 hours ago
United States

What FASTA files are you using for quantification? If it’s human, mouse, or fruit fly you can skip a lot of these issues by letting tximeta discover and download the correct annotation for you.

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007664

ADD COMMENT
0
Entering edit mode

I generate the salmon.quant files by running https://github.com/nanoporetech/onttutorialtranscriptome/. I work with a non-model organism but it has a transcriptome and annotation file.

ADD REPLY
0
Entering edit mode

Ok so then you have to work this out, can’t use tximeta.

The next step would be to figure out why txdf doesn’t match your counts. What do these IDs looks like (eg try using head()).

ADD REPLY
0
Entering edit mode

There are duplicate IDs in the txdf GENEID but I assume this is to be expected. The TXNAMEs look like they are unique..

     head(txdf)
     GENEID                   TXNAME                ntx    
1    evm.TU.Herato0101.1   evm.model.Herato0101.1   1
136  evm.TU.Herato0101.2   evm.model.Herato0101.2   2
137  evm.TU.Herato0101.2   evm.model.Herato0101.2.1 2
301  evm.TU.Herato0101.3   evm.model.Herato0101.3   1
445  evm.TU.Herato0101.4   evm.model.Herato0101.4   1
600  evm.TU.Herato0101.5   evm.model.Herato0101.5   1


head(cts)
Dem1_rep1 Dem1_rep2 Hyd2_rep1 Hyd2_rep2 
evm.model.Herato0101.1    14.883894 11.766183  5.740663 10.454824
evm.model.Herato0101.2     7.038967  6.905068  3.948951  9.725248
evm.model.Herato0101.2.1   5.406933  6.981437  3.244008  8.593378
evm.model.Herato0101.3    25.221528 20.935325  3.891139 29.312322
evm.model.Herato0101.4   102.000548 52.240938 43.526470 89.247897
evm.model.Herato0101.5    17.301869 14.667115  7.383188 13.493186

I made the txdf by following the tutorials instructions:

library(GenomicFeatures) gtf <- "Heliconiuseratodemophoonv1.gtf" txdb.filename <- "Heliconiuseratodemophoonv1_gtf.sqlite" txdb <- makeTxDbFromGFF(gtf) saveDb(txdb, txdb.filename)

txdb <- loadDb(txdb.filename) txdf <- select(txdb, keys(txdb, "GENEID"), "TXNAME", "GENEID") tab <- table(txdf$GENEID) txdf$ntx <- tab[match(txdf$GENEID, names(tab))]

I loaded the same gtf file into R that was used to create the Salmon counts so I am unsure why there would be duplicates in txdf but not in cts.

For reference, the head of the gtf file looks like this:

Herato0101  GenomeHubs  exon    6012    6592    .   -   .   transcript_id "evm.model.Herato0101.1"; gene_id "evm.TU.Herato0101.1";
Herato0101  GenomeHubs  exon    6681    6919    .   -   .   transcript_id "evm.model.Herato0101.1"; gene_id "evm.TU.Herato0101.1";
Herato0101  GenomeHubs  exon    7684    8066    .   -   .   transcript_id "evm.model.Herato0101.1"; gene_id "evm.TU.Herato0101.1";
Herato0101  GenomeHubs  exon    8239    8832    .   -   .   transcript_id "evm.model.Herato0101.1"; gene_id "evm.TU.Herato0101.1";
Herato0101  GenomeHubs  exon    9133    9164    .   -   .   transcript_id "evm.model.Herato0101.1"; gene_id "evm.TU.Herato0101.1";
Herato0101  GenomeHubs  CDS 6467    6592    .   -   0   transcript_id "evm.model.Herato0101.1"; gene_id "evm.TU.Herato0101.1";
Herato0101  GenomeHubs  CDS 6681    6919    .   -   2   transcript_id "evm.model.Herato0101.1"; gene_id "evm.TU.Herato0101.1";
Herato0101  GenomeHubs  CDS 7684    8066    .   -   1   transcript_id "evm.model.Herato0101.1"; gene_id "evm.TU.Herato0101.1";
Herato0101  GenomeHubs  CDS 8239    8832    .   -   1   transcript_id "evm.model.Herato0101.1"; gene_id "evm.TU.Herato0101.1";
Herato0101  GenomeHubs  CDS 9133    9164    .   -   0   transcript_id "evm.model.Herato0101.1"; gene_id "evm.TU.Herato0101.1";
ADD REPLY
0
Entering edit mode

Not sure why the GTF and FASTA are inconsistent, but in the end you’ll have to remove transcripts that have no gene in txdf and to manually deal with duplicates.

ADD REPLY

Login before adding your answer.

Traffic: 629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6