Hello,
I am following the tutorial "Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification" with my own data.
When I run:
all(rownames(cts) %in% txdf$TXNAME)
It gives FALSE rather than TRUE.
I have checked that all row names are unique in my salmon quantification files by running:
dups <- which(duplicated(Dem1rep1.quant))
unique(Dem1rep1.quant[dups])
length(dups)
This gave zero duplicates for all of the quantification files.
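One caveat on this check: if Dem1rep1.quant is a data frame, duplicated() compares whole rows, so it can report zero even when a transcript name repeats with different counts. A minimal sketch that tests the ID column directly, assuming the table is a Salmon quant.sf read into a data frame with a Name column (the toy values below are invented for illustration):

```r
# Toy stand-in for one quant.sf table (hypothetical values)
quant <- data.frame(Name = c("tx1", "tx2", "tx2"),
                    NumReads = c(10, 5, 7),
                    stringsAsFactors = FALSE)

# Row-wise check: no two rows are fully identical, so this is 0
n_dup_rows <- sum(duplicated(quant))

# Name-wise check: finds IDs that occur more than once
dup_names <- unique(quant$Name[duplicated(quant$Name)])

n_dup_rows  # 0
dup_names   # "tx2"
```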
What could be causing this? If I ignore it and continue with the workflow, I get this error when running:
d <- dmDSdata(counts=counts, samples=samps)
Error in dmDSdata(counts=counts, samples=samps) : sum(duplicated(counts$feature_id)) == 0 is not TRUE
Any help would be greatly appreciated.
I generated the salmon.quant files by running https://github.com/nanoporetech/onttutorialtranscriptome/. I work with a non-model organism, but it has a transcriptome and an annotation file.
OK, so then you'll have to work this out manually; you can't use tximeta.
The next step would be to figure out why txdf doesn't match your counts. What do these IDs look like (e.g. try using head())?
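A minimal sketch of that comparison, using toy stand-ins for cts and txdf (the IDs are invented; substitute your own objects). The toy data assumes the count IDs carry a suffix the annotation lacks, which is one common cause of such mismatches but is not confirmed for this dataset:

```r
# Toy stand-ins (hypothetical IDs); replace with your real cts and txdf
cts <- matrix(1:6, nrow = 3,
              dimnames = list(c("tx1.1", "tx2.1", "tx3.1"), c("s1", "s2")))
txdf <- data.frame(GENEID = c("g1", "g1", "g2"),
                   TXNAME = c("tx1", "tx2", "tx3"),
                   stringsAsFactors = FALSE)

# Look at both sets of IDs side by side
head(rownames(cts))
head(txdf$TXNAME)

# Which count rows have no match in the annotation, and how many?
missing_ids <- setdiff(rownames(cts), txdf$TXNAME)
length(missing_ids)
head(missing_ids)
```

Comparing the unmatched IDs against the annotation by eye usually shows whether the difference is systematic (a prefix or suffix) or whether some transcripts are simply absent from the GTF.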
There are duplicate IDs in the GENEID column of txdf, but I assume this is to be expected. The TXNAMEs look like they are unique.
I made the txdf by following the tutorial's instructions:
library(GenomicFeatures)
gtf <- "Heliconiuseratodemophoonv1.gtf"
txdb.filename <- "Heliconiuseratodemophoonv1_gtf.sqlite"
txdb <- makeTxDbFromGFF(gtf)
saveDb(txdb, txdb.filename)
txdb <- loadDb(txdb.filename)
txdf <- select(txdb, keys(txdb, "GENEID"), "TXNAME", "GENEID")
tab <- table(txdf$GENEID)
txdf$ntx <- tab[match(txdf$GENEID, names(tab))]
I loaded the same GTF file into R that was used to create the Salmon counts, so I am unsure why there would be duplicates in txdf but not in cts.
For reference, the head of the gtf file looks like this:
Not sure why the GTF and FASTA are inconsistent, but in the end you'll have to remove transcripts that have no gene in txdf and deal with the duplicates manually.
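A rough sketch of that cleanup in base R, using toy stand-ins for cts and txdf (the IDs are invented). Keeping the first occurrence of a duplicated TXNAME is an arbitrary choice made here for illustration; if a transcript maps to more than one gene, the duplicates should be inspected before any are dropped:

```r
# Toy stand-ins (hypothetical IDs); replace with your real cts and txdf
cts <- matrix(1:8, nrow = 4,
              dimnames = list(c("tx1", "tx2", "tx3", "tx9"), c("s1", "s2")))
txdf <- data.frame(GENEID = c("g1", "g1", "g2", "g2"),
                   TXNAME = c("tx1", "tx2", "tx3", "tx3"),
                   stringsAsFactors = FALSE)

# Drop repeated transcript rows from the annotation (keeps the first one)
txdf <- txdf[!duplicated(txdf$TXNAME), ]

# Keep only count rows whose transcript has a gene in txdf
cts <- cts[rownames(cts) %in% txdf$TXNAME, , drop = FALSE]

# Put the annotation in the same order as the count rows
txdf <- txdf[match(rownames(cts), txdf$TXNAME), ]

# Both checks should now pass
stopifnot(all(rownames(cts) %in% txdf$TXNAME))
stopifnot(!any(duplicated(txdf$TXNAME)))
```

After this, the counts data frame built from cts and txdf should no longer contain repeated feature_id values, which is the condition dmDSdata() was complaining about.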