Hi all,
Salmon produces transcript counts and some include decimals. I noticed that if I import my quant files from salmon (244 files) with non-integer values and then run DESeq I detected 327 DE genes.
txi <- tximport(files, type = "salmon", tx2gene = txdf, ignoreTxVersion = T)
ddsTxi <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ RSV_NPA)
ddsTxi <- DESeq(ddsTxi)
res <- results(ddsTxi)
summary(res)
out of 18815 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 195, 1%
LFC < 0 (down) : 132, 0.7%
outliers [1] : 0, 0%
low counts [2] : 5868, 31%
(mean count < 7)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
But if I export the unnormalized counts of the just imported file (all integer numbers now) and then reload it as matrix and run DESeq I get only 45 DE genes
dds_counts <- as.data.frame(counts(ddsTxi, normalized=FALSE))
write.table(as.data.frame(dds_counts), file="dds_Ens.txt", sep="\t"
counts <-read.table("dds_Ens.txt",h=T,row.names = 1,sep='\t')
ddsTxi <- DESeqDataSetFromMatrix(countData =counts, colData = samples, design = ~RSV_NPA)
out of 18812 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 36, 0.19%
LFC < 0 (down) : 9, 0.048%
outliers [1] : 0, 0%
low counts [2] : 10611, 56%
(mean count < 46)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Interestingly DESeqDataSetFromMatrix does not allow for non-integer values in the matrix but DESeqDataSetFromTximport does.
Then I tested the tximport and salmon files in the DESeq tutorial below and observed the same issue if you repeat the steps above http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#how-to-get-help-for-deseq2
Hence could you explain this issue and suggest how to proceed?
Best
Marcos
Thanks for the clarification Michael. Hence, how could I share with colleagues a file with my imported salmon counts based on effective transcript lengths? I initially thought that exporting the DEseq counts would do, but that's clearly not the case. I could also share the salmon quant files, but that seems a bit cumbersome given the high number of file to share. Thanks again for your valuable input, it's greatly appreciated
For computational reproducibility, you definitely want to share the full Salmon output directories, either zipped or using
tar -czf quants.tgz ...
. This allows collaborators to later determine what transcript references were used (what source and what release). See our recent paper on computational reproducibility for RNA-seq:https://doi.org/10.1371/journal.pcbi.1007664
Or you could share the
txi
ordds
objects.If they don't want to use R and you just want to share the data, and it's ok if the metadata is lost (e.g. what version of Salmon was used, what options were used, what reference transcripts were used, ...), you should write out the three matrices stored in
txi
(the counts, the lengths, and the abundance) to plain text. Then later someone could recreate the analysis by reading those matrices in to R, creating a list (txi
) and providing this toDESeqDataSetFromTximport
.Excellent. Thanks for the reference!!