Hello,
I run DESeq2 onto my RNAseq data contains 3 experimental conditions (group1, group2 and group3). When I run DESeq2 with groups 1 and 2 only, its identify 3239 genes as DE (p-value < 0.05). Here is output with inspection of one transcript, Omu_14977_c0_g2_i4, that is absent in the next run:
> data = read.table("/mnt/4tb/onchi_new/DE/uniq/RSEM.isoform.counts.matrix", header=T, row.names=1, com='')
> data['Omu_14977_c0_g2_i4',]
group1_rep2 group1_rep3 group2_rep1 group2_rep2 group3_rep1 group3_rep2 group3_rep3
Omu_14977_c0_g2_i4 49.76 53.1 12888.89 10298.2 19.62 31 785.84
> col_ordering = c(1,2,3,4)
> rnaseqMatrix = data[,col_ordering]
> rnaseqMatrix = round(rnaseqMatrix)
> conditions = data.frame(group=factor(c(rep("group1", 2), rep("group2", 2))))
> rownames(conditions) = colnames(rnaseqMatrix)
> conditions
group
group1_rep2 group1
group1_rep3 group1
group2_rep1 group2
group2_rep2 group2
> ddsFullCountTable <- DESeqDataSetFromMatrix(
+ countData = rnaseqMatrix,
+ colData = conditions,
+ design = ~ group)
converting counts to integer mode
> dds = DESeq(ddsFullCountTable)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> contrast=c("group","group2","group1")
> res = results(dds, contrast)
> sum(res$padj < 0.05 , na.rm = TRUE)
[1] 3239
> res_table = as.data.frame(res)
> res_table['Omu_14977_c0_g2_i4',]
baseMean log2FoldChange lfcSE stat pvalue padj
Omu_14977_c0_g2_i4 4386.558 7.074097 0.4003181 17.67119 6.990643e-70 2.067762e-65
But if we run DESeq with all 3 groups, number of DE genes will be 752 only, almost 4 times lower:
> data = read.table("/mnt/4tb/onchi_new/DE/uniq/RSEM.isoform.counts.matrix", header=T, row.names=1, com='')
> data['Omu_14977_c0_g2_i4',]
group1_rep2 group1_rep3 group2_rep1 group2_rep2 group3_rep1 group3_rep2 group3_rep3
Omu_14977_c0_g2_i4 49.76 53.1 12888.89 10298.2 19.62 31 785.84
> col_ordering = c(1,2,3,4,5,6,7)
> rnaseqMatrix = data[,col_ordering]
> rnaseqMatrix = round(rnaseqMatrix)
> conditions = data.frame(group=factor(c(rep("group1", 2), rep("group2", 2), rep("group3", 3))))
> rownames(conditions) = colnames(rnaseqMatrix)
> conditions
group
group1_rep2 group1
group1_rep3 group1
group2_rep1 group2
group2_rep2 group2
group3_rep1 group3
group3_rep2 group3
group3_rep3 group3
> ddsFullCountTable <- DESeqDataSetFromMatrix(
+ countData = rnaseqMatrix,
+ colData = conditions,
+ design = ~ group)
converting counts to integer mode
> dds = DESeq(ddsFullCountTable)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> contrast=c("group","group2","group1")
> res = results(dds, contrast)
> sum(res$padj < 0.05 , na.rm = TRUE)
[1] 752
> res_table = as.data.frame(res)
> res_table['Omu_14977_c0_g2_i4',]
baseMean log2FoldChange lfcSE stat pvalue padj
Omu_14977_c0_g2_i4 2560.991 7.057753 1.652229 4.271656 NA NA
Looks like in the second case DESeq think that Omu_14977_c0_g2_i4 is outlier in the group 2, and for this transcript p-value is NA.
What is the right way to consider it? Is it really necessary to run only groups in pairs in the analysis? I thought that contrast was specifically invented to then extract results for different combinations of sample groups (different conditions). It's just that if not, and run all together, those 3200-700=2500 DE genes would be lost, and the biological sense in them is: well, yes, in group 2 they are highly overexpressed.... How to set outliers filter correctly?
Thank you in advance, Ilya
> sessionInfo()
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20.1
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=ru_RU.UTF-8 LC_NUMERIC=C LC_TIME=ru_RU.UTF-8 LC_COLLATE=ru_RU.UTF-8 LC_MONETARY=ru_RU.UTF-8
[6] LC_MESSAGES=ru_RU.UTF-8 LC_PAPER=ru_RU.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] tximport_1.26.1 DESeq2_1.38.3 SummarizedExperiment_1.28.0 Biobase_2.58.0 MatrixGenerics_1.10.0
[6] matrixStats_0.63.0 GenomicRanges_1.50.2 GenomeInfoDb_1.34.9 IRanges_2.32.0 S4Vectors_0.36.2
[11] BiocGenerics_0.44.0 edgeR_3.40.2 limma_3.54.2
loaded via a namespace (and not attached):
[1] KEGGREST_1.38.0 locfit_1.5-9.7 tidyselect_1.2.0 lattice_0.20-45 generics_0.1.3 colorspace_2.1-0
[7] vctrs_0.5.2 utf8_1.2.3 blob_1.2.3 XML_3.99-0.13 rlang_1.0.6 pillar_1.8.1
[13] glue_1.6.2 DBI_1.1.3 BiocParallel_1.32.5 bit64_4.0.5 RColorBrewer_1.1-3 GenomeInfoDbData_1.2.9
[19] lifecycle_1.0.3 zlibbioc_1.44.0 Biostrings_2.66.0 munsell_0.5.0 gtable_0.3.1 codetools_0.2-19
[25] memoise_2.0.1 geneplotter_1.76.0 fastmap_1.1.1 parallel_4.2.2 AnnotationDbi_1.60.1 fansi_1.0.4
[31] Rcpp_1.0.10 xtable_1.8-4 BiocManager_1.30.20 scales_1.2.1 cachem_1.0.7 DelayedArray_0.24.0
[37] annotate_1.76.0 XVector_0.38.0 bit_4.0.5 ggplot2_3.4.1 png_0.1-8 dplyr_1.1.0
[43] grid_4.2.2 cli_3.6.0 tools_4.2.2 bitops_1.0-7 magrittr_2.0.3 RCurl_1.98-1.10
[49] RSQLite_2.3.0 tibble_3.2.0 pkgconfig_2.0.3 crayon_1.5.2 Matrix_1.5-3 httr_1.4.5
[55] rstudioapi_0.14 R6_2.5.1 compiler_4.2.2
I'm happy to provide help, we've used Salmon -> tximeta -> swish many times in the lab, and get great results.
A limitation is that it works well with ~5 samples per group, it's underpowered with n=3 to be honest.
You can also use Salmon -> edgeR::catchSalmon, as described at BioC2022. I'll ask Pedro to fill in details.
As mentioned by ATpoint, DE analysis of RNA-seq data at the transcript-level should account for the mapping ambiguity overdispersion resulting from the uncertainty of assigning reads to transcripts.
In edgeR, we estimate this overdispersion using the bootstrap counts that is output by Salmon when it is run with the option
--numBootstraps
. Transcript-level raw counts are then scaled down by the estimated overdispersion to reflect their true precision prior to proceeding with the standard DE pipeline. Briefly:In principle, scaled counts from
catchSalmon
could also be used as input to DESeq2 for a DE analysis.Thanks Pedro Baldoni for this. Is there somewhere a tutorial on how to run a differential transcript expression analysis with Kallisto and EdgeR?
The function
catchKallisto
in edgeR reads in the output from kallisto and estimates the mapping ambiguity overdispersion similarlySee the edgeR User's Guide and Pedro's paper, which is now published: https://pubmed.ncbi.nlm.nih.gov/38059347/
@OP, catchSalmon is from edgeR, see