Question

different number of DEG because of outliers

0

Entering edit mode

Ilya • 0

@c7243231

Last seen 21 months ago

France

Hello,

I run DESeq2 onto my RNAseq data contains 3 experimental conditions (group1, group2 and group3). When I run DESeq2 with groups 1 and 2 only, its identify 3239 genes as DE (p-value < 0.05). Here is output with inspection of one transcript, Omu_14977_c0_g2_i4, that is absent in the next run:

> data = read.table("/mnt/4tb/onchi_new/DE/uniq/RSEM.isoform.counts.matrix", header=T, row.names=1, com='')
> data['Omu_14977_c0_g2_i4',]
                   group1_rep2 group1_rep3 group2_rep1 group2_rep2 group3_rep1 group3_rep2 group3_rep3
Omu_14977_c0_g2_i4       49.76        53.1    12888.89     10298.2       19.62          31      785.84
> col_ordering = c(1,2,3,4)
> rnaseqMatrix = data[,col_ordering]
> rnaseqMatrix = round(rnaseqMatrix)
> conditions = data.frame(group=factor(c(rep("group1", 2), rep("group2", 2))))
> rownames(conditions) = colnames(rnaseqMatrix)
> conditions
             group
group1_rep2 group1
group1_rep3 group1
group2_rep1 group2
group2_rep2 group2
> ddsFullCountTable <- DESeqDataSetFromMatrix(
+   countData = rnaseqMatrix,
+   colData = conditions,
+   design = ~ group)
converting counts to integer mode
> dds = DESeq(ddsFullCountTable)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> contrast=c("group","group2","group1")
> res = results(dds, contrast)
> sum(res$padj < 0.05 , na.rm = TRUE)
[1] 3239
> res_table = as.data.frame(res)
> res_table['Omu_14977_c0_g2_i4',]
                   baseMean log2FoldChange     lfcSE     stat       pvalue         padj
Omu_14977_c0_g2_i4 4386.558       7.074097 0.4003181 17.67119 6.990643e-70 2.067762e-65

But if we run DESeq with all 3 groups, number of DE genes will be 752 only, almost 4 times lower:

> data = read.table("/mnt/4tb/onchi_new/DE/uniq/RSEM.isoform.counts.matrix", header=T, row.names=1, com='')
> data['Omu_14977_c0_g2_i4',]
                   group1_rep2 group1_rep3 group2_rep1 group2_rep2 group3_rep1 group3_rep2 group3_rep3
Omu_14977_c0_g2_i4       49.76        53.1    12888.89     10298.2       19.62          31      785.84
> col_ordering = c(1,2,3,4,5,6,7)
> rnaseqMatrix = data[,col_ordering]
> rnaseqMatrix = round(rnaseqMatrix)
> conditions = data.frame(group=factor(c(rep("group1", 2), rep("group2", 2), rep("group3", 3))))
> rownames(conditions) = colnames(rnaseqMatrix)
> conditions
             group
group1_rep2 group1
group1_rep3 group1
group2_rep1 group2
group2_rep2 group2
group3_rep1 group3
group3_rep2 group3
group3_rep3 group3
> ddsFullCountTable <- DESeqDataSetFromMatrix(
+   countData = rnaseqMatrix,
+   colData = conditions,
+   design = ~ group)
converting counts to integer mode
> dds = DESeq(ddsFullCountTable)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> contrast=c("group","group2","group1")
> res = results(dds, contrast)
> sum(res$padj < 0.05 , na.rm = TRUE)
[1] 752
> res_table = as.data.frame(res)
> res_table['Omu_14977_c0_g2_i4',]
                   baseMean log2FoldChange    lfcSE     stat pvalue padj
Omu_14977_c0_g2_i4 2560.991       7.057753 1.652229 4.271656     NA   NA

Looks like in the second case DESeq think that Omu_14977_c0_g2_i4 is outlier in the group 2, and for this transcript p-value is NA.

What is the right way to consider it? Is it really necessary to run only groups in pairs in the analysis? I thought that contrast was specifically invented to then extract results for different combinations of sample groups (different conditions). It's just that if not, and run all together, those 3200-700=2500 DE genes would be lost, and the biological sense in them is: well, yes, in group 2 they are highly overexpressed.... How to set outliers filter correctly?

Thank you in advance, Ilya

> sessionInfo()
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20.1

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=ru_RU.UTF-8       LC_NUMERIC=C               LC_TIME=ru_RU.UTF-8        LC_COLLATE=ru_RU.UTF-8     LC_MONETARY=ru_RU.UTF-8   
 [6] LC_MESSAGES=ru_RU.UTF-8    LC_PAPER=ru_RU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tximport_1.26.1             DESeq2_1.38.3               SummarizedExperiment_1.28.0 Biobase_2.58.0              MatrixGenerics_1.10.0      
 [6] matrixStats_0.63.0          GenomicRanges_1.50.2        GenomeInfoDb_1.34.9         IRanges_2.32.0              S4Vectors_0.36.2           
[11] BiocGenerics_0.44.0         edgeR_3.40.2                limma_3.54.2               

loaded via a namespace (and not attached):
 [1] KEGGREST_1.38.0        locfit_1.5-9.7         tidyselect_1.2.0       lattice_0.20-45        generics_0.1.3         colorspace_2.1-0      
 [7] vctrs_0.5.2            utf8_1.2.3             blob_1.2.3             XML_3.99-0.13          rlang_1.0.6            pillar_1.8.1          
[13] glue_1.6.2             DBI_1.1.3              BiocParallel_1.32.5    bit64_4.0.5            RColorBrewer_1.1-3     GenomeInfoDbData_1.2.9
[19] lifecycle_1.0.3        zlibbioc_1.44.0        Biostrings_2.66.0      munsell_0.5.0          gtable_0.3.1           codetools_0.2-19      
[25] memoise_2.0.1          geneplotter_1.76.0     fastmap_1.1.1          parallel_4.2.2         AnnotationDbi_1.60.1   fansi_1.0.4           
[31] Rcpp_1.0.10            xtable_1.8-4           BiocManager_1.30.20    scales_1.2.1           cachem_1.0.7           DelayedArray_0.24.0   
[37] annotate_1.76.0        XVector_0.38.0         bit_4.0.5              ggplot2_3.4.1          png_0.1-8              dplyr_1.1.0           
[43] grid_4.2.2             cli_3.6.0              tools_4.2.2            bitops_1.0-7           magrittr_2.0.3         RCurl_1.98-1.10       
[49] RSQLite_2.3.0          tibble_3.2.0           pkgconfig_2.0.3        crayon_1.5.2           Matrix_1.5-3           httr_1.4.5            
[55] rstudioapi_0.14        R6_2.5.1               compiler_4.2.2

DESeq2 RNASeq • 1.4k views

ADD COMMENT • link updated 12 months ago by Gordon Smyth 52k • written 22 months ago by Ilya • 0

Michael Love · Answer 1 · 2023-03-16

1

Entering edit mode

ATpoint ★ 4.6k

@atpoint-13662

Last seen 6 days ago

Germany

The vignettes discusses this: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#if-i-have-multiple-groups-should-i-run-all-together-or-split-into-pairs-of-groups

Beyond this, note that DESeq2 is not intended for transcript-level DE analysis as one should incorporate the inferential uncertaintly into analysis. See for example the swish method in the fishpond package, which runs downstream of salmon. You might want to summarize tx to gene level with tximport first for DESeq2.

ADD COMMENT • link 22 months ago ATpoint ★ 4.6k

2

Entering edit mode

I'm happy to provide help, we've used Salmon -> tximeta -> swish many times in the lab, and get great results.

A limitation is that it works well with ~5 samples per group, it's underpowered with n=3 to be honest.

You can also use Salmon -> edgeR::catchSalmon, as described at BioC2022. I'll ask Pedro to fill in details.

ADD REPLY • link 22 months ago Michael Love 43k

2

Entering edit mode

As mentioned by ATpoint, DE analysis of RNA-seq data at the transcript-level should account for the mapping ambiguity overdispersion resulting from the uncertainty of assigning reads to transcripts.

In edgeR, we estimate this overdispersion using the bootstrap counts that is output by Salmon when it is run with the option --numBootstraps. Transcript-level raw counts are then scaled down by the estimated overdispersion to reflect their true precision prior to proceeding with the standard DE pipeline. Briefly:

catch <- catchSalmon(paths)
dge <- DGEList(counts = catch$counts/catch$annotation$Overdispersion, genes=catch$annotation)

In principle, scaled counts from catchSalmon could also be used as input to DESeq2 for a DE analysis.

ADD REPLY • link 22 months ago Pedro Baldoni ▴ 40

0

Entering edit mode

Thanks Pedro Baldoni for this. Is there somewhere a tutorial on how to run a differential transcript expression analysis with Kallisto and EdgeR?

ADD REPLY • link 22 months ago Assa Yeroslaviz ★ 1.5k

0

Entering edit mode

The function catchKallisto in edgeR reads in the output from kallisto and estimates the mapping ambiguity overdispersion similarly

ADD REPLY • link 21 months ago Pedro Baldoni ▴ 40

0

Entering edit mode

See the edgeR User's Guide and Pedro's paper, which is now published: https://pubmed.ncbi.nlm.nih.gov/38059347/

ADD REPLY • link 12 months ago Gordon Smyth 52k

0

Entering edit mode

@OP, catchSalmon is from edgeR, see

ADD REPLY • link updated 22 months ago by Michael Love 43k • written 22 months ago by ATpoint ★ 4.6k