Question

DESeq2 - how many surrogate variables from svaseq, use of lfcshrink

0

Entering edit mode

ja569116 • 0

@08ccc285

Last seen 10 hours ago

United States

Hi,

I have a microRNA dataset, and I have samples from 2 suborders, 5 families, and 7 species. I would like to find differentially expressed microRNAs to then conduct functional enrichment with like gene ontology, or KEGG pathway. I first checked the data for patterns using PCA and clustering, and I first used the normalized counts norm_cetaceans <- counts(res_27_cets, normalized = TRUE)

But later I found that it is suggested to use either VST or rlog transformation so I used VST for the exploratory analysis.

Are exploratory analysis with VST equivalent with the normalized counts from DESeq()?

Also, I checked the VST dataset, and some values are negative, is this okay, expected?

head(counts_VST_19, c(6L, 6L))

135208-DM-Ttru 135209-ET-Ttru 137808-DM-Ttru 137809-ET-Ttru 145459-Ddel 145462-Ddel
hsa-miR-324-3p -3.658031 4.448456 -3.658031 -3.658031 2.9010645 -3.658031
hsa-miR-212-3p -3.658031 -3.658031 -3.658031 -3.658031 -3.6580309 -3.658031
hsa-miR-660-5p 7.108520 6.469499 7.325406 7.607373 8.0828895 7.469345
hsa-miR-670-3p -3.658031 -3.658031 -3.658031 -3.658031 -3.6580309 -3.658031
hsa-miR-876-3p -3.658031 -3.658031 -3.658031 -3.658031 -3.6580309 -3.658031
hsa-miR-382-3p -3.658031 -3.658031 -3.658031 -3.658031 -0.7916052 -3.658031

Also, in the exploratory analysis section, it uses the code: vsd <- vst(dds, blind = FALSE)

Shouldn't one use blind=T?

In the website it says : "In the above function calls, we specified blind = FALSE, which means that differences between cell lines and treatment (the variables in the design) will not contribute to the expected variance-mean trend of the experiment", while on the R ?help states: blind logical, whether to blind the transformation to the experimental design. blind=TRUE should be used for comparing samples in a manner unbiased by prior information on samples, for example to perform sample QA (quality assurance).

In the RNA-seq workflow it mentions using svaseq to identify Removing hidden batch effects. After the exploratory analysis, some samples showed ill-behavior, not clustering with its species, far away in the dendrogram/PCA (3D). After removing those, I did PCA and clustering, and the dataset looked better. I am undecided whether it is beneficial or recommended to check for hidden batch effects. For instance, after removing the weird samples, many microRNAs were removed due to all-zeros. Here are the plots for SVA - using 3 surrogate variables suggested by . Plot of surrogate 1-3 ~ ind/sps Does the data show any pattern to suggest using 3 surrogate variables?

I also wanted to mention that I have 4 species, but I also want to analyse and find DE-microRNAs at the family (3 groups) and suborder level (2). When I add the design ~ Suborder + Family, DESEq mentions that the groups are linear combinations of others. Is this error normal, so that I have to use separate designs?

Finally, as I said, I want to get the DE-mircroRNAs, and do an enrichment analysis but I am confused about how to proceed. I read that one should use the padjust values and rank the microRNAs (genes) by lfcShrink, but my confusing is that lfcShrink also adjusts padjust values. It seems weird that I used the padjust values from results but then use the log2FoldChange from lfcShrink - which is a different "matrix" with different padjust values.

Sorry for the questions, I want to make sure that I am proceeding the right way.

Thanks

sessionInfo( )

R version 4.3.2 (2023-10-31 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 11 x64 (build 26100)

Matrix products: default

locale: 1 LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: America/New_York tzcode source: internal

attached base packages: 1 stats4 stats graphics grDevices utils datasets methods base

other attached packages: 1 sva_3.50.0 BiocParallel_1.36.0 mgcv_1.9-3
[4] nlme_3.1-168 SWATH2stats_1.32.1 IHW_1.30.0
[7] genefilter_1.84.0 ggdendro_0.2.0 gridExtra_2.3
[10] tsne_0.1-3.1 glmpca_0.2.0 RColorBrewer_1.1-3
[13] PoiClaClu_1.0.2.1 umap_0.2.10.0 plotly_4.11.0
[16] lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0
[19] dplyr_1.1.4 purrr_1.0.4 readr_2.1.6
[22] tidyr_1.3.1 tibble_3.2.1 tidyverse_2.0.0
[25] factoextra_1.0.7 FactoMineR_2.12 pheatmap_1.0.13
[28] gplots_3.2.0 scales_1.4.0 ggpubr_0.6.2
[31] ggplot2_4.0.0 DESeq2_1.42.1 SummarizedExperiment_1.32.0 [34] Biobase_2.62.0 MatrixGenerics_1.14.0 matrixStats_1.5.0
[37] GenomicRanges_1.54.1 GenomeInfoDb_1.38.8 IRanges_2.36.0
[40] S4Vectors_0.40.2 BiocGenerics_0.48.1

loaded via 1 splines_4.3.2 later_1.4.4 filelock_1.0.3 bitops_1.0-9
XML_3.99-0.20 lifecycle_1.0.4 rstatix_0.7.3
processx_3.8.6 lattice_0.22-7 MASS_7.3-60
crosstalk_1.2.2 backports_1.5.0 magrittr_2.0.4
yaml_2.3.10 remotes_2.5.0 httpuv_1.6.16
askpass_1.2.1 sessioninfo_1.2.3 pkgbuild_1.4.8
DBI_1.2.3 cowplot_1.2.0 multcomp_1.4-29
pkgload_1.4.1 zlibbioc_1.48.2 RCurl_1.98-1.17
rappdirs_0.3.3 sandwich_3.1-1 GenomeInfoDbData_1.2.11 RSpectra_0.16-2 annotate_1.80.0 codetools_0.2-20
xml2_1.5.1 DT_0.34.0 tidyselect_1.2.1
BiocFileCache_2.10.2 jsonlite_2.0.0 ellipsis_0.3.2
survival_3.8-3 emmeans_2.0.1 bbmle_1.0.25.1
tools_4.3.2 Rcpp_1.1.0 glue_1.8.0
usethis_3.2.1 withr_3.0.2 numDeriv_2016.8-1.1
fastmap_1.2.0 openssl_2.3.4 callr_3.7.6
digest_0.6.37 timechange_0.3.0 R6_2.6.1
estimability_1.5.1 colorspace_2.1-2 gtools_3.9.5
RSQLite_2.4.5 dichromat_2.0-0.1 utf8_1.2.6
ggsci_4.2.0 data.table_1.17.8 prettyunits_1.2.0
htmlwidgets_1.6.4 S4Arrays_1.2.1 scatterplot3d_0.3-44
gtable_0.3.6 blob_1.2.4 S7_0.2.0
htmltools_0.5.8.1 carData_3.0-5 profvis_0.4.0
leaps_3.2 png_0.1-8 rstudioapi_0.17.1
tzdb_0.5.0 coda_0.19-4.1 curl_7.0.0
zoo_1.8-14 cachem_1.1.0 KernSmooth_2.23-26
miniUI_0.1.2 AnnotationDbi_1.64.1 desc_1.4.3
pillar_1.11.1 grid_4.3.2 vctrs_0.6.5
urlchecker_1.0.1 promises_1.5.0 car_3.1-3
xtable_1.8-4 cluster_2.1.8.1 mvtnorm_1.3-3
locfit_1.5-9.12 compiler_4.3.2 rlang_1.1.5
ggsignif_0.6.4 fdrtool_1.2.18 labeling_0.4.3
ps_1.9.1 plyr_1.8.9 fs_1.6.6
viridisLite_0.4.2 Biostrings_2.70.3 lazyeval_0.2.2
Matrix_1.6-4 hms_1.1.4 bit64_4.6.0-1
KEGGREST_1.42.0 shiny_1.12.1 broom_1.0.11

```

sva lfcshrink DESeq2 • 297 views

ADD COMMENT • link 4 weeks ago • updated 2 days ago ja569116 • 0

score 0 · Answer 1 · 2025-12-20

0

Entering edit mode

aqeel • 0

@12cc4cde

Last seen 29 days ago

Pakistan

Discover the pinnacle of Brazilian Jiu Jitsu gear at Dragon GI.

ADD COMMENT • link 4 weeks ago aqeel • 0

score 0 · Answer 2 · 2025-12-22

and I first used the normalized counts

You would not use these normalized counts for PCA as they're not on log2 scale. As such, the feature selection steps in PCA just retrieves genes with large variances due to large counts rather than differences between samples. Non-log counts are also far away from normality which PCA assumes. At least do a log2+1 transformation if you want to use these.

But later I found that it is suggested to use either VST or rlog transformation so I used VST for the exploratory analysis.

Are exploratory analysis with VST equivalent with the normalized counts from DESeq()?

Yes, vst/rlog that is recommended because both methods are log2 or log2-like and do some additional magic to remove noise from the data, so it's a good choice for PCA. For the second quote, as said above, plain normalized counts are a poor choice.

For QC, using blind = TRUE (the default) makes sense, see https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#blind-dispersion-estimation, but for something like downstream analysis (heatmaps, clustering), unblinded transformation makes sense.

--- vst ... negative values

That's probably due to sparsity, see regularized log transformation- loss of zero values for sparsely expressed genes

... sva ...

Personally I would only use svaseq / RUVseq if there is groups of samples that show odd behaviour in clustering, and if the sample size is large, such as in human studies, where confounders are not known. If you have individual outliers, where you already see that many genes are zeros, I don't really see the point, as the methods will not magically introduce counts to these samples. How many SVs to include into the design is a fundamental question that personally I have not yet read a convinving answer about, it's basically random.

I read that one should use the padjust values and rank the microRNAs (genes) by lfcShrink, but my confusing is that lfcShrink also adjusts padjust values. It seems weird that I used the padjust values from results but then use the log2FoldChange from lfcShrink - which is a different "matrix" with different padjust values.

The adjusted p-values are for significance filtering. The lfcShrink offers a method that removes noise from the logFCs and the authors recommend it for ranking purposes. You can rank by lfcShrink logFCs, or by pvalues (nominal, not adjusted as adjusted have ties), choice is yours.

For questions on experimental design, please seek local collaborations, this is beyond the scope here.