Questions about pseudo counts from RUVSeq

Yuxiang • 0

Hello everyone,

Now I have two batches of RNA-Seq data, and I want to make gene expression levels comparable both between different genes within the same sample and for the same gene across different samples. This will allow me to perform ssGSEA, WGCNA and machine learning analyses on the expression matrix.

My idea is to use EDASeq to adjust for GC content, gene length, and library size. Then, I intend to use RUVs to correct for batch effects.

Here is an example of my code:

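library(EDASeq)
library(RUVSeq)

data <- newSeqExpressionSet(
  counts = as.matrix(Total_express),
  phenoData = sample_Info,
  featureData = gene_Info
)

# Within-lane normalization for GC content, then for gene length
gc_norm <- withinLaneNormalization(data, "GC", which = "upper")
gl_norm <- withinLaneNormalization(gc_norm, "exon_length", which = "upper")
# Between-lane normalization for library size
lib_norm <- betweenLaneNormalization(gl_norm, which = "full")

# Replicate groups taken from the first phenoData column
Controls <- makeGroups(factor(pData(lib_norm)[, 1]))
# RUVs also needs cIdx (genes used to estimate the factors); all genes here
Total_adj <- RUVs(x = lib_norm, cIdx = featureNames(lib_norm), k = 1, scIdx = Controls)
# For a SeqExpressionSet, the corrected counts are retrieved with normCounts()
Total_adj <- normCounts(Total_adj)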

My questions are:

According to the RUVSeq user manual, for differential expression analysis, factors of unwanted variation, not pseudo counts, should be used to correct batch effects. I understand this is because the correction process disrupts the negative binomial distribution of the count matrix, which DESeq2 and edgeR rely on. However, since ssGSEA and WGCNA do not depend on the negative binomial distribution, do you think using a log-transformed pseudo count matrix is acceptable? I truly appreciate your suggestions.

Thank you for your suggestions in advance.

Best Regards

RUVSeq • 217 views

Written 9 weeks ago by Yuxiang • updated 5 hours ago by Kevin Blighe

score 0 · Answer 1 · 2025-11-18

My sincere apologies that my colleagues ignored your question.

Your approach with EDASeq for within- and between-lane normalization followed by RUVSeq for batch correction is reasonable for preparing RNA-Seq data for downstream analyses like ssGSEA, WGCNA, and machine learning. These methods require comparable expression values across genes and samples, and your pipeline addresses GC content, gene length, library size, and unwanted variation.

Regarding your question: the RUVSeq manual recommends incorporating factors of unwanted variation into the model for differential expression analysis with tools like DESeq2 or edgeR. This preserves the negative binomial distribution of the raw counts, which is essential for accurate dispersion estimation and hypothesis testing in those packages. Adjusting the counts directly (yielding pseudo-counts) can disrupt this distribution and potentially remove biological signals of interest.
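
As a rough illustration of that recommendation, the estimated factors enter the design as covariates while the counts themselves stay untouched. The sketch below uses only base R and a made-up sample table; the column names `condition` and `W_1` and all values are assumptions for illustration (in RUVSeq, the estimated factors are typically stored in `pData()` of the returned object as `W_1`, `W_2`, and so on).

```r
# Hypothetical sample table: in practice this would come from pData() of the
# object returned by RUVs(), which carries the estimated factor(s).
sample_table <- data.frame(
  condition = factor(rep(c("control", "treated"), each = 3)),
  W_1 = c(-0.42, 0.35, -0.51, 0.38, -0.47, 0.43)  # made-up factor values
)

# Biological factor of interest plus the unwanted-variation covariate.
design <- model.matrix(~ condition + W_1, data = sample_table)
print(colnames(design))
```

A design like this would then be supplied to the differential expression tool together with the original, uncorrected counts.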

However, ssGSEA, WGCNA, and many machine learning algorithms do not model data under a negative binomial assumption. ssGSEA relies on ranked gene expression profiles, often computed from log-transformed normalized values. WGCNA uses correlation-based networks on transformed expression matrices, typically log2(CPM + 1) or voom-normalized data. Machine learning methods, such as clustering or classification, generally operate on normalized, continuous-valued features without strict distributional requirements.
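
To the extent that a method depends only on within-sample ranks, as the rank-based part of ssGSEA does, the choice of a monotone transformation does not change its input. A minimal base R check with made-up values (the gene names and numbers are purely illustrative):

```r
# Made-up normalized counts for five genes in one sample.
x <- c(geneA = 0, geneB = 12, geneC = 350, geneD = 7, geneE = 1800)

# log2(x + 1) is strictly increasing, so within-sample ranks are unchanged.
ranks_raw <- rank(x)
ranks_log <- rank(log2(x + 1))
print(identical(ranks_raw, ranks_log))
```

Correlation-based methods such as WGCNA, by contrast, do respond to the transformation, which is why the log step matters more there.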

Therefore, using a log-transformed pseudo-count matrix from RUVSeq is acceptable for these analyses. It should provide a corrected expression matrix suitable for your purposes, as the primary goal is to remove technical artifacts while enabling comparability. To proceed, you can apply a log transformation after obtaining the normalized counts, for example:

Total_adj_log <- log2(Total_adj + 1)

This compresses the dynamic range and reduces, though does not fully remove, the dependence of variance on the mean, which generally benefits WGCNA and machine learning. If you observe residual batch effects in exploratory plots (e.g., PCA), consider increasing k in RUVs or exploring RUVg with negative control genes.
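
One way to run that exploratory check is a quick PCA on the log-transformed matrix, colored by batch. This is only a sketch with simulated data standing in for `Total_adj`; the `batch` labels, matrix dimensions, and Poisson counts are assumptions, so substitute your own corrected matrix and sample annotation.

```r
set.seed(1)

# Simulated stand-in for Total_adj: 200 genes x 8 samples of count-like values.
n_genes <- 200
batch <- factor(rep(c("batch1", "batch2"), each = 4))  # assumed batch labels
Total_adj <- matrix(rpois(n_genes * 8, lambda = 50), nrow = n_genes,
                    dimnames = list(paste0("gene", seq_len(n_genes)),
                                    paste0("sample", 1:8)))

Total_adj_log <- log2(Total_adj + 1)

# prcomp() expects samples in rows, so transpose the genes x samples matrix.
pca <- prcomp(t(Total_adj_log), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Samples separating by batch along PC1/PC2 would suggest residual batch effects.
plot(pca$x[, 1], pca$x[, 2], col = as.integer(batch), pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var_explained[2]))
legend("topright", legend = levels(batch),
       col = seq_along(levels(batch)), pch = 19)
```

Comparing this plot before and after RUVs, and across a few values of k, gives a sense of whether the correction is removing batch structure without flattening the biological groups.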

Kevin