Hello everyone,
I have two batches of RNA-Seq data, and I want to make expression levels comparable both across different genes within the same sample and for the same gene across different samples, so that I can run ssGSEA, WGCNA, and machine-learning analyses on the expression matrix.
My idea is to use EDASeq to adjust for GC content, gene length, and library size, and then use RUVSeq (RUVs, with replicate groups) to correct for batch effects.
Here is an example of my code:
library(EDASeq)
library(RUVSeq)

# Build the SeqExpressionSet from raw counts plus sample and gene annotation
data <- newSeqExpressionSet(
  counts = as.matrix(Total_express),
  phenoData = sample_Info,
  featureData = gene_Info
)

# Within-lane: adjust for GC content, then gene length
# ("GC" and "exon_length" are columns of featureData)
gc_norm <- withinLaneNormalization(data, "GC", which = "upper")
gl_norm <- withinLaneNormalization(gc_norm, "exon_length", which = "upper")

# Between-lane: full-quantile normalization for library size
lib_norm <- betweenLaneNormalization(gl_norm, which = "full")

# RUVs: estimate unwanted variation from replicate groups,
# using all genes as negative controls (the cIdx argument is required)
Controls <- makeGroups(factor(pData(lib_norm)[, 1]))
Total_adj <- RUVs(x = lib_norm, cIdx = rownames(lib_norm), k = 1, scIdx = Controls)

# RUVs on a SeqExpressionSet returns a SeqExpressionSet;
# the adjusted counts are retrieved with normCounts()
Total_adj <- normCounts(Total_adj)
My questions are:
According to the RUVSeq user manual, for differential expression analysis the estimated factors of unwanted variation (W) should be included as covariates in the model, rather than using the normalized (pseudo-)counts directly. I understand this is because the correction disrupts the negative binomial distribution of the count matrix, which DESeq2 and edgeR rely on. However, since ssGSEA and WGCNA do not depend on the negative binomial distribution, do you think using a log-transformed pseudo-count matrix is acceptable?
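To make the question concrete, this is the transformation I have in mind before feeding the matrix into ssGSEA/WGCNA (a minimal sketch in base R; the small `norm_counts` matrix is a made-up stand-in for the RUVs-adjusted counts from the pipeline above):

```r
# Hypothetical example matrix standing in for the RUVs-normalized counts
norm_counts <- matrix(
  c(0, 10, 100, 1000, 5, 50, 500, 5000),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2"))
)

# log2 with a pseudo-count of 1; this log-expression matrix is what
# I would pass to ssGSEA and WGCNA
log_expr <- log2(norm_counts + 1)
```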
Thank you in advance for your suggestions.
Best regards
