Hi, everyone!
I have a NanoString expression set and I want to do GSVA. I followed the procedures of An approach for normalization and quality control for NanoString RNA expression data by Dr. Michael I. Love:
- Quality control.
- Identify housekeeping genes.
- Pre-normalization visualization via RLE plot and PCA plot. I found batch effects in the expression set.
- I used
RUVSeq
,DESeq2
andlimma
packages to remove batch effect and normalize the data. The codes mainly come from the Github provided by the article mentioned above. I changed slightly according to my data.
library(RUVSeq)
set <- newSeqExpressionSet(as.matrix(raw), phenoData = pData, featureData = fData)
cIdx <- rownames(set)[fData(set)$Class == "Housekeeping"]
set <- betweenLaneNormalization(set, which = "upper")
set <- RUVg(set, cIdx, k = 1)
library(DESeq2)
library(limma)
dds <- DESeqDataSetFromMatrix(counts(set), colData = pData(set), design = ~1)
rowData(dds) <- fData
dds <- estimateSizeFactors(dds)
dds <- estimateDispersionsGeneEst(dds)
dds <- estimateDispersions(dds, fitType = "mean")
vsd <- varianceStabilizingTransformation(dds, blind = FALSE)
mat <- assay(vsd)
covars <- as.matrix(colData(dds)[,grep("W",colnames(colData(dds))),drop = FALSE])
mat <- removeBatchEffect(mat, covariates = covars)
vsd.after <- vsd
assay(vsd.after) <- mat
- I used dataset in
vsd.after
to draw RLE and PCA plot again to check if there were still batch effects. It showed the batch effects were removed.
According to the protocol of GSVA
package, the input of gsva()
should be a normalized gene expression dataset. I think the dataset in vsd.after
is the one after being removed batch effect and normalization. But I remembered the protocol of DESeq2
said that it is only good for visualization and could not be used for downstream analysis.
I also found there is a dataset in set
, which is different from my raw expression set and not being normalized by DESeq2
and limma
package. It could be extracted by the code set@assayData[["normalizedCounts"]]
. I am not sure what it is. Is it the counts without batch effects? Could it be used for GSVA directly?
Could anyone tell me which one (vsd.after
or set
) could be used for GSVA? If they are not the proper data, how can I get the normalized dataset for GSVA?
Thank you!