I am currently working to analyze RNAseq patient data for lung cancer patients available through cBioPortal/TCGA. More specifically, I want to pass patient RNAseq count data to varianceStabilizingTransformation before doing additional downstream clustering. When I attempt to pass the data to VST, I receive an error stating: "Error in DESeqDataSet(se, design = design, ignoreRank): some values in assay are negative." When looking at the count data from cBioPortal, the data does indeed contain negative values. This makes sense considering that the count data had already been log transformed (and potentially normalized? I haven't been able to verify this in the cBioPortal documentation or in other literature) before I obtained it.
I've read the DESeq2 vignette and related literature, but I am struggling with how to deal with these negative values. I've seen some researchers simply set negative values to 0. Others add a constant to the entire matrix so that all values are greater than 0. Colleagues have suggested simply 'undoing' the log transformation before passing the count data to VST. How should I handle these negative values before passing the data to VST? Below is my simple code as well as a sample of my input patient data. Thanks in advance!
Sample Code
library(DESeq2)
analysis <- read.delim("C:/..../analysis.txt")
gene_matrix <- as.matrix(analysis)
stabilized <- varianceStabilizingTransformation(gene_matrix, blind = TRUE, fitType = "parametric")
>> Error in DESeqDataSet(se, design = design, ignoreRank): some values in assay are negative.
Patient Data Sample (directly from cBioPortal)
SAMPLE_ID        Gene A   Gene B   Gene C   Gene D   Gene E   Gene F   Gene G
TCGA-05-4249-01  -0.7275  -0.7416  -0.8330   5.1667  -0.7212  -0.2081   0.9704
TCGA-05-4384-01  -0.8908  -0.8282  -1.1507  -0.1649  -0.4065  -0.8573   0.3573
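Yes, Sander is correct that DESeq2 requires counts as input. Since this data is already transformed, it cannot be used as input to DESeq2. The VST itself produces log2-like values, so it is meant to be applied to raw counts rather than to data that has already been log transformed.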
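To make the "counts as input" requirement concrete, here is a minimal sketch in base R of a sanity check you could run before calling VST. The helper name looks_like_counts and the raw_counts values are made up for illustration and are not part of DESeq2; the log_values matrix reuses a few numbers from the sample posted above. DESeq2 also expects genes in rows and samples in columns, so a table laid out like the one in the question would need to be transposed first.

```r
# Hypothetical helper (not part of DESeq2): TRUE only if every value is a
# finite, non-negative whole number, i.e. the matrix plausibly holds raw counts.
looks_like_counts <- function(m) {
  is.numeric(m) && all(is.finite(m)) && all(m >= 0) && all(m == round(m))
}

sample_ids <- c("TCGA-05-4249-01", "TCGA-05-4384-01")
gene_ids <- c("Gene A", "Gene B", "Gene D")

# A few values from the posted cBioPortal sample (already transformed)
log_values <- matrix(
  c(-0.7275, -0.7416,  5.1667,
    -0.8908, -0.8282, -0.1649),
  nrow = 2, byrow = TRUE,
  dimnames = list(sample_ids, gene_ids)
)

# Made-up raw counts, for illustration only
raw_counts <- matrix(
  c(0, 12, 340,
    7,  0,  95),
  nrow = 2, byrow = TRUE,
  dimnames = list(sample_ids, gene_ids)
)

looks_like_counts(log_values)  # FALSE: negative, non-integer values
looks_like_counts(raw_counts)  # TRUE

# Samples are in rows here, so transpose to genes x samples for DESeq2
counts_for_deseq <- t(raw_counts)
storage.mode(counts_for_deseq) <- "integer"

# With a real count matrix of sufficient size, the call would then be:
# stabilized <- DESeq2::varianceStabilizingTransformation(counts_for_deseq, blind = TRUE)
```

If the check returns FALSE, the data is not raw counts and the fix is to obtain the raw counts, not to shift or clip the values.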
Sander, thanks for your response! I went through the files posted on the Datahub but I don't think the information I need is available there (I need raw, un-normalized RNAseq counts if possible). I was able to pull down un-normalized RNAseq counts for only 162 of the LUAD patients using Firehose but it appears the same information is not available for other patients in the LUAD study. Is there a way I could access this data?
Perhaps it's available at https://portal.gdc.cancer.gov, which should contain all the publicly released TCGA data.