Question: Applying varianceStabilizingTransformation to log transformed RNA seq data from cBioPortal
4 months ago by Ben Morris (University of Virginia):

I am analyzing RNA-seq data for lung cancer patients available through cBioPortal/TCGA. Specifically, I want to pass the patient RNA-seq count data to varianceStabilizingTransformation (VST) before doing downstream clustering. When I pass the data to VST, I receive this error: "Error in DESeqDataSet(se, design = design, ignoreRank): some values in assay are negative." The count data from cBioPortal does indeed contain negative values. This makes sense given that the count data had already been log transformed (and potentially normalized? I haven't been able to verify this in the cBioPortal documentation or in other literature) before I obtained it.

I've read the DESeq2 vignette as well as related literature and am struggling with how to deal with these negative values. Some researchers simply set negative values to 0. Others add a constant to the entire matrix so that all values are > 0. Colleagues have suggested simply 'undoing' the log transformation before passing the count data to VST. How should I handle these negative values before passing the data to VST? Below is my simple code as well as a sample of my input patient data. Thanks in advance!

Sample Code


analysis <- read.delim("C:/..../analysis.txt")

gene_matrix <-

stabilized <- varianceStabilizingTransformation(gene_matrix, blind = TRUE, fitType = "parametric")

>> Error in DESeqDataSet(se, design = design, ignoreRank): some values in assay are negative.

Patient Data Sample (directly from cBioPortal)

SAMPLE_ID         Gene A    Gene B    Gene C    Gene D    Gene E    Gene F    Gene G
TCGA-05-4249-01  -0.7275   -0.7416   -0.8330    5.1667   -0.7212   -0.2081    0.9704
TCGA-05-4384-01  -0.8908   -0.8282   -1.1507   -0.1649   -0.4065   -0.8573    0.3573

4 months ago by Sander Tan (The Hyve, Netherlands):

Hi Ben,

The data in cBioPortal OncoPrint is RSEM normalized and z-scored, and is therefore not compatible with DESeq2. If I remember correctly, DESeq2 requires raw counts as input data. The z-score values can still be used for clustering, though. This can be done in cBioPortal OncoPrint by clicking the "Heatmap" button, then "Add Genes to Heatmap", then "Cluster Heatmap".
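For clustering z-scores outside the portal, a minimal base R sketch might look like the following. This is only an illustration: the helper name `cluster_samples`, the simulated matrix, the Euclidean distance, and the average linkage are all assumptions made for the example, not what cBioPortal's "Cluster Heatmap" does internally.

```r
# Hypothetical helper: hierarchical clustering of samples on z-scored expression.
# `z` is a numeric matrix with samples in rows and genes in columns.
cluster_samples <- function(z, k = 2) {
  stopifnot(is.matrix(z), is.numeric(z), nrow(z) >= k)
  d <- dist(z, method = "euclidean")     # pairwise sample-to-sample distances
  tree <- hclust(d, method = "average")  # average-linkage dendrogram
  cutree(tree, k = k)                    # cut the tree into k groups
}

# Simulated z-scores (NOT real patient data): two groups of three samples each.
set.seed(1)
z <- rbind(
  matrix(rnorm(3 * 5, mean = -1, sd = 0.2), nrow = 3),
  matrix(rnorm(3 * 5, mean =  1, sd = 0.2), nrow = 3)
)
rownames(z) <- paste0("sample_", 1:6)
colnames(z) <- paste0("gene_", 1:5)

groups <- cluster_samples(z, k = 2)
print(groups)
```

With real data, `z` would be the z-score table with negative values left as they are; no count-based transformation is needed for this step.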

If you'd like to do this outside of cBioPortal, you could obtain the expression tables from cBioPortal Datahub. For example, this directory contains the Lung Adenocarcinoma TCGA provisional data:

It contains several files for RNA-Seq expression:

  • data_RNA_Seq_v2_expression_median.txt - RSEM normalized
  • data_RNA_Seq_v2_mRNA_median_Zscores.txt - RSEM normalized, z-score transformed
  • meta_RNA_Seq_v2_expression_median.txt - Meta information about data file
  • meta_RNA_Seq_v2_mRNA_median_Zscores.txt - Meta information about data file
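As a rough sketch of how one of these tables could be loaded into R: the example below assumes the data files are tab-delimited with `Hugo_Symbol` and `Entrez_Gene_Id` annotation columns followed by one column per sample, which should be checked against the actual file. The helper `expression_matrix` and the toy values are hypothetical and only illustrate the reshaping step.

```r
# Hypothetical helper: turn a cBioPortal-style expression table into a
# numeric matrix with genes in rows and samples in columns.
# Assumes annotation columns named Hugo_Symbol and (optionally) Entrez_Gene_Id.
expression_matrix <- function(tab) {
  stopifnot("Hugo_Symbol" %in% names(tab))
  annot <- names(tab) %in% c("Hugo_Symbol", "Entrez_Gene_Id")
  m <- as.matrix(tab[, !annot, drop = FALSE])
  storage.mode(m) <- "numeric"
  rownames(m) <- as.character(tab$Hugo_Symbol)
  m
}

# Toy stand-in for a data file (made-up values, NOT real TCGA data).
toy <- "Hugo_Symbol\tEntrez_Gene_Id\tTCGA-05-4249-01\tTCGA-05-4384-01
GENE_A\t1\t12.5\t0.0
GENE_B\t2\t340.2\t298.7"

# check.names = FALSE keeps the hyphens in the TCGA sample IDs.
tab <- read.delim(text = toy, check.names = FALSE)
m <- expression_matrix(tab)
print(m)
```

For a downloaded file, the `text = toy` argument would be replaced by the path to the file on disk.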

The cBioPortal z-scoring method has been documented here:



Data Scientist & cBioPortal developer


Yes, Sander is correct that DESeq2 requires counts as input. Since this data is already transformed it cannot be used as input to DESeq2. The VST is log2-like.

Reply written 4 months ago by Michael Love
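Before calling varianceStabilizingTransformation, it can help to check whether a matrix looks like raw counts at all. The sketch below is a plain base R check and not part of DESeq2; the helper name `looks_like_counts` and the rule "non-negative whole numbers with no missing values" are assumptions about what "counts" should mean here.

```r
# Hypothetical helper: TRUE only if every value is a non-negative whole number.
looks_like_counts <- function(x) {
  x <- as.matrix(x)
  is.numeric(x) && !anyNA(x) && all(x >= 0) && all(x == round(x))
}

# Values in the style of the sample posted in the question (already transformed).
transformed <- matrix(c(-0.7275, -0.7416, 5.1667, 0.9704), nrow = 2)

# Made-up raw counts for comparison.
counts <- matrix(c(0, 15, 230, 4), nrow = 2)

print(looks_like_counts(transformed))
print(looks_like_counts(counts))
```

If the check fails, the data has most likely been normalized or transformed already, and clamping negatives to 0 or adding a constant would not turn it back into counts.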

Sander, thanks for your response! I went through the files posted on the Datahub but I don't think the information I need is available there (I need raw, un-normalized RNAseq counts if possible). I was able to pull down un-normalized RNAseq counts for only 162 of the LUAD patients using Firehose but it appears the same information is not available for other patients in the LUAD study. Is there a way I could access this data?

Reply written 4 months ago by Ben Morris

Perhaps it's available on , that should contain all the publicly released TCGA data. 

Reply written 4 months ago by Sander Tan