Applying varianceStabilizingTransformation to log transformed RNA seq data from cBioPortal
Ben Morris • 0
@ben-morris-16089
Last seen 3.3 years ago
University of Virginia

I am currently working to analyze RNAseq patient data for lung cancer patients available through cBioPortal/TCGA. More specifically, I want to pass patient RNAseq count data to varianceStabilizingTransformation before doing additional downstream clustering. When I attempt to pass the data to VST, I receive an error stating: "Error in DESeqDataSet(se, design = design, ignoreRank): some values in assay are negative." When looking at the count data from cBioPortal, the data does indeed contain negative values. This makes sense considering that the count data had already been log transformed (and potentially normalized? I haven't been able to verify this in the cBioPortal documentation or in other literature) before I obtained it.

I've read the DESeq2 vignette as well as other related literature and am struggling with how to deal with these negative values. I've seen some researchers simply assign negative values a new value of "0." Others have added a constant value to the entire matrix so that all values are >0. In discussion with other colleagues it has been suggested to simply 'undo' the log transformation prior to passing the count data to VST. How should I handle these negative values prior to passing data to VST? Below is my simple code as well as a sample of my input patient data. Thanks in advance!

Sample Code

library(DESeq2)

analysis <- read.delim("C:/..../analysis.txt")

gene_matrix <- as.matrix.data.frame(analysis)

stabilized <- varianceStabilizingTransformation(gene_matrix, blind = TRUE, fitType = "parametric")

>> Error in DESeqDataSet(se, design = design, ignoreRank): some values in assay are negative.

Patient Data Sample (directly from cBioPortal)

SAMPLE_ID          Gene A    Gene B    Gene C    Gene D    Gene E    Gene F    Gene G
TCGA-05-4249-01   -0.7275   -0.7416   -0.8330    5.1667   -0.7212   -0.2081    0.9704
TCGA-05-4384-01   -0.8908   -0.8282   -1.1507   -0.1649   -0.4065   -0.8573    0.3573

variancestabilizingtransformation rnaseq cBioPortal deseq2 • 1.3k views
Sander Tan ▴ 20
@sander-tan-12882
Last seen 3.3 years ago
The Hyve, Netherlands

Hi Ben,

The data in cBioPortal OncoPrint is RSEM normalized and z-scored, and therefore not compatible with DESeq2. If I remember correctly, DESeq2 requires raw counts as input data. The z-score values can still be used for clustering, though. This can be done in cBioPortal OncoPrint by clicking the "Heatmap" button, then "Add Genes to Heatmap", then "Cluster Heatmap".
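
For clustering the z-scores outside the portal, a minimal base R sketch could look like the following. The matrix values, the sample and gene names, and the choice of Euclidean distance with average linkage are all assumptions made for illustration; this is not necessarily the method the OncoPrint heatmap uses.

```r
# Toy z-score matrix: rows are samples, columns are genes (hypothetical values).
z <- matrix(
  c(-0.7, -0.8,  5.2,
    -0.9, -1.0,  4.8,
     1.1,  0.9, -0.3,
     1.3,  1.0, -0.5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("sample", 1:4), c("geneA", "geneB", "geneC"))
)

# Euclidean distance between samples, then average-linkage clustering.
sample_dist <- dist(z, method = "euclidean")
sample_tree <- hclust(sample_dist, method = "average")

# Cut the tree into two groups of samples.
groups <- cutree(sample_tree, k = 2)
print(groups)
```

With real data, `z` would be the z-score table with one row per patient, and `k` would need to be chosen for the study at hand.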

If you'd like to do this outside of cBioPortal, you could obtain the expression tables from cBioPortal Datahub. For example, this directory contains the Lung Adenocarcinoma TCGA provisional data:

https://github.com/cBioPortal/datahub/tree/master/public/luad_tcga

It contains several files for RNA-Seq expression:

  • data_RNA_Seq_v2_expression_median.txt - RSEM normalized
  • data_RNA_Seq_v2_mRNA_median_Zscores.txt - RSEM normalized, z-score transformed
  • meta_RNA_Seq_v2_expression_median.txt - Meta information about data file
  • meta_RNA_Seq_v2_mRNA_median_Zscores.txt - Meta information about data file

The cBioPortal z-scoring method has been documented here: https://github.com/cBioPortal/cbioportal/blob/master/docs/Z-Score-normalization-script.md
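
To illustrate why z-scored tables contain negative values, here is a small sketch of per-gene z-scoring in base R. The expression values are made up, and using every sample as the reference population is an assumption; the linked document describes which reference set cBioPortal actually uses.

```r
# Hypothetical normalized expression values: rows are genes, columns are samples.
expr <- matrix(
  c(10, 20, 30, 40,
     5,  5,  6,  8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("sample", 1:4))
)

# Per-gene z-score: subtract the gene's mean and divide by its standard deviation.
# Assumption: every sample is used as the reference population.
zscore_row <- function(x) (x - mean(x)) / sd(x)
z <- t(apply(expr, 1, zscore_row))

print(round(z, 2))
# Each gene is now centered on 0, so below-average samples are negative.
print(rowMeans(z))
```

Because the per-gene mean and standard deviation are not stored in the z-score table, the transformation cannot be reversed from the z-scores alone, and setting negatives to 0 or adding a constant does not recover counts either.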

Best,

Sander

Data Scientist & cBioPortal developer


Yes, Sander is correct that DESeq2 requires counts as input. Since this data is already transformed it cannot be used as input to DESeq2. The VST is log2-like.
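
A quick sanity check before calling the VST can catch this kind of input early. The sketch below uses only base R for the check; `is_count_matrix` and `vst_if_counts` are hypothetical helper names, the count values are invented, and the wrapper assumes DESeq2 is installed and that the matrix has genes in rows and samples in columns.

```r
# Returns TRUE when every value is a finite, non-negative whole number.
is_count_matrix <- function(m) {
  is.numeric(m) && all(is.finite(m)) && all(m >= 0) && all(m == round(m))
}

# A few values from the question (z-scores) versus hypothetical raw counts.
zscores <- matrix(c(-0.7275, -0.7416, 5.1667, -0.8908), nrow = 2)
counts  <- matrix(c(0, 12, 530, 7), nrow = 2)

print(is_count_matrix(zscores))  # FALSE
print(is_count_matrix(counts))   # TRUE

# Hypothetical wrapper: only call the VST when the input looks like raw counts.
# DESeq2 expects genes in rows and samples in columns.
vst_if_counts <- function(m) {
  if (!is_count_matrix(m)) stop("input is not a matrix of raw counts")
  DESeq2::varianceStabilizingTransformation(m, blind = TRUE)
}
```

Note that the sample table in the question has patients in rows, so a real count matrix in that layout would need to be transposed with `t()` first.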


Sander, thanks for your response! I went through the files posted on the Datahub but I don't think the information I need is available there (I need raw, un-normalized RNAseq counts if possible). I was able to pull down un-normalized RNAseq counts for only 162 of the LUAD patients using Firehose but it appears the same information is not available for other patients in the LUAD study. Is there a way I could access this data?


Perhaps it's available on https://portal.gdc.cancer.gov, which should contain all the publicly released TCGA data.
