Transformation of raw RSEM gene expression estimates with variance stabilizing transformation and tximport R package
Entering edit mode
svlachavas ▴ 800
Last seen 25 days ago
Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

I would like to ask a very specific question about the transformation of raw RSEM transcript quantification estimates-briefly, I would like to transform the raw expected rsem gene expression estimates I have acquired from TCGA, in order to perform some downstream unsupervised clustering analysis (below an arbitary example):

query.exp.hg19 <- GDCquery(project = "TCGA-GBM",
+                            data.category = "Gene expression",
+                            data.type = "Gene expression quantification",
+                            platform = "Illumina HiSeq",
+                            file.type  = "results",
+                            experimental.strategy = "RNA-Seq",
+                            barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
+                            legacy = TRUE)

exp <- GDCprepare(query.exp.hg19)

class: RangedSummarizedExperiment
dim: 6 2
metadata(1): data_release
assays(2): raw_count scaled_estimate
rownames(6): A1BG|1 A2M|2 ... SERPINA3|12 AADAC|13
rowData names(4): gene_id entrezgene ensembl_gene_id
colnames(2): TCGA-14-0736-02A-01R-2005-01 TCGA-06-0211-02A-02R-2005-01
colData names(106): barcode patient ...

[1] "raw_count"       "scaled_estimate" "transcript_id" 

xx <- assays(exp)

            TCGA-14-0736-02A-01R-2005-01 TCGA-06-0211-02A-02R-2005-01
A1BG|1                            707.94                       753.89
A2M|2                           10361.63                     63407.77
NAT1|9                            146.00                       133.00
NAT2|10                             5.00                         8.00
SERPINA3|12                    183076.00                     84558.00
AADAC|13                            1.00                         1.00

            TCGA-14-0736-02A-01R-2005-01 TCGA-06-0211-02A-02R-2005-01
A1BG|1                      2.454131e-05                 1.977564e-05
A2M|2                       3.043553e-04                 1.148383e-03
NAT1|9                      4.935974e-06                 3.478234e-06
NAT2|10                     2.429018e-07                 2.972748e-07
SERPINA3|12                 7.182465e-03                 2.542586e-03
AADAC|13                    3.870805e-08                 2.965902e-08

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tximport_1.16.1     tximportData_1.16.0

So from the above, the first matrix is the raw RSEM estimates, whereas the second is the "scaled" estimates-

Thus, can RSEM gene expression estimates used directly into the variance stabilizing transformation function (vst) from DESeq2 ? or initially they need to be rounded ?

As alternative, for a more "robust" solution, tximport can be used initially to read the above RSEM counts ? and perform downstream analysis and normalization ? but with which modifications ? in the relative vignette of tximport for RSEM:

txi.rsem <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)

the files are imported separately-moreover, the scaled estimates could be utilized additionally ?

Thank you in advance,


DESeq2 VST rsem tximport • 1.1k views
Entering edit mode
Last seen 9 hours ago
United States

I would recommend the following functions:

  • tximport
  • DESeqDataSetFromTximport
  • vst

This will take care of the gene length as well as sequencing depth and variance stabilization. You can also use the dds object for differential testing.

Entering edit mode

Dear Michael,

thank you for the confirmation regarding the sequential use of the specific functions-however, as my input is different from the example in the tximport vignette regarding RSEM, how you would modify the input:

1) You would use only the raw RSEM estimates (raw_count from above) and not the scaled estimates, correct ?

2) How should I use the files argument? as from above my input is a unique matrix of the RSEM values and not separate files ?

Entering edit mode

If you have a matrix of RSEM values, the way to use the above workflow would be to import counts and lengths and construct a dds object from these. The counts (not scaled) can be rounded to integers and passed to DESeqDataSetFromMatrix. The lengths can be stored like so:

assays(dds)[["avgTxLength"]] <- lengths

which will use the gene lengths for normalization (this is described briefly in the normMatrix argument section of ?estimateSizeFactors).

Entering edit mode

Dear Michael, thank you for your updates and suggestions-two final questions in order not to disturb you further:

1) For the DESeqDataSetFromMatrix function: as counts the rounded RSEM counts, along with the relative phenotype data, but for the design matrix, can I just use ~1 ? As I'm not interested in downstream DE analysis, and just for unsupervised clustering after vst normalization ?

2) Concerning gene lengths-assuming dds is the output of DESeqDataSetFromMatrix, and I only have gene symbols in the rows of the input matrix, is there a direct way to retrieve gene lengths from the gene symbols ? and then use your function assays(dds)[["avgTxLength"]] <- lengths ? before feeding into vst ?

briefly, I was initially considering:

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

[13] "EXONSTRAND" "GENEID"     "TXCHROM"    "TXEND"      "TXID"       "TXNAME"    
[19] "TXSTART"    "TXSTRAND"   "TXTYPE"   


but I could find a way how to utilize the gene symbols...

Entering edit mode

If you don't have the effective length information from RSEM, then just leave off the avgTxLength part.

You are then using the estimated counts from RSEM without the offset for gene length (so then you will not be using the tximport methods at all, just rounded estimated counts).

The utility of gene length information for this step is if you had sample-specific information about gene length, getting the annotation-based lengths doesn't help at this particular step.

Entering edit mode

I see, so it is irrelevant from the annotation-based lengths-thanks for the explanation Michael !!


Login before adding your answer.

Traffic: 398 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6