Dear Bioconductor community,
I would like to ask a very specific question about the transformation of raw RSEM transcript quantification estimates-briefly, I would like to transform the raw expected rsem gene expression estimates I have acquired from TCGA, in order to perform some downstream unsupervised clustering analysis (below an arbitary example):
library(TCGAbiolinks)
query.exp.hg19 <- GDCquery(project = "TCGA-GBM",
+ data.category = "Gene expression",
+ data.type = "Gene expression quantification",
+ platform = "Illumina HiSeq",
+ file.type = "results",
+ experimental.strategy = "RNA-Seq",
+ barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
+ legacy = TRUE)
GDCdownload(query.exp.hg19)
exp <- GDCprepare(query.exp.hg19)
head(exp)
class: RangedSummarizedExperiment
dim: 6 2
metadata(1): data_release
assays(2): raw_count scaled_estimate
rownames(6): A1BG|1 A2M|2 ... SERPINA3|12 AADAC|13
rowData names(4): gene_id entrezgene ensembl_gene_id
transcript_id.transcript_id_TCGA-14-0736-02A-01R-2005-01
colnames(2): TCGA-14-0736-02A-01R-2005-01 TCGA-06-0211-02A-02R-2005-01
colData names(106): barcode patient ...
paper_Telomere.length.estimate.in.blood.normal..Kb.
paper_Telomere.length.estimate.in.tumor..Kb.
head(assay.list)
[1] "raw_count" "scaled_estimate" "transcript_id"
xx <- assays(exp)
head(xx[[1]])
TCGA-14-0736-02A-01R-2005-01 TCGA-06-0211-02A-02R-2005-01
A1BG|1 707.94 753.89
A2M|2 10361.63 63407.77
NAT1|9 146.00 133.00
NAT2|10 5.00 8.00
SERPINA3|12 183076.00 84558.00
AADAC|13 1.00 1.00
head(xx[[2]])
TCGA-14-0736-02A-01R-2005-01 TCGA-06-0211-02A-02R-2005-01
A1BG|1 2.454131e-05 1.977564e-05
A2M|2 3.043553e-04 1.148383e-03
NAT1|9 4.935974e-06 3.478234e-06
NAT2|10 2.429018e-07 2.972748e-07
SERPINA3|12 7.182465e-03 2.542586e-03
AADAC|13 3.870805e-08 2.965902e-08
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tximport_1.16.1 tximportData_1.16.0
So from the above, the first matrix is the raw RSEM estimates, whereas the second is the "scaled" estimates-
Thus, can RSEM gene expression estimates used directly into the variance stabilizing transformation function (vst) from DESeq2 ? or initially they need to be rounded ?
As alternative, for a more "robust" solution, tximport can be used initially to read the above RSEM counts ? and perform downstream analysis and normalization ? but with which modifications ? in the relative vignette of tximport for RSEM:
txi.rsem <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)
the files are imported separately-moreover, the scaled estimates could be utilized additionally ?
Thank you in advance,
Efstathios
Dear Michael,
thank you for the confirmation regarding the sequential use of the specific functions-however, as my input is different from the example in the tximport vignette regarding RSEM, how you would modify the input:
1) You would use only the raw RSEM estimates (raw_count from above) and not the scaled estimates, correct ?
2) How should I use the files argument? as from above my input is a unique matrix of the RSEM values and not separate files ?
If you have a matrix of RSEM values, the way to use the above workflow would be to import counts and lengths and construct a
dds
object from these. The counts (not scaled) can be rounded to integers and passed toDESeqDataSetFromMatrix
. The lengths can be stored like so:which will use the gene lengths for normalization (this is described briefly in the
normMatrix
argument section of?estimateSizeFactors
).Dear Michael, thank you for your updates and suggestions-two final questions in order not to disturb you further:
1) For the DESeqDataSetFromMatrix function: as counts the rounded RSEM counts, along with the relative phenotype data, but for the design matrix, can I just use ~1 ? As I'm not interested in downstream DE analysis, and just for unsupervised clustering after vst normalization ?
2) Concerning gene lengths-assuming dds is the output of DESeqDataSetFromMatrix, and I only have gene symbols in the rows of the input matrix, is there a direct way to retrieve gene lengths from the gene symbols ? and then use your function assays(dds)[["avgTxLength"]] <- lengths ? before feeding into vst ?
briefly, I was initially considering:
but I could find a way how to utilize the gene symbols...
If you don't have the effective length information from RSEM, then just leave off the
avgTxLength
part.You are then using the estimated counts from RSEM without the offset for gene length (so then you will not be using the tximport methods at all, just rounded estimated counts).
The utility of gene length information for this step is if you had sample-specific information about gene length, getting the annotation-based lengths doesn't help at this particular step.
I see, so it is irrelevant from the annotation-based lengths-thanks for the explanation Michael !!