Dear BiocCommunity,
I have imported an RNA-Seq dataset on a specific type of cancer from the R package curatedCRCData. After I contacted the package maintainers, they kindly told me that the data are IlluminaHiSeq_RNASeqV2 Level 3 data, quantified using RSEM (https://wiki.nci.nih.gov/display/tcga/rnaseq+version+2). When I load the dataset from the package:
library(curatedCRCData)
data(TCGA.RNASeqV2_eset)
TCGA.RNASeqV2_eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 20502 features, 195 samples
element names: exprs
protocolData: none
phenoData
sampleNames: TCGA.AA.3662 TCGA.A6.4105 ... TCGA.A6.6652 (195 total)
varLabels: unique_patient_ID alt_sample_name ... uncurated_author_metadata (59 total)
varMetadata: labelDescription
featureData
featureNames: ? A1BG ... ZZZ3 (20502 total)
fvarLabels: probeset gene
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 22810696
Annotation: NA
class(TCGA.RNASeqV2_eset)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
head(exprs(TCGA.RNASeqV2_eset)) # a small output
TCGA.AA.3662 TCGA.A6.4105 TCGA.F4.6463 TCGA.F4.6806 TCGA.A6.6650 TCGA.AZ.6600
? 9.282712 9.933779 10.0443941 9.910121 10.088809 9.875006 9.893715
A1BG 6.027692 4.707000 3.6559242 5.592373 3.253914 5.622860 3.563683
A1CF 8.273718 7.445153 7.0927571 7.118704 7.855290 8.577953 7.774680
range(exprs(TCGA.RNASeqV2_eset))
[1] 0.00000 20.34961
My main questions are the following:
From searching other posts and papers, I understand that RSEM does not provide "essentially raw counts" but estimated counts, which are not rounded. Thus:
- Can these data be used for downstream RNA-Seq analysis with packages such as edgeR? Or, because the counts have to be integers, are other methods/packages more suitable for a simple differential expression analysis (for instance, a two-group comparison)?
- Secondly, judging from the values above, the counts also seem to be normalized or transformed in some way (perhaps they are the output of "rsem.genes.normalized_results" described on the NCI wiki page linked above). Unfortunately, I could not find further information about this Level 3 transformation (i.e., how the RSEM counts are normalized). Would a proper normalization method (like TMM) still be necessary for any downstream analysis?
Please excuse any naive questions, but I am relatively new to RNA-Seq, and at this point any suggestions or opinions would be very welcome!
You can round them and use them in DESeq2, but that isn't recommended; people normally use a tool that can cope with the uncertainty of estimated read counts, such as EBSeq.
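For reference, the rounding route mentioned above might look roughly like the sketch below. It runs on simulated data, not the TCGA values: the matrix `expected_counts`, the gene and sample names, and the two-group layout are all made up for illustration, and it assumes the input is untransformed RSEM expected counts (not log-scale or otherwise normalized values).

```r
library(DESeq2)

set.seed(42)
n_genes <- 500

# Hypothetical stand-in for RSEM expected counts: non-integer, non-negative values
mu <- 2^runif(n_genes, 4, 11)
expected_counts <- matrix(
  rnbinom(n_genes * 6, mu = mu, size = 5) + runif(n_genes * 6),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("sample", 1:6))
)

# Made-up two-group design, three samples per group
coldata <- data.frame(
  group = factor(rep(c("A", "B"), each = 3)),
  row.names = colnames(expected_counts)
)

# DESeq2 expects integer counts, so round the estimates first
rounded <- round(expected_counts)
storage.mode(rounded) <- "integer"

dds <- DESeqDataSetFromMatrix(countData = rounded, colData = coldata, design = ~ group)
dds <- DESeq(dds)
res <- results(dds)   # group B vs group A
head(res[order(res$pvalue), ])
```

Rounding discards the fractional part of each estimate and ignores the estimation uncertainty, which is the reason given above for preferring other tools.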
The counts do not need to be rounded for use in edgeR. edgeR has supported fractional counts for a number of years.
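As a minimal sketch of what this could look like for a two-group comparison, the example below passes non-integer counts straight to edgeR and applies TMM normalization. The data are simulated; the matrix `expected_counts`, the gene and sample names, and the grouping are hypothetical, and the sketch assumes untransformed expected counts as input (values that are already log-transformed or normalized would not be appropriate here).

```r
library(edgeR)

set.seed(1)
n_genes <- 1000

# Hypothetical stand-in for RSEM expected counts (fractional values, no rounding)
mu <- 2^runif(n_genes, 4, 11)
expected_counts <- matrix(
  rnbinom(n_genes * 6, mu = mu, size = 5) + runif(n_genes * 6),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("sample", 1:6))
)
group <- factor(rep(c("A", "B"), each = 3))   # made-up two-group layout

# Fractional counts are passed as they are
y <- DGEList(counts = expected_counts, group = group)

# Drop genes with too little signal, then compute TMM normalization factors
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)   # TMM is the default method

# Quasi-likelihood pipeline for the B vs A comparison
design <- model.matrix(~ group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = 2)
tab <- topTags(res)$table
tab
```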
Good to know
Chris, thank you for your answer! I have also seen EBSeq. Unfortunately, there does not seem to be a clear answer to this specific problem, so I hope this post brings in suggestions that cover all the possible options.
http://deweylab.biostat.wisc.edu/rsem/README.html. Colin, who developed RSEM, recommends using EBSeq. That is the answer to this specific problem.
We always used to use rounded counts in DESeq; it used to upset the statistics guy, but we ignored him.