Search
Question: Possible ways of performing differential gene expression and analysis of RNA-Seq data from normalized counts of RSEM software
0
gravatar for svlachavas
11 weeks ago by
svlachavas430
Greece/Athens/National Hellenic Research Foundation
svlachavas430 wrote:

Dear BiocCommunity,

i have imported from the R package curatedCRCdata, an RNA-Seq dataset regarding a specific type of cancer. In detail, after contacting the maintainers of the package, they kindely provided the information, that the data are IlluminaHiSeq_RNASeqV2 Level 3 data, which were quantified using RSEM (https://wiki.nci.nih.gov/display/tcga/rnaseq+version+2). When I load the dataset from the package:

library(curatedCRCData)

data(TCGA.RNASeqV2_eset)

TCGA.RNASeqV2_eset

ExpressionSet (storageMode: lockedEnvironment)

assayData: 20502 features, 195 samples

  element names: exprs

protocolData: none

phenoData

  sampleNames: TCGA.AA.3662 TCGA.A6.4105 ... TCGA.A6.6652 (195 total)

  varLabels: unique_patient_ID alt_sample_name ... uncurated_author_metadata (59 total)

  varMetadata: labelDescription

featureData

  featureNames: ? A1BG ... ZZZ3 (20502 total)

  fvarLabels: probeset gene

  fvarMetadata: labelDescription

experimentData: use 'experimentData(object)'

  pubMedIds: 22810696

Annotation: NA

class(TCGA.RNASeqV2_eset)

[1] "ExpressionSet"

attr(,"package")

[1] "Biobase"

head(exprs(TCGA.RNASeqV2_eset)) # a small output

         TCGA.AA.3662 TCGA.A6.4105 TCGA.F4.6463 TCGA.F4.6806 TCGA.A6.6650 TCGA.AZ.6600

?         9.282712     9.933779   10.0443941     9.910121    10.088809     9.875006     9.893715

A1BG      6.027692     4.707000    3.6559242     5.592373     3.253914     5.622860     3.563683

A1CF      8.273718     7.445153    7.0927571     7.118704     7.855290     8.577953     7.774680

range(exprs(TCGA.RNASeqV2_eset))

[1]  0.00000 20.34961

My main questions are the following:

Because from a relative search in other posts/papers, the RSEM does not provide "essentially raw counts", but estimated counts (which are also not rounded). Thus:

  1. Is it possible to use for downstream analysis for RNA-Seq data like edgeR ? Or because the counts have e to be integer, other methodologies/packages are eligible for a simple differential expression analysis (for instance a two-group comparison) ?
  2. Secondly, also from the values above, it seems that the  counts are also somehow normalized or transformed (perhaps it is the output of "rsem.genes.normalized_results" in the above link of wiki NCI). Unfortunately, i could not find further information about the above level 3 transformation (for how the rsem counts are normalized).  Again, a proper normalization methodology (like TMM) should be necessary for any downstream analysis ?

 

Please excuse me for any naive questions, but im relatively new to RNA-Seq and at this point any suggestions or opinions would be essential !!

 

ADD COMMENTlink modified 11 weeks ago by Gordon Smyth30k • written 11 weeks ago by svlachavas430

You can round them and use in DESEQ2, but that isn't recommended and people normally use a specific tool that can cope with the uncertainty of read count estimates like EBSEQ.

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by chris86260
2

The counts do not need to be rounded for use in edgeR. edgeR has supported fractional counts for a number of years.

ADD REPLYlink written 11 weeks ago by Gordon Smyth30k

Good to know

ADD REPLYlink written 11 weeks ago by chris86260

Chris thank you for your answer !! i have seen also the EBSEQ !! unfortunately, there is not a clear answer to this specific problem, so i hope with this post i get any suggestions to take all the possible options into account !!

 

ADD REPLYlink written 11 weeks ago by svlachavas430
1

http://deweylab.biostat.wisc.edu/rsem/README.html. Colin who developed RSEM recommends using EBSEQ. That is the answer for this specific problem.

We always used to use rounded counts in DESEQ, it used to upset the statistics guy, but we ignored him.

ADD REPLYlink written 11 weeks ago by chris86260
3
gravatar for Michael Love
11 weeks ago by
Michael Love11k
United States
Michael Love11k wrote:

I support the use of rounded estimated counts from software like RSEM (also Sailfish, Salmon, kallisto), and we have a pipeline built for this using the Bioconductor package tximport. See the software vignette which has examples for all of these software including RSEM's *.genes.results file, and downstream example use of edgeR, DESeq2, and limma.

What you should never do is use *normalized* counts with the count-based tools.

Yes, another question is what you have in that matrix. Those do not look like estimated counts, because they max out at 20.3.

Have you looked at the relevant man page:

?TCGA.RNASeqV2_eset
ADD COMMENTlink modified 11 weeks ago • written 11 weeks ago by Michael Love11k

Is it possible that the data is time to survival in years?

Details
assayData: 20502 features, 195 samples
Platform type: NA
Overall survival time-to-event summary (in years):
Call: survfit(formula = Surv(time, cens) ~ -1)
174 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
21.000 21.000 21.000 18.000 1.208 0.715 5.605
ADD REPLYlink written 11 weeks ago by Michael Love11k

Dear Michael,

thank you for your useful answer and recommendations !! Indeed the numbers in the matrix have confused me a lot, but from the initial contact with the maintainers of the package, they mentioned that these are level 3 TCGA RNA-Seq data from RSEM. I have also checked the following links

https://bioconductor.org/packages/release/data/experiment/vignettes/curatedCRCData/inst/doc/curatedCRCData_vignette.pdf

& also https://bioconductor.org/packages/release/data/experiment/manuals/curatedCRCData/man/curatedCRCData.pdf

the information you mentioned above about survival, is also mentioned in another datasets. My main question here, is that is possible that these are indeed estimated counts from RSEM, but they have undergone some normalization tranformation ? like the part 3 of the IlluminaHiSeq_RNASeqV2 section in https://wiki.nci.nih.gov/display/tcga/rnaseq+version+2 ? That is the rsem.genes.normalized_results ? so that they have this range of values ?

 

ADD REPLYlink written 11 weeks ago by svlachavas430
1

I have no idea. 20.3 as the max value isn't really compatible with them being normalized estimated counts either.

ADD REPLYlink written 11 weeks ago by Michael Love11k
1

Maybe they're log2(expected_counts + 1)

@svlachavas: while the max is 20.3, is the min == 0?

If this is the case, we should be able to get back the expected counts by 2**vals - 1, yes?

I haven't check this package, so not sure, but I recently tried to download some data (TCGA and whatever else) from UCSC's toil/xena doodad (specifically the "TOIL Gene RSEM expected_count" file from that page) and managed to convince myself that what's in the datafile are log2(expected_counts + 1)

Maybe that's similar, here?

Just a guess.

ADD REPLYlink written 11 weeks ago by Steve Lianoglou12k
1

Dear Steve,

thank you for your comment and idea--yes the min is zerro:

range(exprs(TCGA.RNASeqV2_eset))

[1]  0.00000 20.34961

However, i don't know if the below log2 transformation [if the maximum value is greater than 50] could be reversed with your suggestion-perhaps it is different from the general log2(expected_counts +1) you mentioned above ?

ADD REPLYlink written 11 weeks ago by svlachavas430
2

The minimum value in the original files is zero, so you can recover the original values with 2^exprs(TCGA.RNASeqV2_eset) - 1. In hindsight, I would have left these values as they were! Here is a DESCRIPTION.txt file that comes with the original files from the GDC legacy archive, describing the analysis procedure for level 3 TCGA RNASeqV2 data.

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Levi Waldron300
1

Dear Levi, thank you again for your valuable comments and confirmation of Steves suggestion. 3 final comments to summarize in order to be certain and close this topic:

1) i checked the description file above, as also the original paper publication. Thus, by the above tranformation, i get the "raw estimated gene counts" from RSEM, correct (as also i saw that the rows are already annotated with unique gene symbols) ? based also on the link [https://wiki.nci.nih.gov/display/tcga/rnaseq+version+2]

2) My other important question is about the phenotype information for downstream analysis:

> table(pData(TCGA.RNASeqV2_eset)$sample_type)

adjacentnormal          tumor 

            14                  181 

Thus, i suppose (and if you have any knowledge for this), because the number of normal samples is far more less than tumors, the term "adjucentnormal" means that these 14 match specific tumor samples from the 181 ? and you could only retreive information about these 14 normal samples-in other words only these being available"?

My notion for this question is for downstream analysis, like differential gene expression, if i should probably "ignore" the adjucent meaning to directly compare tumor vs normal samples as distinct groups ? Or this in your opinion is biased ? And i could inspect other between tumors comparisons, such as msi or location status, which is irrelevant from normal samples ?

3) Finally, the unique gene symbols in the rows of the eSet was a process independent of the above, and it is probably described in the blast_gene_maps.R script as the annotation methodology for keeping unique gene symbols, right ?

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by svlachavas430
1

1) I believe so :D. If you want to be absolutely sure, download the level-III data for one of the samples, and check that this transformation recapitulates the original values. Although, what is currently available in the GDC may have been processed on a more recent genome so it may not be exact.

2) I don't think these normals are matched to any tumors in the dataset, although you can check with the "unique_patient_ID" column of the pData. If you do an unsupervised clustering of the samples, I think you'll find that the normals cluster closely together (except maybe one outlier here), so pairing wouldn't be important anyways.

ADD REPLYlink written 11 weeks ago by Levi Waldron300

Dear Levi, thank you one more time for your updated answers. From the following link (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/) perhaps the reference genome is the one mentioned GRCh38-Nevertheless, it probably best checking an individual sample, but again the difference with more recent genome could hamper any comparisons. Thus, i would probably perform the above transformation and use these values as raw estimated counts. Btw, for the downloading (i suppose free) you mean the actual GDC repository ?

ADD REPLYlink written 11 weeks ago by svlachavas430

Just use the values or 2^values-1 from curatedCRCData :D. They are, by the way, very highly correlated to the TCGA microarray measurements also in the package, so I am pretty confident in them.

ADD REPLYlink written 11 weeks ago by Levi Waldron300

Oh and I missed 3) - the TCGA annotation map came directly from the level 3 data, in the script download_TCGA-RNASeqV2.sh. We only used BLAST for some old platforms without a recent annotation file available.

ADD REPLYlink written 11 weeks ago by Levi Waldron300
3
gravatar for Gordon Smyth
11 weeks ago by
Gordon Smyth30k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth30k wrote:

edgeR and limma can both accept RSEM expected counts, and the counts do not need to be rounded to an integer. Both edgeR and limma work with fractional counts.

It is also possible to analyze the (approximately) log2(count+1) values directly in limma, without converting back to counts. You would normalize these with normalizeQuantiles and then use the eBayes(trend=TRUE) pipeline (without voom). Here is an example run to illustrate this simple approach. Here I compare tumour to adjacent normal samples. I have kept genes that have about 10 counts or more in at least 14 samples:

> library(curatedCRCData)
> data(TCGA.RNASeqV2_eset)
> targets <- pData(TCGA.RNASeqV2_eset)
> y <- normalizeQuantiles(exprs(TCGA.RNASeqV2_eset))
> type <- factor(targets$sample_type)
> table(type)
type
adjacentnormal          tumor 
            14            181 
> keep <- rowSums(y > log2(11)) >= 14
> table(keep)
keep
FALSE  TRUE 
 4209 16293 
> y2 <- y[keep,]
> design <- model.matrix(~type)
> fit <- lmFit(y2,design)
> fit <- eBayes(fit,robust=TRUE,trend=TRUE)
> topTable(fit,coef=2)
        logFC AveExpr     t  P.Value adj.P.Val     B
CDH3     6.56  10.294  25.2 1.24e-63  2.02e-59 133.7
KRT80    7.48   8.973  24.0 2.08e-60  1.70e-56 126.5
OTOP2   -9.11   1.504 -22.7 6.43e-57  3.49e-53 118.6
ETV4     6.06  10.469  21.6 1.06e-53  4.31e-50 111.3
ESM1     5.97   6.688  19.2 7.10e-47  2.31e-43  95.9
MTHFD1L  2.65   9.659  17.8 6.66e-43  1.81e-39  86.9
LGI1    -4.51   0.957 -17.8 9.26e-43  2.15e-39  86.5
KRT24   -5.90   0.817 -17.5 5.43e-42  1.11e-38  84.8
GRIN2D   6.07   8.724  17.2 4.46e-41  8.07e-38  82.7
JUB      3.24   8.707  16.9 3.15e-40  5.13e-37  80.8

I note that

plotMDS(y2, label=type)

shows a perfect separation of tumour from normal samples. Also

plotSA(fit)

shows the mean-variance relationship typical of RNA-seq data.

ADD COMMENTlink modified 10 weeks ago • written 11 weeks ago by Gordon Smyth30k

Dear Gordon, thank you for your consideration and suggestions on this matter !! Regarding the second part of your answer, you mean to feed directly the above transformed log2(counts+1) in limma -like the microarrays pipeline (limFit, eBayes) without any other transformations/normalization from the RNA-Seq pipeline, like DGEList and cpm, correct ? in order to perform Differential gene expression with the RNA-Seq data similar to microarrays ?

Moreover, if i would like to perform a small intensity filtering to reduce the features prior statistical testing, which could be the case here ? with the above transformed values ?

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by svlachavas430
1

Yes, you would still need to normalize, particularly for library size, and to filter non-expressed genes. Quantile normalization would equalize the library sizes as well as other things and I have edited my answer to mention this. I don't suggest this is better than converting back to counts, just another possibility.

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Gordon Smyth30k

Dear Gordon, thank you again for your updated recommendations !! so the above are two available methodologies to use/test--just a small comment regarding the direct approach with the log2 transformed counts :

xx <- TCGA.RNASeqV2_eset

set <- exprs(xx)

set.norm <- normalizeQuantiles(set)

range(set.norm)
[1]  0.0000 17.5553

Also i forward the links of a quick boxplot & a density plot with plotDensity function:

https://www.dropbox.com/s/mor4qmvnbf50xej/boxplot.norm.png?dl=0

https://www.dropbox.com/s/5squxrtv3v2ltqk/densityPlot.png?dl=0

Thus, after an initial ispection of the normality of data-and without other transformation like voom that you mentioned--, my main issue is how i should perform the filtering for non-expressed genes? i mean, like normalized microarray data ? or i should be concerned about the specific range of the values of the data and different technology, and select a specific cutoff ?

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by svlachavas430
1

You can filter genes using the same sort of common sense considerations that we always use for differential expression analyses. I have edited my answer to illustrate this.

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Gordon Smyth30k

Thank you one more time for your updated example !! two quick comments to add:

1) you mentioned in the filtering part 

keep <- rowSums(y > log2(11)) >= 14

but you used log2(11) instead of y>11 ? or it is just a typo ?

2) a more statistical question but vital:

[due to one very interesting comparison that i would like to perform is a direct comparison between microsatelite stable and ustable tumors(the information is present in a subset of the samples)]:

ind <- sampleNames(TCGA.RNASeqV2_eset)[TCGA.RNASeqV2_eset$msi%in%c("MSS","MSI")& TCGA.RNASeqV2_eset$sample_type%in%"tumor"]

set.2 <- TCGA.RNASeqV2_eset[,ind]

dim(set.2)

Features  Samples

   20502       38

> table(set.2$msi)

MSI MSS 
  8  30 

targets <- pData(set.2)

msi <- factor(targets$msi)

gender <- factor(targets$gender)

table(gender)
gender
 f  m 
22 16 

So in this point, my crusial question is the following: because gender is an important confounder (especially in the microsatelite status, as it is known in colorectal cancer biology), the most simple and robust approach is to include this factor in my design matrix ? like the following ?

design <- model.matrix(~msi + gender) ?

and more generally i could even add a continuous confounder, like age correct ?

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by svlachavas430
1

No, there is no typo. The y-values are approximately log2(count+1) so, if I want the counts to be >10, then I need y > log2(10+1).

Obviously you can do other analyses for MSI status and so on, but a complete analysis is your job, not mine.

ADD REPLYlink written 10 weeks ago by Gordon Smyth30k

Oh my mistake for forgetting the initial transformation regarding the filtering.

Also please excuse me for the possible misunderstanding-not intented to ask for other kind of help or irrelevant suggestions of this topic-just asked it for a quick confirmation about any similar example ( i just mentioned msi status in this example) and case like this, when needed to adjust for specific confounders. Overall, thank you for your total help & consideration on this matter !!

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by svlachavas430

Dear Gordon,

please excuse me to return after a rather long period-however, based on some search to verify the pre-processing details of this dataset, it seems that the estimated counts are probably not log2(raw_estimated_counts +1),

but log2(normalized_results +1)

From a small search, these normalized results for level 3 TCGA, are described: "normalized_results contain the raw counts estimated by RSEM, but on a scaled version-that is, the values are divided by the 75-percentile & multiplied by 1000". Thus, in your opinion as an expert, as this transformation could make the values/counts more comparable between samples/experiments, could this hamper the approach you described above ? that is, after the log2 transform, a further normalization with normalizeQuantiles could be redundant ? 

Or my notion is incorrect and this should not affect the analysis above, started with normalizeQuantiles ? And anyway to take into account at least the library size, normalizeQuantiles is essential ?

(*I would like to initially apologize for any possibly naive questions, but this matter is rather complicated without asking a feedback about this putative issue).

ADD REPLYlink written 6 weeks ago by svlachavas430
1

I would still use quantile normalization myself.

Given the somewhat arbitrary way in which the expression values have been processed, this makes me more strongly recommend the limma approach rather than trying to convert back to actual counts for edgeR or DESeq2.

ADD REPLYlink written 6 weeks ago by Gordon Smyth30k
1
gravatar for Levi Waldron
11 weeks ago by
Levi Waldron300
CUNY School of Public Health, New York, NY
Levi Waldron300 wrote:

I can offer a couple things -

1) the survival data are in pData(TCGA.RNASeqV2_eset)[, c("days_to_death", "vital_status")] not in the exprs() - apologies if the man page makes that unclear. 

2) the level-III TCGA RNA-seq data are provided as-is, BUT are log2-transformed if the maximum value is greater than 50 (from attach_PROCESSED.R in the pipeline):

if(max(exprs(eset),na.rm=TRUE) > 50){
  ##Do log-transformation
  if(min(exprs(eset),na.rm=TRUE) <= 0){
    exprs(eset) <- exprs(eset) - min(exprs(eset),na.rm=TRUE) + 1
  }
  exprs(eset) <- log2(exprs(eset))
}

FYI the script used to download TCGA RNA-seq data is here (see download_TCGA-RNASeqV2-rectum.sh and download_TCGA-RNASeqV2.sh). These download URLs are broken now that TCGA data have been moved to the GDC, but it they do show the original filenames if you'd like to check them out from GDC:

  • unc.edu_COAD.IlluminaHiSeq_RNASeqV2.Level_3.1.4.0.tar.gz
  • unc.edu_READ.IlluminaHiSeq_RNASeqV2.Level_3.1.5.0.tar.gz
ADD COMMENTlink written 11 weeks ago by Levi Waldron300
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 229 users visited in the last hour