Possible ways of performing differential gene expression and analysis of RNA-Seq data from normalized counts of RSEM software
3
0
Entering edit mode
svlachavas ▴ 840
@svlachavas-7225
Last seen 15 months ago
Germany/Heidelberg/German Cancer Resear…

Dear BiocCommunity,

i have imported from the R package curatedCRCdata, an RNA-Seq dataset regarding a specific type of cancer. In detail, after contacting the maintainers of the package, they kindely provided the information, that the data are IlluminaHiSeq_RNASeqV2 Level 3 data, which were quantified using RSEM (https://wiki.nci.nih.gov/display/tcga/rnaseq+version+2). When I load the dataset from the package:

library(curatedCRCData)

data(TCGA.RNASeqV2_eset)

TCGA.RNASeqV2_eset

ExpressionSet (storageMode: lockedEnvironment)

assayData: 20502 features, 195 samples

  element names: exprs

protocolData: none

phenoData

  sampleNames: TCGA.AA.3662 TCGA.A6.4105 ... TCGA.A6.6652 (195 total)

  varLabels: unique_patient_ID alt_sample_name ... uncurated_author_metadata (59 total)

  varMetadata: labelDescription

featureData

  featureNames: ? A1BG ... ZZZ3 (20502 total)

  fvarLabels: probeset gene

  fvarMetadata: labelDescription

experimentData: use 'experimentData(object)'

  pubMedIds: 22810696

Annotation: NA

class(TCGA.RNASeqV2_eset)

[1] "ExpressionSet"

attr(,"package")

[1] "Biobase"

head(exprs(TCGA.RNASeqV2_eset)) # a small output

         TCGA.AA.3662 TCGA.A6.4105 TCGA.F4.6463 TCGA.F4.6806 TCGA.A6.6650 TCGA.AZ.6600

?         9.282712     9.933779   10.0443941     9.910121    10.088809     9.875006     9.893715

A1BG      6.027692     4.707000    3.6559242     5.592373     3.253914     5.622860     3.563683

A1CF      8.273718     7.445153    7.0927571     7.118704     7.855290     8.577953     7.774680

range(exprs(TCGA.RNASeqV2_eset))

[1]  0.00000 20.34961

My main questions are the following:

Because from a relative search in other posts/papers, the RSEM does not provide "essentially raw counts", but estimated counts (which are also not rounded). Thus:

  1. Is it possible to use for downstream analysis for RNA-Seq data like edgeR ? Or because the counts have e to be integer, other methodologies/packages are eligible for a simple differential expression analysis (for instance a two-group comparison) ?
  2. Secondly, also from the values above, it seems that the  counts are also somehow normalized or transformed (perhaps it is the output of "rsem.genes.normalized_results" in the above link of wiki NCI). Unfortunately, i could not find further information about the above level 3 transformation (for how the rsem counts are normalized).  Again, a proper normalization methodology (like TMM) should be necessary for any downstream analysis ?

 

Please excuse me for any naive questions, but im relatively new to RNA-Seq and at this point any suggestions or opinions would be essential !!

 

rnaseq RSEM TCGA edgeR normalization • 12k views
ADD COMMENT
0
Entering edit mode

You can round them and use in DESEQ2, but that isn't recommended and people normally use a specific tool that can cope with the uncertainty of read count estimates like EBSEQ.

ADD REPLY
2
Entering edit mode

The counts do not need to be rounded for use in edgeR. edgeR has supported fractional counts for a number of years.

ADD REPLY
0
Entering edit mode

Good to know

ADD REPLY
0
Entering edit mode

Chris thank you for your answer !! i have seen also the EBSEQ !! unfortunately, there is not a clear answer to this specific problem, so i hope with this post i get any suggestions to take all the possible options into account !!

 

ADD REPLY
1
Entering edit mode

http://deweylab.biostat.wisc.edu/rsem/README.html. Colin who developed RSEM recommends using EBSEQ. That is the answer for this specific problem.

We always used to use rounded counts in DESEQ, it used to upset the statistics guy, but we ignored him.

ADD REPLY
3
Entering edit mode
@mikelove
Last seen 11 hours ago
United States

I support the use of rounded estimated counts from software like RSEM (also Sailfish, Salmon, kallisto), and we have a pipeline built for this using the Bioconductor package tximport. See the software vignette which has examples for all of these software including RSEM's *.genes.results file, and downstream example use of edgeR, DESeq2, and limma.

What you should never do is use *normalized* counts with the count-based tools.

Yes, another question is what you have in that matrix. Those do not look like estimated counts, because they max out at 20.3.

Have you looked at the relevant man page:

?TCGA.RNASeqV2_eset
ADD COMMENT
0
Entering edit mode

Is it possible that the data is time to survival in years?

Details
assayData: 20502 features, 195 samples
Platform type: NA
Overall survival time-to-event summary (in years):
Call: survfit(formula = Surv(time, cens) ~ -1)
174 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
21.000 21.000 21.000 18.000 1.208 0.715 5.605
ADD REPLY
0
Entering edit mode

Dear Michael,

thank you for your useful answer and recommendations !! Indeed the numbers in the matrix have confused me a lot, but from the initial contact with the maintainers of the package, they mentioned that these are level 3 TCGA RNA-Seq data from RSEM. I have also checked the following links

https://bioconductor.org/packages/release/data/experiment/vignettes/curatedCRCData/inst/doc/curatedCRCData_vignette.pdf

& also https://bioconductor.org/packages/release/data/experiment/manuals/curatedCRCData/man/curatedCRCData.pdf

the information you mentioned above about survival, is also mentioned in another datasets. My main question here, is that is possible that these are indeed estimated counts from RSEM, but they have undergone some normalization tranformation ? like the part 3 of the IlluminaHiSeq_RNASeqV2 section in https://wiki.nci.nih.gov/display/tcga/rnaseq+version+2 ? That is the rsem.genes.normalized_results ? so that they have this range of values ?

 

ADD REPLY
1
Entering edit mode

I have no idea. 20.3 as the max value isn't really compatible with them being normalized estimated counts either.

ADD REPLY
1
Entering edit mode

Maybe they're log2(expected_counts + 1)

@svlachavas: while the max is 20.3, is the min == 0?

If this is the case, we should be able to get back the expected counts by 2**vals - 1, yes?

I haven't check this package, so not sure, but I recently tried to download some data (TCGA and whatever else) from UCSC's toil/xena doodad (specifically the "TOIL Gene RSEM expected_count" file from that page) and managed to convince myself that what's in the datafile are log2(expected_counts + 1)

Maybe that's similar, here?

Just a guess.

ADD REPLY
1
Entering edit mode

Dear Steve,

thank you for your comment and idea--yes the min is zerro:

range(exprs(TCGA.RNASeqV2_eset))

[1]  0.00000 20.34961

However, i don't know if the below log2 transformation [if the maximum value is greater than 50] could be reversed with your suggestion-perhaps it is different from the general log2(expected_counts +1) you mentioned above ?

ADD REPLY
2
Entering edit mode

The minimum value in the original files is zero, so you can recover the original values with 2^exprs(TCGA.RNASeqV2_eset) - 1. In hindsight, I would have left these values as they were! Here is a DESCRIPTION.txt file that comes with the original files from the GDC legacy archive, describing the analysis procedure for level 3 TCGA RNASeqV2 data.

ADD REPLY
1
Entering edit mode

Dear Levi, thank you again for your valuable comments and confirmation of Steves suggestion. 3 final comments to summarize in order to be certain and close this topic:

1) i checked the description file above, as also the original paper publication. Thus, by the above tranformation, i get the "raw estimated gene counts" from RSEM, correct (as also i saw that the rows are already annotated with unique gene symbols) ? based also on the link [https://wiki.nci.nih.gov/display/tcga/rnaseq+version+2]

2) My other important question is about the phenotype information for downstream analysis:

> table(pData(TCGA.RNASeqV2_eset)$sample_type)

adjacentnormal          tumor 

            14                  181 

Thus, i suppose (and if you have any knowledge for this), because the number of normal samples is far more less than tumors, the term "adjucentnormal" means that these 14 match specific tumor samples from the 181 ? and you could only retreive information about these 14 normal samples-in other words only these being available"?

My notion for this question is for downstream analysis, like differential gene expression, if i should probably "ignore" the adjucent meaning to directly compare tumor vs normal samples as distinct groups ? Or this in your opinion is biased ? And i could inspect other between tumors comparisons, such as msi or location status, which is irrelevant from normal samples ?

3) Finally, the unique gene symbols in the rows of the eSet was a process independent of the above, and it is probably described in the blast_gene_maps.R script as the annotation methodology for keeping unique gene symbols, right ?

ADD REPLY
1
Entering edit mode

1) I believe so :D. If you want to be absolutely sure, download the level-III data for one of the samples, and check that this transformation recapitulates the original values. Although, what is currently available in the GDC may have been processed on a more recent genome so it may not be exact.

2) I don't think these normals are matched to any tumors in the dataset, although you can check with the "unique_patient_ID" column of the pData. If you do an unsupervised clustering of the samples, I think you'll find that the normals cluster closely together (except maybe one outlier here), so pairing wouldn't be important anyways.

ADD REPLY
0
Entering edit mode

Dear Levi, thank you one more time for your updated answers. From the following link (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/) perhaps the reference genome is the one mentioned GRCh38-Nevertheless, it probably best checking an individual sample, but again the difference with more recent genome could hamper any comparisons. Thus, i would probably perform the above transformation and use these values as raw estimated counts. Btw, for the downloading (i suppose free) you mean the actual GDC repository ?

ADD REPLY
0
Entering edit mode

Just use the values or 2^values-1 from curatedCRCData :D. They are, by the way, very highly correlated to the TCGA microarray measurements also in the package, so I am pretty confident in them.

ADD REPLY
0
Entering edit mode

Oh and I missed 3) - the TCGA annotation map came directly from the level 3 data, in the script download_TCGA-RNASeqV2.sh. We only used BLAST for some old platforms without a recent annotation file available.

ADD REPLY
6
Entering edit mode
@gordon-smyth
Last seen 49 minutes ago
WEHI, Melbourne, Australia

edgeR and limma can both accept RSEM expected counts, and the counts do not need to be rounded to an integer. Both edgeR and limma work with fractional counts.

It is also possible to analyze the (approximately) log2(count+1) values directly in limma, without converting back to counts. You would normalize these with normalizeQuantiles and then use the eBayes(trend=TRUE) pipeline (without voom). Here is an example run to illustrate this simple approach. Here I compare tumour to adjacent normal samples. I have kept genes that have about 10 counts or more in at least 14 samples:

> library(curatedCRCData)
> data(TCGA.RNASeqV2_eset)
> targets <- pData(TCGA.RNASeqV2_eset)
> y <- normalizeQuantiles(exprs(TCGA.RNASeqV2_eset))
> type <- factor(targets$sample_type)
> table(type)
type
adjacentnormal          tumor 
            14            181 
> keep <- rowSums(y > log2(11)) >= 14
> table(keep)
keep
FALSE  TRUE 
 4209 16293 
> y2 <- y[keep,]
> design <- model.matrix(~type)
> fit <- lmFit(y2,design)
> fit <- eBayes(fit,robust=TRUE,trend=TRUE)
> topTable(fit,coef=2)
        logFC AveExpr     t  P.Value adj.P.Val     B
CDH3     6.56  10.294  25.2 1.24e-63  2.02e-59 133.7
KRT80    7.48   8.973  24.0 2.08e-60  1.70e-56 126.5
OTOP2   -9.11   1.504 -22.7 6.43e-57  3.49e-53 118.6
ETV4     6.06  10.469  21.6 1.06e-53  4.31e-50 111.3
ESM1     5.97   6.688  19.2 7.10e-47  2.31e-43  95.9
MTHFD1L  2.65   9.659  17.8 6.66e-43  1.81e-39  86.9
LGI1    -4.51   0.957 -17.8 9.26e-43  2.15e-39  86.5
KRT24   -5.90   0.817 -17.5 5.43e-42  1.11e-38  84.8
GRIN2D   6.07   8.724  17.2 4.46e-41  8.07e-38  82.7
JUB      3.24   8.707  16.9 3.15e-40  5.13e-37  80.8

I note that

plotMDS(y2, label=type)

shows a perfect separation of tumour from normal samples. Also

plotSA(fit)

shows the mean-variance relationship typical of RNA-seq data.

ADD COMMENT
0
Entering edit mode

Dear Gordon, thank you for your consideration and suggestions on this matter !! Regarding the second part of your answer, you mean to feed directly the above transformed log2(counts+1) in limma -like the microarrays pipeline (limFit, eBayes) without any other transformations/normalization from the RNA-Seq pipeline, like DGEList and cpm, correct ? in order to perform Differential gene expression with the RNA-Seq data similar to microarrays ?

Moreover, if i would like to perform a small intensity filtering to reduce the features prior statistical testing, which could be the case here ? with the above transformed values ?

ADD REPLY
1
Entering edit mode

Yes, you would still need to normalize, particularly for library size, and to filter non-expressed genes. Quantile normalization would equalize the library sizes as well as other things and I have edited my answer to mention this. I don't suggest this is better than converting back to counts, just another possibility.

ADD REPLY
0
Entering edit mode

Dear Gordon, thank you again for your updated recommendations !! so the above are two available methodologies to use/test--just a small comment regarding the direct approach with the log2 transformed counts :

xx <- TCGA.RNASeqV2_eset

set <- exprs(xx)

set.norm <- normalizeQuantiles(set)

range(set.norm)
[1]  0.0000 17.5553

Also i forward the links of a quick boxplot & a density plot with plotDensity function:

https://www.dropbox.com/s/mor4qmvnbf50xej/boxplot.norm.png?dl=0

https://www.dropbox.com/s/5squxrtv3v2ltqk/densityPlot.png?dl=0

Thus, after an initial ispection of the normality of data-and without other transformation like voom that you mentioned--, my main issue is how i should perform the filtering for non-expressed genes? i mean, like normalized microarray data ? or i should be concerned about the specific range of the values of the data and different technology, and select a specific cutoff ?

ADD REPLY
1
Entering edit mode

You can filter genes using the same sort of common sense considerations that we always use for differential expression analyses. I have edited my answer to illustrate this.

ADD REPLY
0
Entering edit mode

Thank you one more time for your updated example !! two quick comments to add:

1) you mentioned in the filtering part 

keep <- rowSums(y > log2(11)) >= 14

but you used log2(11) instead of y>11 ? or it is just a typo ?

2) a more statistical question but vital:

[due to one very interesting comparison that i would like to perform is a direct comparison between microsatelite stable and ustable tumors(the information is present in a subset of the samples)]:

ind <- sampleNames(TCGA.RNASeqV2_eset)[TCGA.RNASeqV2_eset$msi%in%c("MSS","MSI")& TCGA.RNASeqV2_eset$sample_type%in%"tumor"]

set.2 <- TCGA.RNASeqV2_eset[,ind]

dim(set.2)

Features  Samples

   20502       38

> table(set.2$msi)

MSI MSS 
  8  30 

targets <- pData(set.2)

msi <- factor(targets$msi)

gender <- factor(targets$gender)

table(gender)
gender
 f  m 
22 16 

So in this point, my crusial question is the following: because gender is an important confounder (especially in the microsatelite status, as it is known in colorectal cancer biology), the most simple and robust approach is to include this factor in my design matrix ? like the following ?

design <- model.matrix(~msi + gender) ?

and more generally i could even add a continuous confounder, like age correct ?

ADD REPLY
1
Entering edit mode

No, there is no typo. The y-values are approximately log2(count+1) so, if I want the counts to be >10, then I need y > log2(10+1).

Obviously you can do other analyses for MSI status and so on, but a complete analysis is your job, not mine.

ADD REPLY
0
Entering edit mode

Oh my mistake for forgetting the initial transformation regarding the filtering.

Also please excuse me for the possible misunderstanding-not intented to ask for other kind of help or irrelevant suggestions of this topic-just asked it for a quick confirmation about any similar example ( i just mentioned msi status in this example) and case like this, when needed to adjust for specific confounders. Overall, thank you for your total help & consideration on this matter !!

ADD REPLY
0
Entering edit mode

Dear Gordon,

please excuse me to return after a rather long period-however, based on some search to verify the pre-processing details of this dataset, it seems that the estimated counts are probably not log2(raw_estimated_counts +1),

but log2(normalized_results +1)

From a small search, these normalized results for level 3 TCGA, are described: "normalized_results contain the raw counts estimated by RSEM, but on a scaled version-that is, the values are divided by the 75-percentile & multiplied by 1000". Thus, in your opinion as an expert, as this transformation could make the values/counts more comparable between samples/experiments, could this hamper the approach you described above ? that is, after the log2 transform, a further normalization with normalizeQuantiles could be redundant ? 

Or my notion is incorrect and this should not affect the analysis above, started with normalizeQuantiles ? And anyway to take into account at least the library size, normalizeQuantiles is essential ?

(*I would like to initially apologize for any possibly naive questions, but this matter is rather complicated without asking a feedback about this putative issue).

ADD REPLY
1
Entering edit mode

I would still use quantile normalization myself.

Given the somewhat arbitrary way in which the expression values have been processed, this makes me more strongly recommend the limma approach rather than trying to convert back to actual counts for edgeR or DESeq2.

ADD REPLY
0
Entering edit mode

Dear Gordon,

I also wanted to know about the differential analysis for Level_3 Data (file names: *.rsem.genes.normalized_results) downloaded from TCGA, shows the gene-level transcription estimates, as in log2(x+1) transformed RSEM normalized count.

This HCC data. when I tried "curatedHCCData". It shows not available. 

ADD REPLY
0
Entering edit mode

Hi svlachavas,

I have the same problem like yours. You are using library(curatedCRCData) But I need liver data. What I need to do ? I checked for curatedliverdata it says not available. Can you help me in this?

ADD REPLY
1
Entering edit mode

Ovarian, CRC, and bladder were quite a bit of work, and I haven't had the resources or cancer-specific projects to do it for liver or HCC. If someone can bring time and/or money to expanding the curation approach to other cancers, I'm open to providing guidance. 

ADD REPLY
0
Entering edit mode

Dear ghk,

perhaps for your need and kind of analysis, you should alternatively try the R package TCGAbiolinks

In the link below https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html

you can check from the GDC projects, that there is a one called TCGA-LIHC-so you can inspect the available transcriptomic data (HTSeq counts, etc) and start the analysis of your interest

ADD REPLY
0
Entering edit mode

Dear svalachavas,

I Obtained TCGA gene expression RNAseq (polyA+ IlluminaHiSeq) Level 3 data. It has log2(x+1) transformed RSEM normalized counts [https://xenabrowser.net/datapages/?dataset=TCGA.BRCA.sampleMap/HiSeqV2&host=https://tcga.xenahubs.net]. I haven't heard about these RSEM till now. I need to do differential analysis with this data.

I have seen that you had similar problem here. Did you use what Gordon mentioned? And Is everything fine with your results?

Your help is greatly appreciated. Thank you !!

ADD REPLY
0
Entering edit mode

Dear Gordon,

In the above code I have a small doubt. In the filtering step Why you are keeping genes that have about 10 counts or more in atleast 14 samples ? why not more than 14 samples and why not only 3 samples? Could you please tell me this. Thank you

ADD REPLY
0
Entering edit mode

Simply because there are 14 samples in the smallest "experimental group" (in this case "adjacentnormal").

Given that, I reckon you can divine the answers to your "why not more or less" questions by playing out a few thought experiments in your mind but to answer one specifically: if you required more than 14 samples, then genes that are exclusively expressed in adjacentnormal would be filtered out from  your analysis, and you wouldn't want that.

ADD REPLY
0
Entering edit mode

For my data It looks like following:

table(targets$Sample.type)

MB1   MB2 
286     80 

So, on what sample number should I filter now? 

Do I need to filter like this? 

keep <- rowSums(y > log2(11)) >= 80

ADD REPLY
0
Entering edit mode

Keep in mind that this is more of an art than a science.

With that having been said, in in your case of 286 samples in one or 80 in another, if I were to use the same "filtering idiom" (expressed over X cpm in Y samples) I personally wouldn't go up to 80. In my mind, these sample subsets are big enough where there might be sum "subgroups" in these larger groups you identified. I might just ensure minimal expression in 10-15% of the samples, irrespective of group.

Also consider the fact that another equally valid approach to filter is to simply ensure that the mean cpm of a gene is above a minimal threshold. Now you don't have to worry about how many samples exhibit some minimal expression.

 

ADD REPLY
1
Entering edit mode
Levi Waldron ★ 1.1k
@levi-waldron-3429
Last seen 5 weeks ago
CUNY Graduate School of Public Health a…

I can offer a couple things -

1) the survival data are in pData(TCGA.RNASeqV2_eset)[, c("days_to_death", "vital_status")] not in the exprs() - apologies if the man page makes that unclear. 

2) the level-III TCGA RNA-seq data are provided as-is, BUT are log2-transformed if the maximum value is greater than 50 (from attach_PROCESSED.R in the pipeline):

if(max(exprs(eset),na.rm=TRUE) > 50){
  ##Do log-transformation
  if(min(exprs(eset),na.rm=TRUE) <= 0){
    exprs(eset) <- exprs(eset) - min(exprs(eset),na.rm=TRUE) + 1
  }
  exprs(eset) <- log2(exprs(eset))
}

FYI the script used to download TCGA RNA-seq data is here (see download_TCGA-RNASeqV2-rectum.sh and download_TCGA-RNASeqV2.sh). These download URLs are broken now that TCGA data have been moved to the GDC, but it they do show the original filenames if you'd like to check them out from GDC:

  • unc.edu_COAD.IlluminaHiSeq_RNASeqV2.Level_3.1.4.0.tar.gz
  • unc.edu_READ.IlluminaHiSeq_RNASeqV2.Level_3.1.5.0.tar.gz
ADD COMMENT

Login before adding your answer.

Traffic: 784 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6