Question

Going crazy with normalizing TCGA raw RSEM gene counts

ezz • 0

@ezz-7643

Last seen 9.0 years ago

United States

1- I obtained the RNAseqV2 raw counts, because another post advised against using RSEM.GENE.NORMALIZED: those values still show irregularities, as seen in the diag.boxplot.

2- I tried several normalization methods so that I could start the clustering analysis.

3- I used quantile normalization from the preprocessCore package, the EDASeq withinLaneNormalization function, DESeq2 rlog with design ~ 1, and edgeR CPM with calcNormFactors, and they all give different values. I don't know which one to use, or whether I can apply quantile normalization directly to the normalized RSEM gene counts.

4- After clustering for data exploration and obtaining, for example, 3 groups, can I renormalize (BASED ON THE NEW GROUPING) and assess differential gene expression? The conditions or design are not yet known at the time of the initial normalization, so I will depend on clustering to create these groupings.

Your help is greatly appreciated.

cancer tcga normalization rnaseq differential gene expression • 3.5k views

updated 6.1 years ago by Dimitris Vavoulis • written 9.0 years ago by ezz

score 0 · Answer 1 · 2015-04-28

DGEclust is a software package that clusters read-counts, and then uses the clusters to do differential expression analysis. One problem with the method is that it does not normalise read-counts for gene length, so it only works correctly for CAGE-seq read-counts. I don't know how the journal's reviewers didn't notice such a serious problem with the method.

score 0 · Answer 2 · 2018-03-27

Hi,

This is an old thread, but I came across this just now. Below are a few comments, which may be useful to those using DGEclust.

1) As a general remark, I don't think gene length is relevant when examining the expression of any particular gene across different samples/conditions (e.g. as in differential expression analysis), because the implicit assumption here is that the length of said gene remains the same across samples. Gene length may be relevant when studying (e.g. clustering) the expression (i.e. read counts) of different genes within the same sample.
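To make point 1 concrete, here is a minimal sketch in Python with NumPy. The counts and the gene length are made-up numbers and the helper name log2_fold_change is hypothetical; none of this comes from DGEclust. It only shows that dividing one gene's counts by that gene's length in both samples leaves the between-sample log fold change unchanged.

```python
import numpy as np


def log2_fold_change(count_a, count_b):
    """Log2 ratio of one gene's expression between two samples."""
    return np.log2(count_a / count_b)


# Hypothetical values for a single gene, for illustration only.
gene_length_kb = 2.5
count_a = 400.0  # reads for the gene in sample A
count_b = 100.0  # reads for the same gene in sample B

# Fold change computed from the raw counts.
raw_lfc = log2_fold_change(count_a, count_b)

# Fold change computed after dividing both counts by the gene length.
norm_lfc = log2_fold_change(count_a / gene_length_kb,
                            count_b / gene_length_kb)

# The length appears in numerator and denominator, so it cancels.
print(raw_lfc, norm_lfc)
```

The cancellation only holds when the same gene is compared across samples. Comparing different genes within one sample would involve different lengths, which is the case where length can matter.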

2) Having said that, DGEclust does not explicitly cluster read counts, but rather log-fold changes (i.e. the beta parameters in Eqs. 1, 2 and elsewhere in the paper). In this model, log-fold-changes are dimensionless quantities, which are not dependent on gene length, unlike read-counts, or possibly gene-specific mean expression levels and dispersion parameters.

3) If after the above, one still wishes to use length-normalised data with DGEclust, this is possible. Just instantiate the class CountData with the length-normalised pseudo-counts (as shown in the manual), and let parameter lib_sizes be equal to a vector of ones. This signals the software that the data are already normalised, and no further action needs to be taken in this respect. Internally, we use a continuous version of the negative-binomial distribution, so using a data matrix of possibly non-integer pseudo-counts ought to work.
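As a rough sketch of the data preparation described in point 3, the NumPy snippet below builds length-normalised pseudo-counts and a vector of ones, one per sample. The count matrix, the gene lengths, and the variable names are assumptions chosen for illustration. The call to CountData itself is deliberately left out, since its exact signature should be taken from the DGEclust manual rather than from this example.

```python
import numpy as np

# Hypothetical raw counts: rows are genes, columns are samples.
counts = np.array([[400.0, 100.0, 250.0],
                   [30.0, 60.0, 45.0]])

# Hypothetical gene lengths in kilobases, one per row of `counts`.
gene_length_kb = np.array([2.5, 1.5])

# Divide each gene's row by its length; the result is generally non-integer.
pseudo_counts = counts / gene_length_kb[:, np.newaxis]

# One entry per sample, all equal to one, so that no further
# library-size scaling is implied by these values.
lib_sizes = np.ones(counts.shape[1])

print(pseudo_counts)
print(lib_sizes)
```

These two arrays correspond to the data matrix and the lib_sizes value mentioned above; how they are passed to CountData is described in the manual.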

HTH,

Dimitris Vavoulis