Question: Going crazy with normalizing TCGA raw rsem gene count
ezz (United States) wrote, 3.1 years ago:

1- I obtained the RNAseqV2 raw counts, because another post advised against using RSEM.GENE.NORMALIZED: those values still show irregularities in the diag.boxplot.

2- I tried several normalization methods so that I could start the clustering analysis.

3- I used quantile normalization from the preprocessCore package, the EDASeq withinLaneNormalization function, DESeq2's rlog with design ~ 1, and edgeR's CPM with calcNormFactors. They all give different values. I don't know which one to use, or whether I can apply quantile normalization directly to the normalized RSEM gene counts.

4- After clustering for data exploration and obtaining, for example, 3 groups, can I renormalize (based on the new grouping) and then assess differential gene expression? The conditions or design are not yet known at the time of the initial normalization, so I will depend on the clustering to create these groupings.
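
To illustrate why the methods in point 3 return different values, here is a minimal sketch in plain Python. It is not the preprocessCore or edgeR implementation: the toy matrix, the function names `cpm` and `quantile_normalize`, and the tie-breaking rule are all assumptions made for illustration. CPM only rescales each sample by its library size, whereas quantile normalization forces every sample to share the same distribution, so the two cannot agree in general.

```python
# Toy genes-by-samples matrix of raw counts (made-up numbers, not TCGA data).
counts = [
    [10, 20, 40],
    [100, 150, 500],
    [5, 30, 60],
    [50, 100, 200],
]


def cpm(matrix):
    """Counts per million: scale each sample (column) by its library size."""
    n_samples = len(matrix[0])
    lib_sizes = [sum(row[j] for row in matrix) for j in range(n_samples)]
    return [[row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
            for row in matrix]


def quantile_normalize(matrix):
    """Give every sample (column) the same distribution of values.

    Simplification: ties are broken by row order rather than averaged.
    """
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(sc[k] for sc in sorted_cols) / n_samples
                  for k in range(n_genes)]
    out = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        # Gene indices ordered from lowest to highest count in this sample.
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            out[i][j] = rank_means[rank]
    return out


cpm_values = cpm(counts)
qn_values = quantile_normalize(counts)

for name, values in (("CPM", cpm_values), ("quantile", qn_values)):
    print(name)
    for row in values:
        print("  ", [round(v, 2) for v in row])
```

After CPM every column sums to one million but the columns keep their own shapes; after quantile normalization the sorted values of every column are identical.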


Your help is greatly appreciated.

Dario Strbenac wrote, 3.1 years ago:

DGEclust is a software package that clusters read-counts, and then uses the clusters to do differential expression analysis. One problem with the method is that it does not normalise read-counts for gene length, so it only works correctly for CAGE-seq read-counts. I don't know how the journal's reviewers didn't notice such a serious problem with the method.

Dimitris Vavoulis wrote, 8 weeks ago:


This is an old thread, but I came across this just now. Below are a few comments, which may be useful to those using DGEclust.

1) As a general remark, I don't think gene length is relevant when examining the expression of any particular gene across different samples/conditions (e.g. as in differential expression analysis), because the implicit assumption here is that the length of said gene remains the same across samples. Gene length may be relevant when studying (e.g. clustering) the expression (i.e. read counts) of different genes within the same sample.   

2) Having said that, DGEclust does not explicitly cluster read counts, but rather log-fold changes (i.e. the beta parameters in Eqs. 1, 2 and elsewhere in the paper). In this model, log-fold-changes are dimensionless quantities, which are not dependent on gene length, unlike read-counts, or possibly gene-specific mean expression levels and dispersion parameters.

3) If after the above, one still wishes to use length-normalised data with DGEclust, this is possible. Just instantiate the class CountData with the length-normalised pseudo-counts (as shown in the manual), and let parameter lib_sizes be equal to a vector of ones. This signals the software that the data are already normalised, and no further action needs to be taken in this respect. Internally, we use a continuous version of the negative-binomial distribution, so using a data matrix of possibly non-integer pseudo-counts ought to work.    
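
A small worked example of points 1 to 3, as a sketch under stated assumptions: the gene names, lengths, and counts are invented, and the code does not call DGEclust itself. It only shows that the log-fold change of a gene between conditions is unchanged when both conditions are divided by the same gene length, and how length-normalised pseudo-counts and a `lib_sizes` vector of ones could be prepared before being handed to `CountData` as described in the manual.

```python
import math

# Hypothetical counts, already comparable across samples.
# gene_long is 10x longer than gene_short, with the same per-base expression.
gene_lengths = {"gene_short": 1_000, "gene_long": 10_000}
counts = {
    "gene_short": {"control": 50.0, "treated": 200.0},
    "gene_long": {"control": 500.0, "treated": 2000.0},
}


def log2_fold_change(gene, scale=1.0):
    """log2(treated / control) after multiplying both conditions by `scale`."""
    c = counts[gene]
    return math.log2((c["treated"] * scale) / (c["control"] * scale))


# Points 1 and 2: the gene length cancels in the ratio, so the
# log-fold change is the same with or without length normalisation.
lfc = {}
for gene, length in gene_lengths.items():
    raw = log2_fold_change(gene)
    per_kb = log2_fold_change(gene, scale=1_000 / length)
    lfc[gene] = (raw, per_kb)
    print(gene, raw, per_kb)

# Point 3: length-normalised pseudo-counts (reads per kilobase), possibly
# non-integer, together with library sizes set to a vector of ones.
samples = ["control", "treated"]
pseudo_counts = [
    [counts[gene][s] * 1_000 / gene_lengths[gene] for s in samples]
    for gene in gene_lengths
]
lib_sizes = [1.0] * len(samples)
print(pseudo_counts, lib_sizes)
```

Within a sample, the raw counts of the two genes differ tenfold purely because of length, which is the case where length normalisation matters; across conditions, the fold change of each gene is unaffected.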


Dimitris Vavoulis

