Question: DESeq2 rlog transformation
2.9 years ago by
European Union
francesca.defilippis wrote:


I'd like to use the DESeq2 rlog transformation so that I can use the normalized matrix for PCA and heatmaps.

I understand that I should use the raw counts as input, but I'd like to understand how the transformation takes differences in library size into account.

In particular, I'd like to use it with KEGG orthologs (KOs), so this is only a "subset" of my raw counts matrix, containing only the genes that I could assign to a KO; the real library size of each sample was much bigger. Moreover, since each KO can belong to different metabolisms or pathways, rows in my files are repeated (the same KO is repeated for each of the n pathways it belongs to, with the same counts). So basically, in my matrix the sum of a column is not the library size.

Is it correct to use this kind of matrix for rlog? Is it possible to specify the "true" library size?


2.9 years ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber wrote:


In principle, DESeq2's rlog works with subsets of genes (i.e. selected rows of the full count matrix), as long as there are still 'enough' of these genes (in some sense that is not precisely quantified). To answer your post, could you please clarify a few things:

1. Which technology or assays do your 'raw counts' come from? What are rows and columns of the count matrix?

2. Did you consider first doing the rlog transformation on the full matrix, and then subsetting to the genes that have KOs?

3. You say "rows in my files are repeated" because genes can 'belong to multiple metabolisms or pathways'...  There is probably no real harm in sporadically repeating the same data in multiple rows as far as DESeq2 is concerned, but things might go awry if this happens too often. More importantly - what would be the benefit in doing this? Then you get the same test results repeated multiple times, why is once not enough?

As for your size factors question: you can always override DESeq2's estimate by using the 'sizeFactors<-' assignment method on your DESeqDataSet object. DESeq2's method does not actually look at the library size, but rather tries to fit the factors in such a way that the differentially expressed genes are few and specific, i.e. it tries to avoid that detection of differentially expressed genes is simply associated with a common scaling of all counts in a sample.
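To make the idea concrete, here is a minimal Python sketch of a median-of-ratios size factor estimate, which is the general approach behind DESeq2's default. It is an illustration only, not DESeq2's code: the function name `estimate_size_factors` and the toy counts are made up for this example.

```python
import math
from statistics import median


def estimate_size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (genes), each row a list of per-sample raw counts.
    Returns one size factor per sample (column).
    """
    n_samples = len(counts[0])
    # Keep only genes with a positive count in every sample, so that the
    # log geometric mean of each kept row is finite.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has positive counts in every sample")
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count to that gene's geometric mean.
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        # The median is robust to a minority of strongly changing genes.
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data (hypothetical): sample 2 is sequenced twice as deeply as sample 1,
# and one gene (the row [5, 80]) is strongly differentially expressed.
counts = [[10, 20], [100, 200], [50, 100], [5, 80], [30, 60]]
sf = estimate_size_factors(counts)

# Repeating the outlier row many times shifts the median and hence the factors.
sf_dup = estimate_size_factors(counts + [[5, 80]] * 3)

print(sf, sf_dup)
```

In the toy data the second factor is twice the first, because the single outlier gene does not move the median. Once that row is repeated often enough, the estimate changes, which is the sense in which too many repeated rows can make things go awry. Column sums never enter the calculation.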

Kind regards




Hi Wolfgang!

Thanks for the reply.

To address your questions:

1. I have 24 samples (2 replicates x 12 conditions). They are Illumina single-end RNA-seq of bacteria from cheese. I mapped the reads (after quality filtering and trimming) to the protein-coding genes of the bacteria I'm interested in. Then I used the KEGG database to assign the functional annotation. So the rows in my matrix are KO IDs and the columns are samples.

2. I didn't think of using rlog directly on the full matrix, simply because I'm not interested in genes to which I cannot assign a KO. Of course, if this can be an issue for DESeq2, I'll reconsider it!

3. I'd like to keep the repeated KOs, just for visualization. The aim of this project is to see whether there are differences in bacterial gene expression in cheese ripened in different conditions. So, as a first screening, I'd like to do a heatmap and a PCA and see which genes drive the separation of the samples. It would be easier to have as rows of the heatmap the KO and the metabolism(s) it belongs to, in order to see whether one specific metabolism drives the separation, even if a specific KO can belong to multiple metabolisms. Should I do the rlog transformation before duplicating the rows?

So, if I understood correctly, the DESeq2 rlog transformation doesn't care about library size, right? Then do you think I can use this matrix as it is for rlog and the heatmap?

written 2.9 years ago by francesca.defilippis

Just to clarify this:

"so, if I understood well, deseq rlog transformation doesn't care about library size, right?"

It does care about it, in the sense that the size factors are used to normalize the counts before rlog-transforming them. So, if the size factors are inappropriate, the resulting transformed data will not be well normalized.
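As a small illustration of why the factors matter, the Python sketch below divides raw counts by per-sample size factors. The library sizes and counts are invented, and deriving factors from library sizes by scaling them to a geometric mean of 1 is just one convention assumed here; it is not what DESeq2 does by default.

```python
import math


def factors_from_library_sizes(lib_sizes):
    """Scale library sizes so that their geometric mean is 1 (assumed convention)."""
    log_mean = sum(math.log(s) for s in lib_sizes) / len(lib_sizes)
    return [s / math.exp(log_mean) for s in lib_sizes]


def normalize(row, size_factors):
    """Divide each raw count by the size factor of its sample."""
    return [c / s for c, s in zip(row, size_factors)]


# Hypothetical "true" library sizes: sample 2 was sequenced twice as deeply.
lib_sizes = [1_000_000, 2_000_000]
good_factors = factors_from_library_sizes(lib_sizes)
bad_factors = [1.0, 1.0]  # pretends both samples have the same depth

raw = [10, 20]  # a gene whose relative abundance is the same in both samples

print(normalize(raw, good_factors))  # the two values agree
print(normalize(raw, bad_factors))   # a spurious two-fold difference remains
```

With inappropriate factors the depth difference stays in the data, and any log-like transformation applied afterwards will carry it into the PCA and the heatmap.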

"Should I do the rlog transformation before duplicate the rows?"

Yes. I would suggest that you load all your counts into a DESeqDataSet object, call 'estimateSizeFactors' to get a proper normalization, then do the rlog transformation, and only then remove the rows you are not interested in and duplicate those you need multiple times.
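The order of operations suggested above can be sketched in Python as follows. This is a toy stand-in, not DESeq2: a plain log2(x + 1) on normalized counts replaces rlog, and all gene IDs, KO labels, pathway names and counts are hypothetical.

```python
import math
from statistics import median


def size_factors(rows):
    """Median-of-ratios size factors from a list of count rows (genes x samples)."""
    n = len(rows[0])
    usable = [r for r in rows if all(c > 0 for c in r)]
    log_gm = [sum(math.log(c) for c in r) / n for r in usable]
    return [
        math.exp(median([math.log(r[j]) - g for r, g in zip(usable, log_gm)]))
        for j in range(n)
    ]


# Step 1: the FULL count matrix, including rows without a KO assignment.
full_counts = {
    "KO_a": [10, 20],
    "KO_b": [100, 200],
    "KO_c": [50, 100],
    "unannotated_1": [30, 60],
    "unannotated_2": [5, 80],
}
# Hypothetical KO-to-pathway annotation; KO_a belongs to two pathways.
ko_to_pathways = {"KO_a": ["pathway_1", "pathway_2"], "KO_b": ["pathway_1"]}

# Step 2: estimate size factors on the full matrix.
sf = size_factors(list(full_counts.values()))

# Step 3: transform everything (log2(x + 1) is a simple stand-in for rlog).
transformed = {
    gene: [math.log2(c / s + 1) for c, s in zip(row, sf)]
    for gene, row in full_counts.items()
}

# Step 4: only now subset to annotated KOs and repeat rows per pathway.
heatmap_rows = [
    ((ko, pathway), transformed[ko])
    for ko, pathways in ko_to_pathways.items()
    for pathway in pathways
]

for label, values in heatmap_rows:
    print(label, values)
```

Because the duplication happens after the transformation, the repeated rows only affect the plot and cannot influence the size factors or the transformed values.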

written 2.9 years ago by Simon Anders