Question

DESeq2 rlog transformation

0

francesca.defilippis ▴ 100

@francescadefilippis-7043

Last seen 9.3 years ago

European Union

Hi!

I'd like to use the DESeq2 rlog transformation so that I can use the normalized matrix for PCA and heatmaps.

I understand that I should use the raw counts as input, but I'd like to understand how the transformation takes differences in library size into account.

In particular, I'd like to use it with KEGG orthologs (KOs), so the input is only a "subset" of my raw count matrix, containing only the genes that I could assign to a KO; the real library size of each sample was much bigger. Moreover, since each KO can belong to several metabolisms or pathways, rows in my files are repeated (the same KO is repeated for each of the n pathways it belongs to, with the same counts). So in my matrix the column sums are not the library sizes.

Is it correct to use this kind of matrix for rlog? Is it possible to specify the "true" library size?

thanks

deseq2 normalization • 7.2k views

ADD COMMENT • link updated 9.4 years ago by Wolfgang Huber ★ 13k • written 9.4 years ago by francesca.defilippis ▴ 100

score 1 · Answer 1 · 2014-12-07

1

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 12 days ago

EMBL European Molecular Biology Laborat…

Francesca

In principle, DESeq2's rlog works with subsets of genes (i.e. selected rows of the full count matrix), as long as there are still 'enough' genes (in some sense that is not precisely quantified). To answer your post, could you please clarify a few things:

1. Which technology or assays do your 'raw counts' come from? What are rows and columns of the count matrix?

2. Did you consider first doing the rlog transformation on the full matrix, and then subsetting to the genes that have KOs?

3. You say "rows in my files are repeated" because genes can 'belong to multiple metabolisms or pathways'... There is probably no real harm in sporadically repeating the same data in multiple rows as far as DESeq2 is concerned, but things might go awry if this happens too often. More importantly - what would be the benefit in doing this? Then you get the same test results repeated multiple times, why is once not enough?

As for your size factors question: you can always override DESeq2's method by using the 'sizeFactors<-' assignment method on your DESeqDataSet object. DESeq2's method does not actually look at the library size, but rather tries to fit the factors in such a way that the differentially expressed genes are few and specific, i.e. it tries to avoid the detection of differentially expressed genes being simply associated with a common scaling of all counts in a sample.
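To make the size factor point concrete, here is a minimal base-R sketch of a median-of-ratios estimate, which is the idea behind the default in 'estimateSizeFactors'. The toy counts and variable names are made up for illustration, and the real function handles details (such as genes with zero counts) that this sketch ignores.

```r
# Toy counts (hypothetical): 5 genes x 3 samples.
# s2 is s1 sequenced twice as deep; s3 equals s1 except for one
# very highly expressed gene that inflates its column sum.
counts <- cbind(
  s1 = c(10, 20, 30, 40, 100),
  s2 = c(20, 40, 60, 80, 200),
  s3 = c(10, 20, 30, 40, 1000)
)
rownames(counts) <- paste0("gene", 1:5)

# Median-of-ratios: use the per-gene geometric mean across samples as a
# reference, then take the median ratio of each sample to that reference.
# (Assumes no zero counts; genes with zeros would have to be dropped.)
log_geo_means <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col) - log_geo_means))
})

# Naive scaling by column sums, for comparison.
lib_size_factors <- colSums(counts) / mean(colSums(counts))

print(round(size_factors, 3))
print(round(lib_size_factors, 3))

# With DESeq2 loaded and a DESeqDataSet called 'dds' (not created here),
# custom factors could be supplied with:
#   sizeFactors(dds) <- size_factors
```

In this toy example s1 and s3 receive the same size factor even though their column sums differ a lot, because the estimate follows the bulk of the genes rather than the total. Repeating rows changes the column sums too, but shifts the median ratio only if the repeated rows are numerous enough to move the median.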

Kind regards

Wolfgang

ADD COMMENT • link 9.4 years ago Wolfgang Huber ★ 13k

0

Hi Wolfgang!

thanks for the reply.

To address your questions:

1 I have 24 samples (2 replicates x 12 conditions). They are Illumina single-end RNA-seq data from cheese bacteria. I mapped the reads (after quality filtering and trimming) to the protein-coding genes of the bacteria I'm interested in. Then I used the KEGG database to assign the functional annotation. So the rows in my matrix are KO IDs and the columns are samples.

2 I didn't think of using rlog directly on the full matrix, simply because I'm not interested in genes that I can't assign to a KO. Of course, if this can be an issue for DESeq2, I'll reconsider it!

3 I'd like to keep the repeated KOs, just for visualization. The aim of this project is to see whether there are differences in bacterial gene expression in cheese ripened in different conditions. As a first screening, I'd like to do a heatmap and a PCA and see which genes drive the separation of the samples. It would be easier to have as heatmap rows the KO together with the metabolism(s) it belongs to, in order to see whether one specific metabolism drives the separation, even if a given KO can belong to multiple metabolisms. Should I do the rlog transformation before duplicating the rows?

So, if I understood correctly, the DESeq2 rlog transformation doesn't care about library size, right? Then do you think I can use this matrix as it is for rlog and the heatmap?

ADD REPLY • link 9.4 years ago francesca.defilippis ▴ 100

1

Just to clarify this:

"so, if I understood well, deseq rlog transformation doesn't care about library size, right?"

It does care about it, in the sense that the size factors are used to normalize the counts before rlog-transforming them. So, if the size factors are inappropriate, the resulting transformed data will not be well normalized.

"Should I do the rlog transformation before duplicate the rows?"

Yes. I would suggest you load all your counts into a DESeqDataSet object, call 'estimateSizeFactors' to get a proper normalization, then do the rlog transformation, and only then remove the rows you are not interested in and duplicate those you need multiple times.
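The order of steps suggested above could look roughly like the sketch below. It assumes the DESeq2 package is installed; the simulated counts, the sample table and the 'ko_map' annotation table are all hypothetical stand-ins for the real data.

```r
library(DESeq2)

set.seed(1)

# Hypothetical full count matrix: 500 genes x 24 samples
# (12 conditions x 2 replicates), simulated only so the sketch runs.
n_genes <- 500
gene_means <- 2^runif(n_genes, 3, 12)
counts <- matrix(
  rnbinom(n_genes * 24, mu = rep(gene_means, times = 24), size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("s", 1:24))
)
coldata <- data.frame(
  condition = factor(rep(paste0("c", 1:12), each = 2)),
  row.names = colnames(counts)
)

# 1. Load ALL counts, not only the rows with a KO assignment.
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = coldata,
                              design = ~ condition)

# 2. Estimate size factors on the full matrix.
dds <- estimateSizeFactors(dds)

# 3. rlog-transform; the size factors stored in 'dds' are used.
rld <- rlog(dds, blind = TRUE)
mat <- assay(rld)

# 4. Only now subset and duplicate rows for plotting.
# Hypothetical annotation: one row per (gene, pathway) pair.
ko_map <- data.frame(
  gene    = c("gene1", "gene1", "gene2"),
  pathway = c("pathA", "pathB", "pathA")
)
heat <- mat[ko_map$gene, , drop = FALSE]
rownames(heat) <- paste(ko_map$gene, ko_map$pathway, sep = " | ")
```

Because the duplication happens after the transformation, the repeated rows carry identical values and have no influence on the size factors or on the rlog fit. 'heat' can then be passed to whatever heatmap function is preferred.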

ADD REPLY • link 9.4 years ago Simon Anders ★ 3.7k