Validity of using rlog transformed counts (rld) to obtain a list of expressed genes
Entering edit mode
sjmonkley ▴ 30
Last seen 6 weeks ago

Dear Expert(s)

(Sorry if this information is provided elsewhere- i have looked high and low with no success)

I have a RNASeq data set of 22 samples that can be compared to one another  in a number of different combinations. Some of the samples are essentially treatment subsets of others.  I have used DESeq2 on the countdata ( along with associated colData) to generate both rld and dds (and res) objects. For some samples there are (biological) duplicates for others quadruplicates.

I can generate the DE results for various sample combinations using contrasts but what I also want is a list of genes that are 'expressed' in one set of samples but (for example)  not another. I realise what is 'expressed' depends on how that is defined and what cutoff is used but my questions are

1) is it valid to use the rlog transformed matrix (assay(rld)) for this purpose? I am concerned that this data is subject to normalization based on the experimental design input and so would differ with different designs.

If no which package would be recommended to obtain such data from count matrix?

2) If yes to above:   are there any error or p-values associated with this data? (There appears to be such data in the DESeqDataSet buty I cant work out how to get at it). The intention being I would be used as a cutoff for expression values (lfc) for which error is too high.

3) is it valid to use the mean of replicates to obtain a value for a particular cell type

4) I have tried to use the rlog function on the countdata matrix which the manual suggests you can but I get an error (Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘sizeFactors’ for signature ‘"data.frame"’).




DESeq2 rlog • 2.1k views
Entering edit mode
Last seen 28 minutes ago
United States

Expressed vs not expressed is a tough question, and can't be produced directly from just any dataset as you can imagine (I would want to be certain I have saturated w.r.t sequencing depth for one thing). Here are a few suggestions though:

I could imagine using the rlog or VST transformed data to detect on/off distributions for expression if I had many samples from many different tissues types. Note that, if you have 100s of samples, the VST will be much faster than rlog.

To get an idea for this, see the Gene Expression Barcode paper (esp Fig 1):

Re: question (1), I would use blind=FALSE, so the dispersion estimates takes into account the differences between tissues. It is only the global pattern of dispersion~mean which is used in the transformation, not which samples belong to which group. See the vignette for more on this distinction. For blind=FALSE, you will need to provide a DESeqDataSet, not a matrix.

Re: (2), no, it is just a transformation (like log2 or sqrt), no testing or standard errors.

Re: (3) seems reasonable to me.

Re: (4) Try providing a DESeqDataSet instead of a matrix.

Entering edit mode

Thanks for your very helpful reply. I have 22 samples containing 3-5 distinct cell types- I would like to use this approach to help tell me how distinct. While I agree this is nothing compared to the Barcode dataset I still feel that there is value in coming up with gene lists that are considered on in one cell type  and off in another.This sort of thing can be and is being done with FPKM data and I would like to do the same without having to reprocess my data  

I look forward to the Barcode data based on  RNASeq data- that may help in deciding on/off thresholding. 

One more question that has come up when I raised this issue with colleagues: given how the rlog transformation works is it reasonable (assuming all caveats discussed above) to have the same expressed/nonexpressed threshold for all genes?

If so what would that threshold be? Say for example if a threshold of 5 FPKM was considered reasonable is there a way of coming up with a number that would be equivalent to this in the rlog transformed data? Could I use log2(5) ?

Entering edit mode

No, the same threshold for all genes would not be a good idea. The rlog is similar to log2(K_ij/s_j) (see paper). But the count K_ij is proportional to gene length, sequence content of the gene etc, in experiment-specific and sample-specific ways (which we don't understand that well yet).

But at the least you would want to correct for gene length and GC content. See the cqn and EDASeq packages.


Login before adding your answer.

Traffic: 331 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6