DESeq2, outlier correction, Combat, NMF
Entering edit mode
noname • 0
Last seen 22 months ago


I am wondering if you can advice me on my latest project. I have 3 batches of mRNA-seq with 50, 40, 100 samples (same protocol). My end goal is to use NMF to cluster the samples.

I noticed that there are ~200 genes with very high expression values (100k+ raw counts) spread between the samples such that a gene has a very high count in only 1 sample. From the DESeq2 documentation I saw that such artefacts are not uncommon. Because I do not want to throw these genes out I decided to:

1. Give the raw counts for all 50k genes to DESeq2 and also supply the batch information.

colData <- data.frame( Batch = as.factor( batch ) )
dseq <- DESeqDataSetFromMatrix( countData = object, colData = colData, design = ~Batch )                                   
rna_seq_raw_deseq2 <- DESeq( dseq )

2. Retrieve the outlier corrected counts.

assays( rna_seq_raw_deseq2 )[["replaceCounts"]]

From what I saw from the source code it seem that before doing the outlier correction (Cook's) DESeq2 corrects for library size, so I assumed I do not have to do that after obtaining the "replaceCounts".

a) DESeq2 reports that it corrects 4036 outlierrs.

-- replacing outliers and refitting for 4036 genes

b) When I use summary( results( rna_seq_raw_deseq2 ) it reports there are 0 outliers.

out of 50430 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 8452, 17% 
LFC < 0 (down)   : 6369, 13% 
outliers [1]     : 0, 0% 
low counts [2]   : 16635, 33% 
(mean count < 0)

Next I run Combat to account for the batch effect (there is a batch effect).

1. Log2 the "replaceCounts"

rna_seq.log2 <- log2( assays( rna_seq_raw_deseq2 )[["replaceCounts"]] +1 )

2. Run Combat

rna_seq.log2.combat <- ComBat( dat = rna_seq.log2, batch = batches, par.prior = FALSE, prior.plots = TRUE )

Combat ends up creating some negative values.

Running NMF.

1. I select the top 1500 genes using rowVars.

rna_seq.log2.combat.Pvars <- rowVars( rna_seq.log2.combat )

select.1500 <- order( rna_seq.log2.combat.Pvars, decreasing = TRUE)[ 1:1500 ]

rna_seq.log2.combat.1500 <- rna_seq.log2.combat[ select.1500, ]

2. I have to correct for the negative values created by Combat so I simply shift the matrix so that the lowest value becomes 0.

rna_seq.log2.combat.1500 <- rna_seq.lo2.combat.1500 + abs( min( rna_seq.log2.combat.1500 ) )

3. Running NMF

nmf( rna_seq.log2.combat.1500, rank = 4, nrun = 200, .pbackend = 30 )

My questions are:

1. Is it appropriate to use DESeq2 for outlier correction in this fashion?

2. Do you think it is a good idea to remove negative values the way I have done or should I use some other technique?

3. I imagine it will be better to make a pre-selection of genes to do Combat on but how can I pre-select genes if they have not been batch corrected yet?

deseq2 combat outliers NMF • 1.3k views
Entering edit mode
Last seen 55 minutes ago
United States

Do you have any idea where there are so many high counts in a single sample in this data? Your use of DESeq() to replace outliers make sense to me technically, although if the gene is really just an outlier in a single sample, would you be better of filtering out such genes using a simple filter statistic?

Entering edit mode

Sadly I have no idea why a single gene in a sample could have such a high count. They seem to be random.

I went with DESeq2 for the outlier correction because of the inherent filtering steps, lib correction, as well as the option to specify batch as part of the model. I imagine if I am not able to specify batch as part of the model I would have to perform the outlier correction on each batch separately?

I was not able to find an option for it (probably missed it somehow) but is there a way to see which are these -- replacing outliers and refitting for 4036 genes

-- replacing outliers and refitting for 4036 genes
Entering edit mode

I mean, what value do you get out of replacing those outliers? Do you think there is relevant signal from the other samples in these genes?

This is bulk RNA-seq?

Entering edit mode

Yes, this is bulk RNA-seq. The three batches were done with the same protocol as well As for the values I get, I checked a few genes after the outliers were corrected and the new values seemed reasonable and in line with the other samples. ( I am not sure how to get a list of the 4036 genes DESeq2 flags).


Login before adding your answer.

Traffic: 316 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6