DESeq2, outlier correction, Combat, NMF
1
0
Entering edit mode
noname • 0
@noname-13172
Last seen 5.0 years ago

Hello,

I am wondering if you can advice me on my latest project. I have 3 batches of mRNA-seq with 50, 40, 100 samples (same protocol). My end goal is to use NMF to cluster the samples.


I noticed that there are ~200 genes with very high expression values (100k+ raw counts) spread between the samples such that a gene has a very high count in only 1 sample. From the DESeq2 documentation I saw that such artefacts are not uncommon. Because I do not want to throw these genes out I decided to:

1. Give the raw counts for all 50k genes to DESeq2 and also supply the batch information.

colData <- data.frame( Batch = as.factor( batch ) )
dseq <- DESeqDataSetFromMatrix( countData = object, colData = colData, design = ~Batch )                                   
rna_seq_raw_deseq2 <- DESeq( dseq )

2. Retrieve the outlier corrected counts.

assays( rna_seq_raw_deseq2 )[["replaceCounts"]]

From what I saw from the source code it seem that before doing the outlier correction (Cook's) DESeq2 corrects for library size, so I assumed I do not have to do that after obtaining the "replaceCounts".

a) DESeq2 reports that it corrects 4036 outlierrs.

-- replacing outliers and refitting for 4036 genes

b) When I use summary( results( rna_seq_raw_deseq2 ) it reports there are 0 outliers.

out of 50430 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 8452, 17% 
LFC < 0 (down)   : 6369, 13% 
outliers [1]     : 0, 0% 
low counts [2]   : 16635, 33% 
(mean count < 0)

Next I run Combat to account for the batch effect (there is a batch effect).

1. Log2 the "replaceCounts"

rna_seq.log2 <- log2( assays( rna_seq_raw_deseq2 )[["replaceCounts"]] +1 )

2. Run Combat

rna_seq.log2.combat <- ComBat( dat = rna_seq.log2, batch = batches, par.prior = FALSE, prior.plots = TRUE )

Combat ends up creating some negative values.


Running NMF.

1. I select the top 1500 genes using rowVars.

rna_seq.log2.combat.Pvars <- rowVars( rna_seq.log2.combat )

select.1500 <- order( rna_seq.log2.combat.Pvars, decreasing = TRUE)[ 1:1500 ]

rna_seq.log2.combat.1500 <- rna_seq.log2.combat[ select.1500, ]

2. I have to correct for the negative values created by Combat so I simply shift the matrix so that the lowest value becomes 0.

rna_seq.log2.combat.1500 <- rna_seq.lo2.combat.1500 + abs( min( rna_seq.log2.combat.1500 ) )

3. Running NMF

nmf( rna_seq.log2.combat.1500, rank = 4, nrun = 200, .pbackend = 30 )

My questions are:

1. Is it appropriate to use DESeq2 for outlier correction in this fashion?

2. Do you think it is a good idea to remove negative values the way I have done or should I use some other technique?

3. I imagine it will be better to make a pre-selection of genes to do Combat on but how can I pre-select genes if they have not been batch corrected yet?

deseq2 combat outliers NMF • 2.6k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 8 hours ago
United States

Do you have any idea where there are so many high counts in a single sample in this data? Your use of DESeq() to replace outliers make sense to me technically, although if the gene is really just an outlier in a single sample, would you be better of filtering out such genes using a simple filter statistic?

ADD COMMENT
0
Entering edit mode

Sadly I have no idea why a single gene in a sample could have such a high count. They seem to be random.

I went with DESeq2 for the outlier correction because of the inherent filtering steps, lib correction, as well as the option to specify batch as part of the model. I imagine if I am not able to specify batch as part of the model I would have to perform the outlier correction on each batch separately?

I was not able to find an option for it (probably missed it somehow) but is there a way to see which are these -- replacing outliers and refitting for 4036 genes

-- replacing outliers and refitting for 4036 genes
ADD REPLY
0
Entering edit mode

I mean, what value do you get out of replacing those outliers? Do you think there is relevant signal from the other samples in these genes?

This is bulk RNA-seq?

ADD REPLY
0
Entering edit mode

Yes, this is bulk RNA-seq. The three batches were done with the same protocol as well As for the values I get, I checked a few genes after the outliers were corrected and the new values seemed reasonable and in line with the other samples. ( I am not sure how to get a list of the 4036 genes DESeq2 flags).

ADD REPLY

Login before adding your answer.

Traffic: 878 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6