Hello,
I am wondering if you can advice me on my latest project. I have 3 batches of mRNA-seq with 50, 40, 100 samples (same protocol). My end goal is to use NMF to cluster the samples.
I noticed that there are ~200 genes with very high expression values (100k+ raw counts) spread between the samples such that a gene has a very high count in only 1 sample. From the DESeq2 documentation I saw that such artefacts are not uncommon. Because I do not want to throw these genes out I decided to:
1. Give the raw counts for all 50k genes to DESeq2 and also supply the batch information.
colData <- data.frame( Batch = as.factor( batch ) ) dseq <- DESeqDataSetFromMatrix( countData = object, colData = colData, design = ~Batch ) rna_seq_raw_deseq2 <- DESeq( dseq )
2. Retrieve the outlier corrected counts.
assays( rna_seq_raw_deseq2 )[["replaceCounts"]]
From what I saw from the source code it seem that before doing the outlier correction (Cook's) DESeq2 corrects for library size, so I assumed I do not have to do that after obtaining the "replaceCounts".
a) DESeq2 reports that it corrects 4036 outlierrs.
-- replacing outliers and refitting for 4036 genes
b) When I use summary( results( rna_seq_raw_deseq2 )
it reports there are 0 outliers.
out of 50430 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 8452, 17% LFC < 0 (down) : 6369, 13% outliers [1] : 0, 0% low counts [2] : 16635, 33% (mean count < 0)
Next I run Combat to account for the batch effect (there is a batch effect).
1. Log2 the "replaceCounts"
rna_seq.log2 <- log2( assays( rna_seq_raw_deseq2 )[["replaceCounts"]] +1 )
2. Run Combat
rna_seq.log2.combat <- ComBat( dat = rna_seq.log2, batch = batches, par.prior = FALSE, prior.plots = TRUE )
Combat ends up creating some negative values.
Running NMF.
1. I select the top 1500 genes using rowVars.
rna_seq.log2.combat.Pvars <- rowVars( rna_seq.log2.combat ) select.1500 <- order( rna_seq.log2.combat.Pvars, decreasing = TRUE)[ 1:1500 ] rna_seq.log2.combat.1500 <- rna_seq.log2.combat[ select.1500, ]
2. I have to correct for the negative values created by Combat so I simply shift the matrix so that the lowest value becomes 0.
rna_seq.log2.combat.1500 <- rna_seq.lo2.combat.1500 + abs( min( rna_seq.log2.combat.1500 ) )
3. Running NMF
nmf( rna_seq.log2.combat.1500, rank = 4, nrun = 200, .pbackend = 30 )
My questions are:
1. Is it appropriate to use DESeq2 for outlier correction in this fashion?
2. Do you think it is a good idea to remove negative values the way I have done or should I use some other technique?
3. I imagine it will be better to make a pre-selection of genes to do Combat on but how can I pre-select genes if they have not been batch corrected yet?
Sadly I have no idea why a single gene in a sample could have such a high count. They seem to be random.
I went with DESeq2 for the outlier correction because of the inherent filtering steps, lib correction, as well as the option to specify batch as part of the model. I imagine if I am not able to specify batch as part of the model I would have to perform the outlier correction on each batch separately?
I was not able to find an option for it (probably missed it somehow) but is there a way to see which are these -- replacing outliers and refitting for 4036 genes
I mean, what value do you get out of replacing those outliers? Do you think there is relevant signal from the other samples in these genes?
This is bulk RNA-seq?
Yes, this is bulk RNA-seq. The three batches were done with the same protocol as well As for the values I get, I checked a few genes after the outliers were corrected and the new values seemed reasonable and in line with the other samples. ( I am not sure how to get a list of the 4036 genes DESeq2 flags).