TMM Normalization: DESeqDataSet: some values in assay are negative
neekonsu • 0

Hi all,

I am working with the publicly available RNA-seq data from the GTEx database at https://storage.googleapis.com/gtexanalysisv7/rnaseqdata/GTExAnalysis2016-01-15v7RNASeQCv1.1.8genereads.gct.gz

I have normalized the count data using edgeR's calcNormFactors() and cpm(x, log=TRUE) functions, and I am trying to run my differential expression analysis with DESeq2. DESeqDataSetFromMatrix() fails with "some values in assay are negative" when I pass it the normalized counts, and I am not sure how to fix this error.

Is it possible to use the normalized data with DESeq(2)? If so, I would love to see how. The full pipeline is attached below, and I greatly appreciate any help!

Thanks,

Neekon


coldata["condition"] = condition
coldata["color"] = condition.color
coldata["cluster"] = condition.cluster
head(coldata, 10)

y <- DGEList(counts=countdata)
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
data.scaled <- cpm(y, log=TRUE)

fviz_pca_ind(df.pca, label="none", habillage = condition.color, geom.ind="point")

dds <- DESeqDataSetFromMatrix(countData = t(data.scaled), colData = coldata, design= ~ condition)

dds.color <- DESeqDataSetFromMatrix(data.scaled = t(data.scaled), colData = coldata, design= ~ color)

dds.cluster <- DESeqDataSetFromMatrix(countData = t(data.scaled), colData = coldata, design= ~ cluster)

DESeq(dds)

DESeq(dds.color)

DESeq(dds.cluster)

res <- results(dds, name = "results")
summary(res)

res.color <- results(dds.color, name = "results.color")
summary(res.color)

res.cluster <- results(dds.cluster, name = "results.cluster")
summary(res.cluster)

Simon Anders ★ 3.7k
Zentrum für Molekularbiologie, Universi…

DESeq2 wants raw unnormalized counts. Do not supply anything else, ever.
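A quick sanity check before building the dataset can catch transformed input early. The sketch below uses only base R; `looks_like_raw_counts` is a hypothetical helper name, not part of DESeq2 or edgeR, and the toy matrices are made up for illustration.

```r
# Hypothetical helper: TRUE only if `m` looks like raw counts,
# i.e. numeric, no missing values, non-negative, and integer-valued.
looks_like_raw_counts <- function(m) {
  m <- as.matrix(m)
  is.numeric(m) &&
    !anyNA(m) &&
    all(m >= 0) &&
    all(m == round(m))
}

# Toy raw counts (3 genes x 2 samples).
raw <- matrix(c(0, 12, 5, 300, 7, 0), nrow = 3)

# Toy log-transformed values; some entries are negative or fractional.
log_values <- log2(raw / 10 + 0.5)

looks_like_raw_counts(raw)         # TRUE
looks_like_raw_counts(log_values)  # FALSE
```

A check like this only flags obviously unsuitable input; it cannot tell whether integer-valued data were normalized and then rounded.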

Obviously, the easiest would be to run the whole analysis (normalization and DE testing) using either only edgeR or only DESeq2.

It is possible to mix both, by extracting the normalization coefficients from edgeR and handing them over to DESeq2. If you really need that, I can dig out how to do that, but I then would be curious why. This is a very unusual approach, and unless you know very well what you are doing and why, I would advise against it.


That is great to know, thanks for the speedy answer! I was looking to mix the two because I am comparing the edgeR and DESeq(2) pipelines and wanted to keep the normalization step constant in the comparison. However, since I am comparing them on performance, I think I will stick with the recommended DESeq implementation. I would still hugely appreciate it if you could tell me how to use norm factors from edgeR in DESeq; it's not a huge deal, but I am curious. Thanks again for your advice!

-Neekon


If you want to truly compare pipeline performance, you should let each method do the normalization its own way. Each method has been designed with its own philosophy and underlying assumptions, so mixing parts of one method with another is likely to give you sub-optimal performance. (Or, if you have enough time, you could test both mixed and unmixed and see for yourself.)

wunderl ▴ 30

You are likely getting this error because of log=TRUE in your call to cpm(). The log of any value less than 1 is negative, so any gene whose normalized CPM falls below 1 in a sample produces a negative number once you take the log. This issue also occurs with DESeq2's rlog transformation.
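The effect is easy to reproduce in base R. This is a toy illustration with made-up CPM values; edgeR's cpm(log=TRUE) also adds a small prior count before taking log2, which this sketch ignores.

```r
# Made-up CPM values for one gene across five samples.
cpm_values <- c(0.25, 0.5, 1, 2, 100)

# log2 of anything below 1 is negative; log2(1) is exactly 0.
log_cpm <- log2(cpm_values)
log_cpm

# Which samples would trip the "some values in assay are negative" check?
which(log_cpm < 0)
```

Only the first two samples, whose CPM is below 1, end up negative.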

As was mentioned in the other response, DESeq will only work with raw, unnormalized counts. For information on why, see here.

If you really want to use normalization factors from edgeR, then you still need to provide DESeq with the raw counts and then provide the normalization factors separately.

I am not familiar with edgeR, so how you provide the normalization factors will depend on if edgeR returns per-sample based normalization factors (a vector with one entry for each sample) or if they are per-gene normalization factors (a matrix in the form gene X samples). For official documentation on this process, see here.

## Sample based normalization factors

dds = DESeqDataSetFromMatrix(countData=rawCounts,.....)
sizeFactors(dds) = normalizationFactorsFromEdgeR
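One detail worth spelling out: edgeR's norm.factors rescale the library sizes rather than the counts themselves, so they are not directly comparable to DESeq2 size factors. A commonly used conversion, which is an assumption here and not something stated in this thread, is to multiply each library size by its norm factor and then divide by the geometric mean so the result is centered around 1. The sketch below is base R only; `edger_to_size_factors` is a hypothetical helper, and the library sizes and factors are made up.

```r
# Hypothetical helper. `lib_size` and `norm_factors` stand in for
# y$samples$lib.size and y$samples$norm.factors from an edgeR DGEList.
edger_to_size_factors <- function(lib_size, norm_factors) {
  eff_lib_size <- lib_size * norm_factors   # effective library sizes
  eff_lib_size / exp(mean(log(eff_lib_size)))  # center: geometric mean of 1
}

# Made-up values for three samples.
lib_size     <- c(s1 = 2.0e7, s2 = 3.5e7, s3 = 1.5e7)
norm_factors <- c(s1 = 0.95,  s2 = 1.10,  s3 = 0.957)

sf <- edger_to_size_factors(lib_size, norm_factors)
sf

# With real objects this would be used roughly as:
# sizeFactors(dds) <- edger_to_size_factors(y$samples$lib.size,
#                                           y$samples$norm.factors)
```

The result must be in the same sample order as the columns of the count matrix given to DESeq2.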


## Gene based normalization factors

dds = DESeqDataSetFromMatrix(countData=rawCounts,.....)
normalizationFactors(dds) = normalizationFactorsFromEdgeR


Note: The authors of DESeq recommend transforming the normalization factors so that the geometric mean of each row (i.e., gene) is 1, so that the mean of normalized counts for a gene is close to the mean of the unnormalized counts. If you want to follow this recommendation, then you would do the following:

dds = DESeqDataSetFromMatrix(countData=rawCounts,.....)

normFactors = normalizationFactorsFromEdgeR
normFactors = normFactors / exp(rowMeans(log(normFactors)))

normalizationFactors(dds) = normFactors
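To see what the division by exp(rowMeans(log(...))) does, here is a small base R check on a made-up gene-by-sample matrix of factors. The matrix and its dimension names are purely illustrative.

```r
set.seed(1)

# Made-up per-gene, per-sample normalization factors (4 genes x 3 samples).
nf <- matrix(runif(12, min = 0.5, max = 2), nrow = 4,
             dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Divide each row by its geometric mean. R recycles the length-4 vector
# down the columns, so entry [i, j] is divided by the mean for gene i.
nf_centered <- nf / exp(rowMeans(log(nf)))

# After centering, every row's geometric mean is 1 (up to rounding error).
exp(rowMeans(log(nf_centered)))
```

The relative differences between samples within each gene are unchanged; only the per-gene scale is shifted.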


After setting the normalization factors using one of the methods above, continue with your regular downstream analysis, and DESeq should use the provided normalization factors instead of calculating its own. I recommend checking the help pages for normalizationFactors and sizeFactors if you need more information on how each function works.