Question

Skew of differentially expressed genes using DESeq2

0

Entering edit mode

b88dfe53 • 0

@f1db776d

Last seen 23 months ago

United States

When I run DESeq2 to call differential expression in my data set, I get this strange pattern. I have two experimental conditions, "NM" and "IVF". When I compare them, you can see in the TPM plot that differential expression in very highly expressed genes is almost completely in the NM group, whereas DE in lowly expressed genes is in the IVF group (highlighted in navy). This seems like it could be a technical artifact as I have run similar types of sequencing experiments with different treatments and it has never looked like this. I have also checked the number of genes expressed in each treatment and they look very similar so I don't think that's the problem. Is there any way to correct for this? Or even to know if it's biological or technical? This is an RNA-seq experiment where each sample is an individual embryo so it is a very low-input experiment but I have ~50 samples for each treatment group. Any help would be very much appreciated!

enter image description here

The dispersion estimates plot seems to be normal, but you can see the skew in the MA plot:

enter image description here

The code I used to run DESeq is below:


deseq_input <- DESeqDataSetFromMatrix(rawcounts, treatment, ~ART_type+Pseudostage, tidy=TRUE)
deseq_output <- DESeq(deseq_input, fitType = "local")
deseq_results <- results(deseq_output, contrast=c("ART_type","IVF","NM"), pAdjustMethod = "fdr")

I also get a large number of genes that are called as DE but only due to a few outlier samples (see plot of gene TPM below). This is after removing obvious outliers using a PCA plot, so these are not the same few samples from gene to gene (otherwise I would just remove the problem samples). Is there any way to tell DESeq not to call these genes as DE?

enter image description here

DESeq2 • 1.3k views

ADD COMMENT • link updated 23 months ago by Michael Love 42k • written 23 months ago by b88dfe53 • 0

score 0 · Answer 1 · 2022-10-18

0

Entering edit mode

ATpoint ★ 4.4k

@atpoint-13662

Last seen 25 days ago

Germany

I would argue that the entire MA-plot would need to be shifted upwards so normalization is suboptimal, probably due to the asymmetrical DE profile. This can probably be solved by some prefiltering as in the vignette and calculating size factors based on genes with large baseMean, for example the top 20% genes with largest baseMean. As for the genes with low effect size, that is what the lfc argument in lfcShrink is for, so testing against a minimal fold change to avoid significances for genes with large counts but tiny effect sizes.

ADD COMMENT • link 23 months ago ATpoint ★ 4.4k

0

Entering edit mode

Agree with providing the genes with say, baseMean > 1000 to the controlGenes argument of estimateSizeFactors, before running DESeq(). For this particular dataset, you need to provide additional information than just relying on in silico normalization over all genes.

ADD REPLY • link 23 months ago Michael Love 42k

0

Entering edit mode

Thanks for the responses. I tried the suggestions, adding pre-filtering and calculating size factors based on genes with baseMean>1000 using the following code:

## pre-filtering

deseq_input <- DESeqDataSetFromMatrix(rawcounts, treatment, ~ART_type+Pseudostage, tidy=TRUE)
keep <- rowSums(counts(deseq_input)) >= 10
deseq_input <- deseq_input[keep, ]

## run DE calculation

deseq_output <- DESeq(deseq_input, fitType = "local")
results <- results(deseq_output, contrast=c("ART_type","IVF","NM"), pAdjustMethod = "fdr") 

## calculate size factors based on genes with high mean

keep2 <- results$baseMean>1000
high_means <- grep("TRUE", keep2)
deseq_input2 <- estimateSizeFactors(deseq_input, controlGenes=high_means)

## re-run DE calculation with new size factors

deseq_output2 <- DESeq(deseq_input2, fitType = "local")
results2 <- results(deseq_output2, contrast=c("ART_type","IVF","NM"), pAdjustMethod = "fdr")

However, I get the following MA plot:

enter image description here

So, now almost all DE genes are overexpressed in the IVF treatment. This doesn't seem possible biologically. When I look at the results call, I get a much higher number of DE genes after filtering baseMean, with 55% of genes overexpressed:

> summary(results)

out of 17349 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 2065, 12%
LFC < 0 (down)     : 691, 4%
outliers [1]       : 235, 1.4%
low counts [2]     : 4505, 26%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> summary(results2)

out of 17349 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 9522, 55%
LFC < 0 (down)     : 74, 0.43%
outliers [1]       : 300, 1.7%
low counts [2]     : 3510, 20%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

I also tried using lfcShrink using the three different types:

resLFC_nor <- lfcShrink(deseq_output2, coef = "ART_type_IVF_vs_NM", type="normal")
resLFC_ape <- lfcShrink(deseq_output2, coef = "ART_type_IVF_vs_NM", type="apeglm")
resLFC_ash <- lfcShrink(deseq_output2, coef = "ART_type_IVF_vs_NM", type="ashr")

> summary(resLFC_nor)

out of 17349 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 9518, 55%
LFC < 0 (down)     : 78, 0.45%
outliers [1]       : 300, 1.7%
low counts [2]     : 3510, 20%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> summary(resLFC_ape)

out of 17349 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 9522, 55%
LFC < 0 (down)     : 74, 0.43%
outliers [1]       : 300, 1.7%
low counts [2]     : 3510, 20%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> summary(resLFC_ash)

out of 17349 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 9522, 55%
LFC < 0 (down)     : 74, 0.43%
outliers [1]       : 300, 1.7%
low counts [2]     : 3510, 20%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

I get the following plots:

enter image description here

As far as I understand, the ashr option does the best job at shrinkage. However, I get error messages with that option such as:

warning: solve(): system is singular (rcond: 2.29493e-18); attempting approx solution

warning: solve(): system is singular (rcond: 2.39535e-18); attempting approx solution

I'm not sure what's causing these as the apeglm and normal settings don't cause any errors.

I have also tried changing the lfcThreshold option but I don't really understand how to optimize this. I tried with a 1.5 fold change and I get vastly different results (see below). I don't want to pick an arbitrary number...what's the best way to set a cutoff here?

resLFC_nor1.5 <- lfcShrink(deseq_output2, coef = "ART_type_IVF_vs_NM", type="normal", lfcThreshold = log(1.5,2))

> summary(resLFC_nor1.5)

out of 17349 with nonzero total read count
adjusted p-value < 0.1
LFC > 0.58 (up)    : 36, 0.21%
LFC < -0.58 (down) : 0, 0%
outliers [1]       : 317, 1.8%
low counts [2]     : 4128, 24%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

enter image description here

I apologize for the long post and thanks again for any help!

ADD REPLY • link 23 months ago b88dfe53 • 0

0

Entering edit mode

So, now almost all DE genes are overexpressed in the IVF treatment. This doesn't seem possible biologically.

And yet you have many genes with an LFC that is larger than what appears to be the "non-changing" LFC at the x-axis. And it is precise. You can look at these genes with high LFC with plotCounts. I guess the question is, why are these genes having larger LFC than the ones that are on the x-axis? And if you don't care about positive LFC, then you should use the LFC threshold.

I have also tried changing the lfcThreshold option but I don't really understand how to optimize this

This has to be chosen based on the biological interpretation. You have said that the genes with positive LFC but small are maybe not "plausible" or maybe not practically important in this experiment. Then you need to set a threshold of practical significance.

ADD REPLY • link 23 months ago Michael Love 42k