Question: DESeq2 gives weird results
0
gravatar for chipolino
4 months ago by
chipolino0
chipolino0 wrote:

Hi,

I am doing DE analysis. I have 4 conditions: control, drug1, drug2, drug3 (2 replicates for each condition). I started with Salmon to count transcripts and imported them with tximport. After this I constructed DESeq data set using DESeqDataSetFromTximport following this tutorial. Then I tried to compare control with drug1. Here is my code:

>library(tximportData)
>library(tximport)
>library(tximeta)
>library(DESeq2)
>library(GenomicFeatures)

>files <- file.path("path_to_salmon_files", "quants", metadata$Quant, "quant.sf")
>names(files) <- metadata$Samples
>TxDb <- makeTxDbFromGFF(file="gencode.v30.annotation.gff3.gz")

>k <- keys(TxDb, keytype = "TXNAME")
>tx2gene <- AnnotationDbi::select(TxDb, k, "GENEID", "TXNAME")

>txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreAfterBar = TRUE)

>ddsTxi <-DESeqDataSetFromTximport(txi, colData = metadata, design =~Treatment)

>keep <-rowSums(counts(ddsTxi))>=10
>ddsTxi <- ddsTxi[keep,]

#"drug1_1", "drug1_2" are technical replicates
>xt <- ddsTxi[,c("drug1_1", "drug1_2", "control1", "control2")]
>yt <- factor(c("D", "D", "C", "C"))
>xt$Treatment <- yt
>xt <-DESeq(xt)
>res <-results(xt, name="Treatment_C_vs_D")

>resOrdered <- res[order(res$padj),]
>results <-as.data.frame(resOrdered)

Then when I check results I see this information for gene ETV6:

| baseMean  | log2FoldChange    | lfcSE     | stat      | padj      |
|---------- |----------------   |-------    |------     |------     |
| 661.72988 | -3.5438716        | 0.3112069 | -11.387510| 2.982196e-26   |

Which basically means that the gene is downregulated in control compared to drug1 (with log fold change 3.5). But if I look at raw counts, this is what I see:

| drug1_1 | drug1_2      | control1 | control2 |
|----------|--------------|-------  |------    |
| 280      | 357          | 396     | 401      | 

So according to raw counts, there is no significant change in control compared to treatment. And almost all counts come from the single transcript in all conditions. All library sizes are very similar (~15M). What might be the reason for such log fold change in DESeq analysis? Can it be that in raw counts there is no change, but due to normalization relative levels of this gene in the drug treatment (compared to all other genes) increased?

P.S. When I analyze the same data with limma-voom, I get logFC for this gene 0.23 (increase in control compared to drug1).

deseq2 salmon • 115 views
ADD COMMENTlink modified 4 months ago by Michael Love25k • written 4 months ago by chipolino0
Answer: DESeq2 gives weird results
0
gravatar for Michael Love
4 months ago by
Michael Love25k
United States
Michael Love25k wrote:

Note that the base mean for the gene you have selected is 661.7. But the counts that you posted are all smaller.

What are the normalized counts instead of the raw counts?

ADD COMMENTlink written 4 months ago by Michael Love25k

Thank you for your response!

Here are the normalized counts:

| drug1_1  | drug1_2      | control1 | control2  |
|----------|--------------|-------   |------     |
| 1293.573 | 1144.279     | 90.263   | 118.803   | 

I also checked the quant.sf files from Salmon for this gene and that's what I see:

Drug1

|Transcript| Length  | EffectiveLength | TPM       |NumReads   |
|----------|---------|-------          |------     |------     |
| ETV6-202 | 5989    | 5727.000        | 0.130581  | 275.035   | 
| ETV6-203 | 572     | 323.000         | 0         | 0         | 
| ETV6-204 | 733     | 484.000         | 0         | 0         | 
| ETV6-205 | 652     | 403.000         | 0         | 0         | 
| ETV6-201 | 160     | 6.178           | 2.185476  | 4.965     | 

Drug2

|Transcript| Length  | EffectiveLength | TPM       |NumReads   |
|----------|---------|-------          |------     |------     |
| ETV6-202 | 5989    | 5727.000        | 0.107006  | 351.076   | 
| ETV6-203 | 572     | 323.000         | 0         | 0         | 
| ETV6-204 | 733     | 484.000         | 0.004320  | 1.198     | 
| ETV6-205 | 652     | 403.000         | 0         | 0         | 
| ETV6-201 | 160     | 6.178           | 1.354557  | 4.794     | 

Control1

|Transcript| Length  | EffectiveLength | TPM       |NumReads   |
|----------|---------|-------          |------     |------     |
| ETV6-202 | 5989    | 5727.000        | 0.191470  | 396.000   | 
| ETV6-203 | 572     | 323.000         | 0         | 0         | 
| ETV6-204 | 733     | 484.000         | 0         | 0         | 
| ETV6-205 | 652     | 403.000         | 0         | 0         | 
| ETV6-201 | 160     | 6.178           | 0         | 0         | 

Control2

|Transcript| Length  | EffectiveLength | TPM       |NumReads   |
|----------|---------|-------          |------     |------     |
| ETV6-202 | 5989    | 5727.000        | 0.084566  | 393.219   | 
| ETV6-203 | 572     | 323.000         | 0         | 0         | 
| ETV6-204 | 733     | 484.000         | 0.006440  | 2.531     | 
| ETV6-205 | 652     | 403.000         | 0.016047  | 5.251     | 
| ETV6-201 | 160     | 6.178           | 0         | 0         | 

So looks like the overall difference comes from ETV6-201 transcript... but it has just a few reads in the drug treatment. How can I take this into account? Should I use limma-voom instead?

ADD REPLYlink modified 4 months ago • written 4 months ago by chipolino0
1

The issue is not limma-voom vs DESeq2, but using counts alone vs. using the transcript lengths.

I'm suspicious, like you are, about the large change in normalized counts from the expression switching to the short isoform, given there are only a few reads allocated to that transcript.

My quick fix for this dataset would be to only use the txi$counts, and pass these to DESeqDataSetFromMatrix, so tossing out the average transcript lengths for normalization, because we don't trust it in the above example. A more involved approach to deal with these short isoforms would be to generate the transcript-level matrices with tximport, then remove all the transcripts with a very low read count, then summarizeToGene and pass to DESeqDataSetFromTximport.

ADD REPLYlink written 4 months ago by Michael Love25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 297 users visited in the last hour