Question

Why so few genes are returned?

0

Entering edit mode

megapode32559 • 0

@megapode32559-23129

Last seen 4.2 years ago

For the following data, and R code,

https://filebin.net/5zlqg5of4c2o37vu

I got only 1948 genes returned, but count_data.tsv has 26506 rows. Am I supposed to get so few genes with non NA padj. Is there anything wrong with the DESeq2 code that I use? Thanks.

R> sum(!is.na(res$padj))
[1] 1948

count_file='count_data.tsv'
sample_file='meta_data.tsv'
condA='A'
condB='B'
qthr=0.5

read.tsv = function (file, header = T, sep = "\t", quote = "", check.names = F, comment.char = "", stringsAsFactors = F, ...) {
    read.table(file, header = header, sep = sep, quote = quote, check.names = check.names, comment.char = comment.char, stringsAsFactors = stringsAsFactors, ...)
}

sample_table=read.tsv(sample_file)
count_table=read.tsv(count_file)[, sample_table[, 1], drop=F]
design=as.formula(sprintf('~%s', names(sample_table)[[2]]))

suppressWarnings(suppressPackageStartupMessages(library(DESeq2)))
scaled_count_table=counts(
    estimateSizeFactors(
        suppressWarnings(
            DESeqDataSetFromMatrix(
                countData=count_table
                , colData=sample_table[, 2, drop=F]
                , design=design
                )
            )
        )
    , normalized=T
    )
rs=rowSums(scaled_count_table)
sub_count_table=count_table[rs > quantile(rs, qthr), , drop=F]

cds=estimateSizeFactors(
    suppressWarnings(
        DESeqDataSetFromMatrix(
            countData=sub_count_table
            , colData=sample_table
            , design=design
            )
        )
    )

res=data.frame(
    results(
        DESeq(cds, quiet = T)
        , contrast=c(names(sample_table)[[2]], condA, condB)
        )
    )

deseq2 • 524 views

ADD COMMENT • link updated 4.2 years ago by Kevin Blighe ★ 3.9k • written 4.2 years ago by megapode32559 • 0

score 0 · Answer 1 · 2020-03-19

0

Entering edit mode

Kevin Blighe ★ 3.9k

@kevin

Last seen 24 days ago

Republic of Ireland

You seem to be making your life complicated with this. For a simple A vs B comparison, the Quick start section of the vignette should suffice.

To understand NA p-values, please see this section Note on p-values set to NA

Kevin

ADD COMMENT • link 4.2 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

The complexity is not the cause of the problem. Using contrast would be simpler to make the code for work multiple level factors.

R> cds = DESeq(cds, quiet = T)
R> sum(!is.na(data.frame(results(cds, name='treatment_B_vs_A'))$padj))
[1] 1840

If I use the above code without contrast, the results are the same. Without using the quantile filter, the results are the same.

Regarding, the NA's of padj, I understand what it means. But the problem is that somebody running on the same data don't get so many NA's. So I am not sure having only 1840 NAs are correctly. I am not sure what is wrong. Could you please try the data and see if there is anything wrong with the analysis? Thanks.

Here is a simpler code as you suggested.

count_file='count_data.tsv'
sample_file='meta_data.tsv'
condA='A'
condB='B'
qthr=0.5

read.tsv = function (file, header = T, sep = "\t", quote = "", check.names = F, comment.char = "", stringsAsFactors = F, ...) {
    read.table(file, header = header, sep = sep, quote = quote, check.names = check.names, comment.char = comment.char, stringsAsFactors = stringsAsFactors, ...)
}

sample_table=read.tsv(sample_file)
count_table=read.tsv(count_file)[, sample_table[, 1], drop=F]
design=as.formula(sprintf('~%s', names(sample_table)[[2]]))

suppressWarnings(suppressPackageStartupMessages(library(DESeq2)))
cds=DESeqDataSetFromMatrix(
    countData=count_table
    , colData=sample_table[, 2, drop=F]
    , design=design
    )

cds = DESeq(cds, quiet = T)
res=data.frame(
    results(
        cds
        , contrast=c(names(sample_table)[[2]], condA, condB)
        )
    )
sum(!is.na(data.frame(results(cds, name='treatment_B_vs_A'))$padj))

ADD REPLY • link 4.2 years ago megapode32559 • 0

0

Entering edit mode

"But the problem is that somebody running on the same data don't get so many NA's. So I am not sure having only 1840 NAs are correctly"

Please read the section that Kevin linked to.

You can turn off the procedures that produce the NAs if you do not want to employ those procedures.

It's all described in the vignette.

ADD REPLY • link 4.2 years ago Michael Love 42k

0

Entering edit mode

I made a mistake. I meant to say that "I am not sure having only 1840 non-NAs are correct."

The problem is not about turning off NA rows or not (Also, please be specific about which part of the vignette, as I don't think this is a problem that vignette covers).

The problem is I get too few non-NAs rows (1840) for a genome with ~20000 genes. Do you really think the selection of such a small subset of genes is correct? After all, the p-value distribution is titled toward 1 instead of 0. The selection procedure strongly depend on the correctness of the p-values, which we don't have for this dataset.

ADD REPLY • link 4.2 years ago megapode32559 • 0

0

Entering edit mode

I don’t have any remaining suggestions for you.

ADD REPLY • link 4.2 years ago Michael Love 42k

0

Entering edit mode

It sounds like it is something unrelated to DESeq2, specifically, but check every step, like, ensuring that numerical data is encoded as numerical data (and not factors / categorically). Also, your original code looks overly-complicated and prone to producing error - please follow the DESeq2 vignette in order to simplify it. If your colleague obtains a different result, then compare your pipeline to her/his.

ADD REPLY • link 4.2 years ago Kevin Blighe ★ 3.9k