Why so few genes are returned?
1
0
Entering edit mode
@megapode32559-23129
Last seen 4.2 years ago

For the following data, and R code,

https://filebin.net/5zlqg5of4c2o37vu

I got only 1948 genes returned, but count_data.tsv has 26506 rows. Am I supposed to get so few genes with non NA padj. Is there anything wrong with the DESeq2 code that I use? Thanks.

R> sum(!is.na(res$padj))
[1] 1948
count_file='count_data.tsv'
sample_file='meta_data.tsv'
condA='A'
condB='B'
qthr=0.5

read.tsv = function (file, header = T, sep = "\t", quote = "", check.names = F, comment.char = "", stringsAsFactors = F, ...) {
    read.table(file, header = header, sep = sep, quote = quote, check.names = check.names, comment.char = comment.char, stringsAsFactors = stringsAsFactors, ...)
}

sample_table=read.tsv(sample_file)
count_table=read.tsv(count_file)[, sample_table[, 1], drop=F]
design=as.formula(sprintf('~%s', names(sample_table)[[2]]))

suppressWarnings(suppressPackageStartupMessages(library(DESeq2)))
scaled_count_table=counts(
    estimateSizeFactors(
        suppressWarnings(
            DESeqDataSetFromMatrix(
                countData=count_table
                , colData=sample_table[, 2, drop=F]
                , design=design
                )
            )
        )
    , normalized=T
    )
rs=rowSums(scaled_count_table)
sub_count_table=count_table[rs > quantile(rs, qthr), , drop=F]

cds=estimateSizeFactors(
    suppressWarnings(
        DESeqDataSetFromMatrix(
            countData=sub_count_table
            , colData=sample_table
            , design=design
            )
        )
    )

res=data.frame(
    results(
        DESeq(cds, quiet = T)
        , contrast=c(names(sample_table)[[2]], condA, condB)
        )
    )
deseq2 • 524 views
ADD COMMENT
0
Entering edit mode
Kevin Blighe ★ 3.9k
@kevin
Last seen 24 days ago
Republic of Ireland

You seem to be making your life complicated with this. For a simple A vs B comparison, the Quick start section of the vignette should suffice.

To understand NA p-values, please see this section Note on p-values set to NA

Kevin

ADD COMMENT
0
Entering edit mode

The complexity is not the cause of the problem. Using contrast would be simpler to make the code for work multiple level factors.

R> cds = DESeq(cds, quiet = T)
R> sum(!is.na(data.frame(results(cds, name='treatment_B_vs_A'))$padj))
[1] 1840

If I use the above code without contrast, the results are the same. Without using the quantile filter, the results are the same.

Regarding, the NA's of padj, I understand what it means. But the problem is that somebody running on the same data don't get so many NA's. So I am not sure having only 1840 NAs are correctly. I am not sure what is wrong. Could you please try the data and see if there is anything wrong with the analysis? Thanks.

Here is a simpler code as you suggested.

count_file='count_data.tsv'
sample_file='meta_data.tsv'
condA='A'
condB='B'
qthr=0.5

read.tsv = function (file, header = T, sep = "\t", quote = "", check.names = F, comment.char = "", stringsAsFactors = F, ...) {
    read.table(file, header = header, sep = sep, quote = quote, check.names = check.names, comment.char = comment.char, stringsAsFactors = stringsAsFactors, ...)
}

sample_table=read.tsv(sample_file)
count_table=read.tsv(count_file)[, sample_table[, 1], drop=F]
design=as.formula(sprintf('~%s', names(sample_table)[[2]]))

suppressWarnings(suppressPackageStartupMessages(library(DESeq2)))
cds=DESeqDataSetFromMatrix(
    countData=count_table
    , colData=sample_table[, 2, drop=F]
    , design=design
    )

cds = DESeq(cds, quiet = T)
res=data.frame(
    results(
        cds
        , contrast=c(names(sample_table)[[2]], condA, condB)
        )
    )
sum(!is.na(data.frame(results(cds, name='treatment_B_vs_A'))$padj))
ADD REPLY
0
Entering edit mode

"But the problem is that somebody running on the same data don't get so many NA's. So I am not sure having only 1840 NAs are correctly"

Please read the section that Kevin linked to.

You can turn off the procedures that produce the NAs if you do not want to employ those procedures.

It's all described in the vignette.

ADD REPLY
0
Entering edit mode

I made a mistake. I meant to say that "I am not sure having only 1840 non-NAs are correct."

The problem is not about turning off NA rows or not (Also, please be specific about which part of the vignette, as I don't think this is a problem that vignette covers).

The problem is I get too few non-NAs rows (1840) for a genome with ~20000 genes. Do you really think the selection of such a small subset of genes is correct? After all, the p-value distribution is titled toward 1 instead of 0. The selection procedure strongly depend on the correctness of the p-values, which we don't have for this dataset.

ADD REPLY
0
Entering edit mode

I don’t have any remaining suggestions for you.

ADD REPLY
0
Entering edit mode

It sounds like it is something unrelated to DESeq2, specifically, but check every step, like, ensuring that numerical data is encoded as numerical data (and not factors / categorically). Also, your original code looks overly-complicated and prone to producing error - please follow the DESeq2 vignette in order to simplify it. If your colleague obtains a different result, then compare your pipeline to her/his.

ADD REPLY

Login before adding your answer.

Traffic: 337 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6