Question

DESeq2 baseMean from DiffBind summarizedExperiment

0

Entering edit mode

rbronste ▴ 60

@rbronste-12189

Last seen 6.2 years ago

So I am using DiffBind to create a count matrix as summarizedExperiment and then running this through DESeq2 and I was wondering about ways to filter for specific baseMean ranges within this data? My script looks like the following:

library("DESeq2")
library("ggplot2")
library("BiocParallel")
library("DiffBind")

parallel=TRUE
BPPARAM=MulticoreParam(4)

# Create a dba.count object, important that dba.count be unnormalized (DBA_SCORE_READS).

ATAC_adult_MA<- dba(sampleSheet = "Adult_ATAC_MA1.csv", peakFormat="narrow")

ATAC_Adult_MA_count<-dba.count(ATAC_adult_MA)

# Retrieve counts from dba.count

rangedCounts <- dba.peakset(ATAC_Adult_MA_count, bRetrieve=TRUE)

nrows <- 1095036
ncols <- 12
counts <- matrix(runif(nrows * ncols, 1, 1e5), nrows)
rowRanges<-GRanges(rangedCounts)

sampleNames<-c("MMV1",    "MMV2",    "MMV3",    "FMV1",    "FMV2",    "FMV3", "MME7", "MME8", "MME9", "FME1", "FME2", "FME3")
sampleSex<-c("male", "male", "male", "female", "female", "female", "male", "male", "male", "female", "female", "female")
colData<-data.frame(sampleName=sampleNames, sex=sampleSex)

# Retrieve counts from dba.count without the site interval ranges

counts <- as.matrix(mcols(rangedCounts))

# Construct a SummarizedExperiment

se<-SummarizedExperiment(assays=list(counts=counts),rowRanges=rowRanges, colData=colData)

# DESeq2 analysis

table(sampleSex)

dds<-DESeqDataSet(se, design= ~sex)

dds$group <- factor(paste0(dds$sex))

design(dds) <- ~0 + group

dds <- dds[ rowSums(counts(dds)) > 1, ]

dds <- DESeq(dds, betaPrior = FALSE)

deseq2 diffbind basemean summarizedexperiment • 2.8k views

ADD COMMENT • link 7.5 years ago rbronste ▴ 60

score 1 · Answer 1 · 2018-08-04

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

I don’t follow the question. You want to pre-filter the dds based on the mean of counts? You can pre-filter the object using dds[keep,] where keep may be defined as rows with mean of normalized count greater than some value or rows where x or more samples have a count greater than or equal to y. This is up to you.

ADD COMMENT • link 7.5 years ago Michael Love 43k

0

Entering edit mode

Yes actually that answers part of it, however an even deeper and perhaps slightly naive question is how the baseMean called by DESeq2 relates to the original counts and how does that change with normalization within DESeq2 - trying to determine if the problem is with the stringency of called peaks by MACS2 or something happening to during the differential peak calling. An example being that a lot of shared peaks among samples may have a high baseMean such as those at promoters with the differential peaks having low baseMeans relatively - therefore the overall change in the differential peaks is small. Thanks again!

ADD REPLY • link 7.5 years ago rbronste ▴ 60

1

Entering edit mode

Take a look here:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#more-information-on-results-columns

ADD REPLY • link 7.5 years ago Michael Love 43k

0

Entering edit mode

Yes this is quite helpful however I guess I am still a bit confused as what baseMean represents in an RNA-seq experiment vs the experiment I am indicating with the code above? The binding matrix in question above is made from ATAC-seq data. Thanks.

ADD REPLY • link 7.5 years ago rbronste ▴ 60

1

Entering edit mode

The column that DESeq2 creates called “baseMean” is rowMeans(counts(dds, normalized=TRUE)) regardless of the type of input data.

ADD REPLY • link 7.5 years ago Michael Love 43k

0

Entering edit mode

Sure of course, I just mean that baseMean in gene expression data represents to some approximation the transcript abundance however in ChIP or ATAC data merely represents the fragment pileup at a specific site, unless I am incorrect?

ADD REPLY • link 7.5 years ago rbronste ▴ 60

1

Entering edit mode

I'll leave it to the DiffBind package authors I suppose

ADD REPLY • link 7.5 years ago Michael Love 43k

1

Entering edit mode

In this case, where the count score is DBA_SCORE_READS, the binding matrix does not have pileup scores, but rather the total number of reads that align overlapping each (consensus) peak. This is similar to what you get in RNA data, where the expression matrix has the raw number of reads that overlap the gene (exons), only in the ATAC case, instead of transcripts, they are overlapping open chromatin regions.

ADD REPLY • link 7.5 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Thanks! That explains it really nicely. One additional question is whether in ATAC data using DBA_SCORE_READS this consensus means anything like it does with the RNA-seq data where differences in expression of a gene are inferred. Im assuming it simply means smaller or larger open chromatin regions however what that amounts to functionally, unlike the gene expression data, is not well defined. That is really what I was getting at with what baseMean really represents across the two experimental types.

ADD REPLY • link 7.5 years ago rbronste ▴ 60