Question: DESeq2 baseMean from DiffBind summarizedExperiment
0
gravatar for rbronste
8 months ago by
rbronste60
rbronste60 wrote:

So I am using DiffBind to create a count matrix as summarizedExperiment and then running this through DESeq2 and I was wondering about ways to filter for specific baseMean ranges within this data? My script looks like the following:

library("DESeq2")
library("ggplot2")
library("BiocParallel")
library("DiffBind")

parallel=TRUE
BPPARAM=MulticoreParam(4)

# Create a dba.count object, important that dba.count be unnormalized (DBA_SCORE_READS).

ATAC_adult_MA<- dba(sampleSheet = "Adult_ATAC_MA1.csv", peakFormat="narrow")

ATAC_Adult_MA_count<-dba.count(ATAC_adult_MA)

# Retrieve counts from dba.count

rangedCounts <- dba.peakset(ATAC_Adult_MA_count, bRetrieve=TRUE)

nrows <- 1095036
ncols <- 12
counts <- matrix(runif(nrows * ncols, 1, 1e5), nrows)
rowRanges<-GRanges(rangedCounts)

sampleNames<-c("MMV1",    "MMV2",    "MMV3",    "FMV1",    "FMV2",    "FMV3", "MME7", "MME8", "MME9", "FME1", "FME2", "FME3")
sampleSex<-c("male", "male", "male", "female", "female", "female", "male", "male", "male", "female", "female", "female")
colData<-data.frame(sampleName=sampleNames, sex=sampleSex)

# Retrieve counts from dba.count without the site interval ranges

counts <- as.matrix(mcols(rangedCounts))

# Construct a SummarizedExperiment

se<-SummarizedExperiment(assays=list(counts=counts),rowRanges=rowRanges, colData=colData)

# DESeq2 analysis

table(sampleSex)

dds<-DESeqDataSet(se, design= ~sex)

dds$group <- factor(paste0(dds$sex))

design(dds) <- ~0 + group

dds <- dds[ rowSums(counts(dds)) > 1, ]

dds <- DESeq(dds, betaPrior = FALSE)
ADD COMMENTlink written 8 months ago by rbronste60
Answer: C: DESeq2 baseMean from DiffBind summarizedExperiment
1
gravatar for Michael Love
8 months ago by
Michael Love23k
United States
Michael Love23k wrote:

I don’t follow the question. You want to pre-filter the dds based on the mean of counts? You can pre-filter the object using dds[keep,] where keep may be defined as rows with mean of normalized count greater than some value or rows where x or more samples have a count greater than or equal to y. This is up to you.

ADD COMMENTlink written 8 months ago by Michael Love23k

Yes actually that answers part of it, however an even deeper and perhaps slightly naive question is how the baseMean called by DESeq2 relates to the original counts and how does that change with normalization within DESeq2 - trying to determine if the problem is with the stringency of called peaks by MACS2 or something happening to during the differential peak calling. An example being that a lot of shared peaks among samples may have a high baseMean such as those at promoters with the differential peaks having low baseMeans relatively - therefore the overall change in the differential peaks is small. Thanks again!

ADD REPLYlink written 8 months ago by rbronste60
1

Take a look here:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#more-information-on-results-columns

ADD REPLYlink written 8 months ago by Michael Love23k

Yes this is quite helpful however I guess I am still a bit confused as what baseMean represents in an RNA-seq experiment vs the experiment I am indicating with the code above? The binding matrix in question above is made from ATAC-seq data. Thanks.

ADD REPLYlink written 8 months ago by rbronste60
1

The column that DESeq2 creates called “baseMean” is rowMeans(counts(dds, normalized=TRUE)) regardless of the type of input data.

ADD REPLYlink written 8 months ago by Michael Love23k

Sure of course, I just mean that baseMean in gene expression data represents to some approximation the transcript abundance however in ChIP or ATAC data merely represents the fragment pileup at a specific site, unless I am incorrect? 

ADD REPLYlink written 8 months ago by rbronste60
1

I'll leave it to the DiffBind package authors I suppose

ADD REPLYlink written 8 months ago by Michael Love23k
1

In this case, where the count score is DBA_SCORE_READS, the binding matrix does not have pileup scores, but rather the total number of reads that align overlapping each (consensus) peak. This is similar to what you get in RNA data, where the expression matrix has the raw number of reads that overlap the gene (exons), only in the ATAC case, instead of transcripts, they are overlapping open chromatin regions.

ADD REPLYlink written 8 months ago by Rory Stark2.8k

Thanks! That explains it really nicely. One additional question is whether in ATAC data using DBA_SCORE_READS this consensus means anything like it does with the RNA-seq data where differences in expression of a gene are inferred. Im assuming it simply means smaller or larger open chromatin regions however what that amounts to functionally, unlike the gene expression data, is not well defined. That is really what I was getting at with what baseMean really represents across the two experimental types.  

ADD REPLYlink written 8 months ago by rbronste60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 183 users visited in the last hour