Entering edit mode
rbronste
▴
60
@rbronste-12189
Last seen 5.2 years ago
So I am using DiffBind to create a count matrix as summarizedExperiment and then running this through DESeq2 and I was wondering about ways to filter for specific baseMean ranges within this data? My script looks like the following:
library("DESeq2") library("ggplot2") library("BiocParallel") library("DiffBind") parallel=TRUE BPPARAM=MulticoreParam(4) # Create a dba.count object, important that dba.count be unnormalized (DBA_SCORE_READS). ATAC_adult_MA<- dba(sampleSheet = "Adult_ATAC_MA1.csv", peakFormat="narrow") ATAC_Adult_MA_count<-dba.count(ATAC_adult_MA) # Retrieve counts from dba.count rangedCounts <- dba.peakset(ATAC_Adult_MA_count, bRetrieve=TRUE) nrows <- 1095036 ncols <- 12 counts <- matrix(runif(nrows * ncols, 1, 1e5), nrows) rowRanges<-GRanges(rangedCounts) sampleNames<-c("MMV1", "MMV2", "MMV3", "FMV1", "FMV2", "FMV3", "MME7", "MME8", "MME9", "FME1", "FME2", "FME3") sampleSex<-c("male", "male", "male", "female", "female", "female", "male", "male", "male", "female", "female", "female") colData<-data.frame(sampleName=sampleNames, sex=sampleSex) # Retrieve counts from dba.count without the site interval ranges counts <- as.matrix(mcols(rangedCounts)) # Construct a SummarizedExperiment se<-SummarizedExperiment(assays=list(counts=counts),rowRanges=rowRanges, colData=colData) # DESeq2 analysis table(sampleSex) dds<-DESeqDataSet(se, design= ~sex) dds$group <- factor(paste0(dds$sex)) design(dds) <- ~0 + group dds <- dds[ rowSums(counts(dds)) > 1, ] dds <- DESeq(dds, betaPrior = FALSE)
Yes actually that answers part of it, however an even deeper and perhaps slightly naive question is how the baseMean called by DESeq2 relates to the original counts and how does that change with normalization within DESeq2 - trying to determine if the problem is with the stringency of called peaks by MACS2 or something happening to during the differential peak calling. An example being that a lot of shared peaks among samples may have a high baseMean such as those at promoters with the differential peaks having low baseMeans relatively - therefore the overall change in the differential peaks is small. Thanks again!
Take a look here:
https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#more-information-on-results-columns
Yes this is quite helpful however I guess I am still a bit confused as what baseMean represents in an RNA-seq experiment vs the experiment I am indicating with the code above? The binding matrix in question above is made from ATAC-seq data. Thanks.
The column that DESeq2 creates called “baseMean” is rowMeans(counts(dds, normalized=TRUE)) regardless of the type of input data.
Sure of course, I just mean that baseMean in gene expression data represents to some approximation the transcript abundance however in ChIP or ATAC data merely represents the fragment pileup at a specific site, unless I am incorrect?
I'll leave it to the DiffBind package authors I suppose
In this case, where the count score is
DBA_SCORE_READS
, the binding matrix does not have pileup scores, but rather the total number of reads that align overlapping each (consensus) peak. This is similar to what you get in RNA data, where the expression matrix has the raw number of reads that overlap the gene (exons), only in the ATAC case, instead of transcripts, they are overlapping open chromatin regions.Thanks! That explains it really nicely. One additional question is whether in ATAC data using DBA_SCORE_READS this consensus means anything like it does with the RNA-seq data where differences in expression of a gene are inferred. Im assuming it simply means smaller or larger open chromatin regions however what that amounts to functionally, unlike the gene expression data, is not well defined. That is really what I was getting at with what baseMean really represents across the two experimental types.