I'm very new to R, Bioinformatics, and programming in general, so I appreciate the aid.
I've acquired a consensus peakset count matrix from a program for ChipSeq MCAP Data with a minimum overlap of 1 across ~40 samples. These samples can be divided into two groups, and I hope to contrast them to determine differentially methylated regions/genes with DiffBind. The problem lies with the fact that I'm unsure how the data was normalized, if it was at all. Therefore, I would like to do this on my own from the counts that I was given.
The data is formatted in the below format. Note that I've acquired the fold change data as the peaks for each sample are quantified as either a 0 or 1 rather than a range in between. Therefore, I don't think it would be as useful for normalization as the fold change values themselves.
I've tried doing successful calls to dba.peakset for each sample to generate a new DBA object. However, it doesn't seem to work, yielding an error message I'm unclear about. The code I used is below.
> head(BAandWA) #partial head
Chromosome Start End Fold.change.BA12_MCAP_Peaks.with.gene.annotations. Fold.change.BA16_MCAP_Peaks.with.gene.annotations.
1 chr1 9993 11257 17.308525 5.414499
2 chr1 51517 52052 0.000000 21.969871
3 chr1 121222 121570 7.496554 14.837665
4 chr1 135042 135329 0.000000 5.802254
5 chr1 136266 137151 8.312252 9.293681
6 chr1 180710 181826 8.730159 11.106704
> countMatrix <- BAandWA[, -c(1:3)]
> newDBA <- NULL
> for(sample in 1:ncol(countMatrix)){
newDBA <- dba.peakset(newDBA, peaks = BAandWA, sampID = colnames(countMatrix)[sample], counts = countMatrix[, sample])
}
Error in if (treads) { : argument is not interpretable as logical
> newDBA
1 Samples, 71010 sites in matrix:
ID Caller Intervals Reads FRiP
1 Fold.change.BA12_MCAP_Peaks.with.gene.annotations. counts 71010 149058.4 1
I'd like to know how to fix this. Then, I'd like to know how to perform a normalization and contrast. Would dba.contrast and dba.analyze work fine?
Thank you for the reply. I can get the bam files, but I'd need to export them from the program itself apparently, and it would take a long time considering the size of each file.
However, I've been reading through the documentation, and I see that I can read in a peakset containing the chromosome, start, end, and confidence score in a peak. I do have all of those values for each sample, at least I believe I do. I've used the MACS algorithm, and I've read that the scores are returned as the -log of the pvalue, indicating that a high number means greater confidence. I have columns titled P score and Q score for each sample, so I believe I can use these.
My question is: is that all I need to continue with the analysis? Do I need the raw read counts or the bam files?
You need the bam files (or a raw count matrix) to do any kind of quantitative analysis and compute confidence statistics (p-values/FDR values).
Without the reads, you can only do what is referred to as an "occupancy analysis" in the vignette. That is, you can do overlaps (basically Venn diagrams), basic clustering, and not much else. You can find peaks that are identified uniquely in one sample group vs. another group, but you can't assess if these changes are meaningful. More importantly, you can't really assess changes in binding strength in sites that are identified across sample groups; these are often some of the most important sites.