edgeR error when trying to filter peaks with low read counts
2
0
Entering edit mode
camerond • 0
@camerond-15316
Last seen 5 months ago

Hello,

I have successfully run edgeR on an ATAC-seq dataset to identify differential open chromatin peaks across two cell types. I wanted to normalise for peaks read count, GC content and peak width, so my method was to create a consensus peak set in Diffbind (2 cell types, 3 replicates per cell type), then to use the CQN package for normalisation steps. This creates a GLM offset to add to the DGEList object in edgeR that accounts for the normalisation.

The issue I have is that when I try to filter my DGEList object in edgeR to remove 'peaks' that have very low read counts using the command:

keep <- rowSums(cpm(peak.list) > 2) >= 3

I get the following error:

Error in cpm.default(x$counts, lib.size = lib.size, log = log, prior.count = prior.count) : matrix dimensions are not consistent for non-repeating number of columns I'm not sure what's going on here. I suspect it may have something to do with the additional parameters I used whilst setting up the DGEList object to account for the CQN normalisation, my command was this: peak.list <- DGEList(counts = count.matrix, lib.size = depth.matrix, group = rep(c("grp1", "grp2"), each = 3), genes = GC.count.matrix) The offset I mentioned is added later: peak.list$offset <- cqn.ch1$glm.offset and creates the following object: An object of class "DGEList"$counts
ATAC16_1 ATAC17_1 ATAC18_2 ATAC20_4 ATAC22_1 ATAC22_2
chr1_10002_10467       58       54      126      153      143       65
chr1_15409_15764        4        4        1        2        6        6
chr1_17404_17589       28       27       34       58       64       38
chr1_21206_21446        1        1        1        1        1        1
chr1_26045_26177        1        1        1        1        1        1
22010 more rows ...

$samples group lib.size.ATAC16_1 lib.size.ATAC17_1 lib.size.ATAC18_2 lib.size.ATAC20_4 lib.size.ATAC22_1 lib.size.ATAC22_2 norm.factors ATAC16_1 grp1 5620945 2511455 5162441 9085694 9852238 6172175 1 ATAC17_1 grp1 5620945 2511455 5162441 9085694 9852238 6172175 1 ATAC18_2 grp1 5620945 2511455 5162441 9085694 9852238 6172175 1 ATAC20_4 grp2 5620945 2511455 5162441 9085694 9852238 6172175 1 ATAC22_1 grp2 5620945 2511455 5162441 9085694 9852238 6172175 1 ATAC22_2 grp2 5620945 2511455 5162441 9085694 9852238 6172175 1$genes
length gccontent
chr1_10002_10467    465  0.520430
chr1_15409_15764    355  0.602817
chr1_17404_17589    185  0.691892
chr1_21206_21446    240  0.566667
chr1_26045_26177    132  0.545455
22010 more rows ...

$offset ATAC16_1 ATAC17_1 ATAC18_2 ATAC20_4 ATAC22_1 ATAC22_2 chr1_10002_10467 1.2106529 0.40383051 1.1225094 1.44570668 1.6393788 1.0711832 chr1_15409_15764 0.1482142 0.05310531 0.6880168 0.63767323 0.9767822 0.4124740 chr1_17404_17589 0.6204587 0.02583219 0.7587918 0.61683366 0.8382115 0.2851622 chr1_21206_21446 -0.3672278 -0.67452604 0.2488282 0.22727645 0.5559019 -0.1707992 chr1_26045_26177 -0.7680765 -0.76807653 -0.0846480 0.07907179 0.3615096 -0.3383576 22010 more rows ... I could alter the count matrix earlier at the Diffbind stage but as I'd like to try different values for the read count cut off, and run several other optimisation steps using different cutoff values over a number of different datasets, I'd rather not have to run the entire script each time I want to change this. It would be much easier if I could sort this error and alter this in edgeR. Any suggestions would be greatly appreciated. edger R filtering • 660 views ADD COMMENT 3 Entering edit mode Aaron Lun ★ 27k @alun Last seen 15 hours ago The city by the bay lib.size= should be a vector, not a matrix, see ?DGEList. With your current set-up, there is no $samples$lib.size, resulting in a library size vector of length zero. This eventually results in errors with respect to matrix dimensions. Also, you don't mention what version of edgeR you're using, but the reported error message no longer exists in the current release version (3.20.9). Finally, if you're setting offsets, you should use the scaleOffset() function. This ensures that the offsets are interpretable on the same scale as log-library sizes, which is important when adding an appropriate prior count to shrink expression values to a constant value (and thus, the log-fold changes to zero). ADD COMMENT 0 Entering edit mode Hi Aaron, Thanks for the response. Yes, the lib-size was the issue. Thanks for your help. I tried updating my version of edgeR (3.18.1) but it doesn't go beyond this version. My R version is not the latest, so I'm sure it's down to that. Is there any chance you could elaborate a little on your last point? Until now, I have been using peak.list$offset <- cqn.ch1$glm.offset. This was recommended in the CQN vignette and it adds the GLM.offset matrix directly to the DGEList: An object of class "DGEList"$counts
ATAC16_1 ATAC17_1 ATAC18_2 ATAC20_4 ATAC22_1 ATAC22_2
chr1_10002_10467       58       54      126      153      143       65
chr1_15409_15764        4        4        1        2        6        6
chr1_17404_17589       28       27       34       58       64       38
chr1_21206_21446        1        1        1        1        1        1
chr1_26045_26177        1        1        1        1        1        1
22010 more rows ...

$samples group lib.size norm.factors ATAC16_1 grp1 5620945 1 ATAC17_1 grp1 2511455 1 ATAC18_2 grp1 5162441 1 ATAC20_4 grp2 9085694 1 ATAC22_1 grp2 9852238 1 ATAC22_2 grp2 6172175 1$genes
length gccontent
chr1_10002_10467    465  0.520430
chr1_15409_15764    355  0.602817
chr1_17404_17589    185  0.691892
chr1_21206_21446    240  0.566667
chr1_26045_26177    132  0.545455
22010 more rows ...

$offset ATAC16_1 ATAC17_1 ATAC18_2 ATAC20_4 ATAC22_1 ATAC22_2 chr1_10002_10467 1.2100869 0.40413134 1.12272869 1.44601091 1.6394837 1.0708546 chr1_15409_15764 0.1480778 0.05321138 0.68757220 0.63722860 0.9763376 0.4125135 chr1_17404_17589 0.6200409 0.02589410 0.75829204 0.61710300 0.8383258 0.2853416 chr1_21206_21446 -0.3676724 -0.67495521 0.24838355 0.22683182 0.5554573 -0.1712439 chr1_26045_26177 -0.7682129 -0.76821293 -0.08509263 0.07862716 0.3610650 -0.3388022 22010 more rows ... Using the scaleOffset() function as you suggest: peak.list <- scaleOffset(peak.list, offset = cqn.ch1$glm.offset)

An object of class "DGEList"
$counts ATAC16_1 ATAC17_1 ATAC18_2 ATAC20_4 ATAC22_1 ATAC22_2 chr1_10002_10467 58 54 126 153 143 65 chr1_15409_15764 4 4 1 2 6 6 chr1_17404_17589 28 27 34 58 64 38 chr1_21206_21446 1 1 1 1 1 1 chr1_26045_26177 1 1 1 1 1 1 22010 more rows ...$samples
group lib.size norm.factors
ATAC16_1  grp1  5620945            1
ATAC17_1  grp1  2511455            1
ATAC18_2  grp1  5162441            1
ATAC20_4  grp2  9085694            1
ATAC22_1  grp2  9852238            1
ATAC22_2  grp2  6172175            1

$genes length gccontent chr1_10002_10467 465 0.520430 chr1_15409_15764 355 0.602817 chr1_17404_17589 185 0.691892 chr1_21206_21446 240 0.566667 chr1_26045_26177 132 0.545455 22010 more rows ...$offset
ATAC16_1 ATAC17_1 ATAC18_2 ATAC20_4 ATAC22_1 ATAC22_2
chr1_10002_10467 15.64392 14.83796 15.55656 15.87984 16.07332 15.50469
chr1_15409_15764 15.24497 15.15010 15.78446 15.73412 16.07323 15.50940
chr1_17404_17589 15.67859 15.08444 15.81684 15.67565 15.89687 15.34389
chr1_21206_21446 15.24558 14.93829 15.86163 15.84008 16.16870 15.44200
chr1_26045_26177 15.06794 15.06794 15.75106 15.91478 16.19722 15.49735
22010 more rows ...

... continued below ...

0
Entering edit mode

... the values are different for the offset in both cases due, no doubt, to the log-library size scaling you mentioned. The part I don't understand fully is your last point: 'adding an appropriate prior count to shrink expression values to a constant value (and thus, the log-fold changes to zero)', and by extension, how this would effect the cpm call when removing peaks with low read counts. As it stands I'm currently using:

keep <- rowSums(cpm(peak.list, lib.size = depth.vector) > 2, ) >= 3

... and will replug this into the DGEList using (from page 49 in the edgeR guide):

peak.list <- peak.list[keep, , keep.lib.sizes=FALSE]


However, I notice when checking ?cpm that there are logprior.count and gene.length parameters and I'm unsure whether or not to use these in my cpm call. For example, are these new values in the offset matrix log values, meaning I should set log = TRUE, and if so what should I set the prior.count to?

1
Entering edit mode

Use of scaleOffset() doesn't affect the cpm() call, as the latter does not use the offsets at all. The purpose of scaleOffset() is to ensure that the shrinkage of the log-fold changes is done correctly (see ?predFC). We added this function a few releases ago, so it is perhaps not surprising that other packages have yet to make full use of it.

The documentation is pretty clear about the purpose of the other arguments, so just read it.

P.S. I would update R and then all Bioconductor packages. Don't expose yourself to known bugs if you can avoid it.

P.P.S. It would be remiss of me to not point out the problems with peak-calling in conjunction with using edgeR. In particular, peak calling and filtering does not constitute an independent filtering scheme, biasing the results of differential analyses with edgeR. See https://dx.doi.org/10.1093/nar/gku351 for more details. This is especially true when you have systematically different library sizes between groups, which results in asymmetry in peak calling power between your groups.

0
Entering edit mode

Thanks Aaron.

Both for your answers and advice. Your last point is interesting as we are hoping to make comparisons between groups with different library sizes.