Question

How to interpret Histogram of average log2 CPM filtering in EdgeR?

0

Entering edit mode

mohammedtoufiq91 ▴ 10

@mohammedtoufiq91-17679

Last seen 4 months ago

United States

Hi,

I am analyzing mRNA-Seq dataset using EdgeR package and testing filtering by filterByExpr and rowSums that would keep genes. I have question about interpreting Histogram of average log2 CPM in EdgeR?

I tested filtering in 4 different ways, and would like to know how to interpret the plot? Basically, filterByExpr looks good, however, I am interested in creating model.matrix on other variables too like treatment, severity, etc., for comparisons. How to do I decide the cut-off in perhaps rowsums? What does the negative values in the x-axis signifies? Should the graph look like bell shaped distribution?

Thank you in advance.

Best Regards, Toufiq

dge <- DGEList(counts = Counts,  remove.zeros = TRUE)
dge$samples

# Either;
## filterByExpr
keep <- filterByExpr(dge, design). ## ## Pairing and blocking is essential for comparison as different cells are extracted from same subjects
table(keep.keep)

## (OR)
## Filtering to remove low counts
keep <- rowSums(dge$counts) >= 10

## (OR)
## Filtering to remove low counts
<- rowSums(dge$counts) >= 50

dge <- dge[keep, , keep.lib.sizes=FALSE]
dge$counts
dim(dge$counts)

AveLogCPM <- aveLogCPM(dge)
hist(AveLogCPM)

Thank you,

Toufiq

CPM filterByExpr edgeR histogram rowsums • 3.5k views

ADD COMMENT • link 3.1 years ago mohammedtoufiq91 ▴ 10

score 3 · Accepted Answer · 2022-12-07

3

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 15 hours ago

WEHI, Melbourne, Australia

In the edgeR workflow vignette we discuss three possible filtering methods:

filterByExpr(y, design)
rowSums(Counts) > 50
aveLogCPM(y)

The use of filterByExpr with default arguments is the recommended method, and I don't see a reason not to do that. The exact covariates in the design matrix makes little difference to the filtering. You should just include the main covariates and treatment factors. If you have paired design, then include the treatments and covariates in the design for filtering but leave out the subjects.

If you filter by total gene counts, then a value like 50 needs to be used. Using >= 10 is too small.

The third method is to examine the histogram and filter by AveLogCPM. For your data, you would note that the histogram is bimodal and a value like -1 or -2 would separate the modes. Therefore you could filter by

keep <- (AveLogCPM > -2)

in order to remove the mode of small values on the left.

edgeR makes no assumptions about the distribution of the AveLogCPM values, so the shape of the histogram is irrelevant. The only requirement is that there shouldn't be a left tail of very small values.

Negative AveLogCPM values simply mean that the average CPM is less than 1. There's nothing unusual or worrying about that.

ADD COMMENT • link 3.1 years ago Gordon Smyth 53k

0

Entering edit mode

Hi Gordon Smyth , thank you for the details and suggestions. rowSums(Counts) is easy to understand and execute. I like performing with filterByExpr(y, design). The only doubt I have here is about the input design matrix.

I used design_1 to estimateDisp and glmQLFit. Pairing and blocking: I used as it was essential for comparison as different cells are extracted from same subjects.

design_1 <- model.matrix(~Subject+Cell_Type)
keep <- filterByExpr(dge, design_1)

To filter by expression should I use the below design_2?

design_2 <- model.matrix(~ Cell_Type)
keep <- filterByExpr(dge, design_2)

ADD REPLY • link 3.1 years ago mohammedtoufiq91 ▴ 10

2

Entering edit mode

Yes, it would be better to use design_2 for filtering even though the full matrix design_1 is used for the DE analysis. The reason is that Subject is just a blocking variable, the aim is not to compare the different Subjects to each other.

ADD REPLY • link 3.1 years ago Gordon Smyth 53k

0

Entering edit mode

Gordon Smyth Noted.

Another question, In the same experiment I have perhaps an interesting group comparison Treatment and considered it as independent, where 3 subjects without treatment (act as baseline), and 3 patients with treated.

I create another object DGEList to filter by expression and fitting the model, I could use the below I assume:

dge.new <- DGEList(counts = Counts, group = Treatment, remove.zeros = TRUE)
design_3 <- model.matrix(~ Treatment)
keep <- filterByExpr(dge, design_3)

ADD REPLY • link 3.1 years ago mohammedtoufiq91 ▴ 10

0

Entering edit mode

You can't arbitrarily change design matrices for the same experiment. You can't include Cell_Type for one analysis and ignore it for another. The design matrix must always include all the important factors and groups.

Anyay, I think I have already answered your original question about AveLogCPM histograms.