I am analyzing an mRNA-Seq dataset using the edgeR package and testing filtering by rowSums to decide which genes to keep. I have a question about interpreting the histogram of average log2 CPM in edgeR.
I tested filtering in 4 different ways and would like to know how to interpret the plot. Basically, filterByExpr looks good; however, I am also interested in creating a model.matrix on other variables, such as treatment, severity, etc., for comparisons. How do I decide the cut-off for rowSums? What do the negative values on the x-axis signify? Should the graph look like a bell-shaped distribution?
Thank you in advance.
Best Regards, Toufiq
    dge <- DGEList(counts = Counts, remove.zeros = TRUE)
    dge$samples

    ## Either: filterByExpr
    ## Pairing and blocking is essential for the comparison, as different cells are extracted from the same subjects
    keep <- filterByExpr(dge, design)
    table(keep)

    ## (OR) Filtering to remove low counts
    keep <- rowSums(dge$counts) >= 10

    ## (OR) Filtering to remove low counts
    keep <- rowSums(dge$counts) >= 50

    ## Apply whichever filter was chosen and recompute library sizes
    dge <- dge[keep, , keep.lib.sizes=FALSE]
    dge$counts
    dim(dge$counts)

    ## Histogram of average log2 CPM after filtering
    AveLogCPM <- aveLogCPM(dge)
    hist(AveLogCPM)
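On the question about negative x-axis values: a log2 CPM is negative whenever the CPM is below 1, that is, when a gene has fewer than one read per million reads in the library. The following is a minimal base-R sketch of that arithmetic. The gene names, counts, and library size are made up for illustration, and the plain log2 used here omits the prior count that aveLogCPM adds, so the numbers will not match its output exactly.

```r
# Hypothetical counts for one sample: a low, a moderate and a high gene.
counts   <- c(low_gene = 2, mid_gene = 40, high_gene = 5000)
lib_size <- 20e6   # assumed library size of 20 million reads

# Counts per million, then a plain log2 (no prior count added).
cpm      <- counts / lib_size * 1e6
log2_cpm <- log2(cpm)

# low_gene has CPM 0.1, so its log2 CPM is negative;
# mid_gene has CPM 2, so its log2 CPM is 1.
print(round(cbind(counts, cpm, log2_cpm), 2))
```

Genes piled up at the far left of the histogram are therefore the very lowly expressed ones, which is what the filtering step is meant to remove.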
Hi Gordon Smyth, thank you for the details and suggestions.
rowSums(Counts) is easy to understand and execute. I would prefer to use filterByExpr(y, design). The only doubt I have is about the input design matrix.
For glmQLFit, I used pairing and blocking, as it was essential for the comparison because different cells are extracted from the same subjects.
To filter by expression, should I use the design_2 below?
Yes, it would be better to use design_2 for filtering even though the full matrix design_1 is used for the DE analysis. The reason is that Subject is just a blocking variable; the aim is not to compare the different Subjects to each other.
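As a rough sketch of that advice, filtering can use a design without the blocking factor while the full design is kept for the model fit. The contents of design_1 and design_2 are not shown in this thread, so the formulas, the sample layout, and the simulated counts below are assumptions made purely for illustration.

```r
library(edgeR)

set.seed(1)
# Hypothetical layout: 3 subjects, 2 cell types per subject (placeholder labels).
Subject   <- factor(rep(c("S1", "S2", "S3"), each = 2))
Cell_Type <- factor(rep(c("A", "B"), times = 3))

# Simulated counts standing in for the real count matrix.
Counts <- matrix(rnbinom(1000 * 6, mu = 20, size = 5), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("sample", 1:6)))

# Assumed forms of the two designs discussed above.
design_1 <- model.matrix(~ Subject + Cell_Type)  # full design, used for the DE analysis
design_2 <- model.matrix(~ Cell_Type)            # blocking factor dropped, used for filtering only

dge  <- DGEList(counts = Counts)
keep <- filterByExpr(dge, design = design_2)
table(keep)
dge <- dge[keep, , keep.lib.sizes = FALSE]

# design_1 would then be passed to estimateDisp() and glmQLFit() for the DE analysis.
```

Only the filtering step sees the reduced design here; the paired structure is still modelled in the fit through design_1.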
Gordon Smyth Noted.
Another question: in the same experiment I have a possibly interesting group comparison, Treatment, which I considered as independent, with 3 subjects without treatment (acting as baseline) and 3 patients who were treated.
If I create another DGEList object to filter by expression and fit the model, I assume I could use the below:
You can't arbitrarily change design matrices for the same experiment. You can't include Cell_Type for one analysis and ignore it for another. The design matrix must always include all the important factors and groups.
Anyway, I think I have already answered your original question about AveLogCPM histograms.
Gordon Smyth Sure, thank you.