I am looking to remove all samples that failed to sequence, as indicated by a low lib.size, from a DGEList using filterByExpr. I want to remove all samples that have lib.size less than 3000.000 (side question: why the three decimal places?).
I have manipulated the User Guide base code in a variety of intuitive ways without success.
y <- DGEList(counts=x, group=group)
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes=FALSE]
I wish I could post an image of my DGEList but images are not allowed, so I'll just have to describe it. It contains counts: double [48279 x 78], and then a samples component: list [78 x 3] holding group: factor, lib.size: double [78], and norm.factors: double [78].
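Since I can't post the image, here is roughly how I am inspecting the object (a minimal sketch, assuming the object is called y as in the code above; dim() and the $samples data.frame are the standard edgeR accessors):

library(edgeR)
dim(y)                          # 48279 genes x 78 samples
head(y$samples)                 # data.frame with group, lib.size, norm.factors
sum(y$samples$lib.size < 3000)  # how many samples fall below the cutoff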
I appreciate any suggestions you might have. Thank you.
Thank you so much for the explanation.
I tried both answers and there is a difference between
y1 <- y[y$samples$lib.size > 3e3, ]
vs
y2 <- y[, y$samples$lib.size > 3e3]
The first (left) only edits the samples in the matrix, while the second (right) edits both counts and samples. I have to assume that the second solution will work best when using other functions with the matrix.
In the first situation you are subsetting the samples based on the library size. In the second situation you are subsetting the genes based on the library size, which makes no sense (and no, it doesn't subset both samples and counts, whatever you mean by that; it only subsets the number of genes).
I see your point, and I don't claim to understand the mechanics. Maybe you can also help interpret the DGEList output for y1 & y2? Here it is.
y1 appears to have exactly the same samples as the parent DGEList, just with reduced counts.
y2 has the number of good samples (32) that I expected to see.
(Yes, this means that 46 samples failed sequencing, which we knew from the FastQC. Luckily the core experiment didn't fail.)
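For reference, the numbers I am describing come from checks along these lines (a sketch, assuming y, y1 and y2 as defined above; showing only the commands, not the full printed output):

dim(y)                # parent object: 48279 x 78
dim(y1)               # still 78 samples, but fewer rows
dim(y2)               # 32 samples remaining
nrow(y2$samples)      # 32
y2$samples$lib.size   # all above 3000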
Ugh. See below where Gordon corrected me. I got it backwards. You can think of a DGEList as being a matrix in some sense (at least when subsetting using [), and the matrix has samples in its columns and genes in its rows. In R's subsetting paradigm, it's [rows, columns]. So yes, your y2 object has removed all the samples with low library size, and it's the subsetting for y1 that doesn't make any sense. There are two things going on there: first, you are subsetting the rows based on column information, which is the part that doesn't make sense. In addition, if you subset using a boolean vector, what happens is that the vector is recycled to be the length of the thing you are subsetting. So you aren't getting what you might think you should be getting anyway. As an example,
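(The original example isn't reproduced here; the following is a minimal sketch of the kind of thing meant, using a hypothetical matrix m and logical vector keep.)

m <- matrix(rnorm(1000 * 4), nrow = 1000, ncol = 4)   # hypothetical 1000-row matrix
keep <- c(TRUE, FALSE, TRUE, FALSE)                   # length-4 boolean, half TRUE
dim(m[keep, ])                                        # keep is silently recycled over the 1000 rows
## [1] 500   4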
So subsetting a 1000-row matrix with a length-four boolean results in 500 rows, because half of the subsetting vector is TRUE, and the vector is recycled 250 times in order to match the number of rows. This is a gotcha that you have to keep in mind when using R.
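Putting this together for the original question, a minimal sketch of the intended workflow (assuming x, group and the 3000 cutoff from the original post; it just combines the column subsetting shown above with the User Guide filtering code):

library(edgeR)
y <- DGEList(counts = x, group = group)
y <- y[, y$samples$lib.size > 3000]      # drop failed samples (columns); same as > 3e3
keep <- filterByExpr(y)                  # then flag lowly expressed genes (rows)
y <- y[keep, , keep.lib.sizes = FALSE]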