Question

Can I remove failed samples using DGEList then filterByExpr?

0

Entering edit mode

James • 0

@James-24772

Last seen 2.4 years ago

United States

I am looking to remove all samples that failed to sequence indicated by low lib.size from a DGEList using filterByExpr. I want to remove all samples that has lib.size less then 3000.000 (side question, why the 3 decimal places?)

I have manipulated the User Guide base code in a variety of intuitive ways without success.

y <- DGEList(counts=x,group=group) keep <- filterByExpr(y) y <- y[keep,,keep.lib.sizes=FALSE]

I wish I could post an image of my DEGList but images are not allowed. I'll just have to describe the DEGList. It contains counts double [48279 X 78] and then a subfolder of samples - list [78 X 3], groups - factor, lib.size - double[78], norm.factors - double[78].

I appreciate any suggestions you might have. Thank you.

edgeR • 2.1k views

ADD COMMENT • link 3.2 years ago James • 0

0

Entering edit mode

Thank you so much for the explanation.
I tried both answers and there is a difference between

y1 <- y[y$samples$lib.size > 3e3,] vs y2 <- y[,y$samples$lib.size > 3e3]

The first (left) only edits the samples only in the matrix, while the second (right) edits both counts and samples. I have to assume that the second solution will work best when using other functions with the matrix.

ADD REPLY • link 3.2 years ago James • 0

0

Entering edit mode

In the first situation you are subsetting the samples based on the library size. In the second situation you are subsetting the genes based on the library size, which makes no sense (and no, it doesn't subset both samples and counts, whatever you mean by that; it only subsets the number of genes).

ADD REPLY • link 3.2 years ago James W. MacDonald 65k

0

Entering edit mode

I see your point and I don't claim to understand the mechanics. Maybe you can also help interpret the DGELlist output with y1 & y2? Here it is.

In my program the parent DGEList 
$counts = double[48279 X 78]
$samples = list [78 X 3] (S3: data.frame)


y1 results in another "Formal class DGEList"
$counts = double[19808 X 78]
$samples = list [78 X 3] (S3: data.frame)

y1 appears exactly the same samples as the parent DGEList, just reduced counts.

y2 results in "Large DGEList"
$counts = double[48279 X 32]
$samples = list [32 X 3] (S3: data.frame)

y2 has the number of good samples (32) that I expected to see.
(Yes, this means that 46 samples failed sequencing, which we knew from the fastQC. Luckily the core experiment didn't fail )

ADD REPLY • link 3.2 years ago James • 0

1

Entering edit mode

Ugh. See below where Gordon corrected me. I got it backwards. You can think of a DGEList as being a matrix in some sense (at least when subsetting using [), and the matrix has samples in its columns and genes in its rows. In R's subsetting paradigm, it's [rows,columns].

So yes, your y2 object has removed all the samples with low library size, and it's the subsetting for y1 that doesn't make any sense. There are two things going on there; first, you are subsetting the rows based on column information, which is the part that doesn't make sense. In addition, if you subset using a boolean vector, what happens is that the vector is recycled to be the length of the thing you are subsetting. So you aren't getting what you might think you should be getting anyway. As an example,

> z <- matrix(rnorm(10000), 1000)
> ind <- c(TRUE,FALSE,TRUE, FALSE)
> dim(z[ind,])
[1] 500  10
> dim(z[which(ind),])
[1]  2 10

So subsetting a 1000 row matrix with a length four boolean results in 500 rows, because half of the subsetting vector are TRUE, and the vector is recycled 200 times in order to make it match the number of rows. This is a gotcha that you have to keep in mind when using R.

ADD REPLY • link 3.2 years ago James W. MacDonald 65k

1

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 5 hours ago

United States

No, filterByExpr filters genes, not samples. I think this should be obvious from the help page?

filterByExpr               package:edgeR               R Documentation

Filter Genes By Expression Level

Description:

     Determine which genes have sufficiently large counts to be
     retained in a statistical analysis.

The only way you get library sizes like that is is your incoming data have three decimal points (or you are supplying it yourself). By default it's the colSums of your counts matrix, and that will return as many decimal points as the incoming data. Usually it's actual counts, which are integer, in which case the sums will be integer as well.

Anyway, a DGEList acts just like a matrix upon subsetting, so it's a simple matter of doing something like

y <- y[y$samples$lib.size > 3e3,]

ADD COMMENT • link 3.2 years ago James W. MacDonald 65k

0

Entering edit mode

James, I think you mean y[,y$samples$lib.size > 3e3]. (Rows=genes while columns=samples.)

ADD REPLY • link 3.2 years ago Gordon Smyth 50k

0

Entering edit mode

Yes, exactly. Thanks for pointing out my mistake!

ADD REPLY • link 3.2 years ago James W. MacDonald 65k

score 2 · Accepted Answer · 2021-02-11

2

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 5 hours ago

WEHI, Melbourne, Australia

All edgeR data objects obey row and column subsetting as if they were matrices or data.frames, so you can easily remove samples based on your criterion:

y2 <- y[, y$samples$lib.size > 3000]

ADD COMMENT • link 3.2 years ago Gordon Smyth 50k