Question

edgeR DE gene calculation problem

0

Entering edit mode

jw24 • 0

@jw24-13599

Last seen 6.8 years ago

Hi, I'm puzzled by getting different deferentially expressed small RNA gene list using a normalized count matrix and a subset of it. For example, I have a small RNA count matrix consisting 410 tRNA species and 753 miRNA species from ten samples (5 treatment, 5 control). If I use the entire data matrix, for one miRNA, I got:

logFC logCPM F PValue FDR
rno-miR-741-3p -6.612987 3.204682 56.10205 7.507579e-14 4.521434e-11

However, if I use a subset (only the miRNA) of this data matrix, I got totally different results:

logFC logCPM F PValue FDR

rno-miR-741-3p -4.224581 4.180834 10.22008 0.004310712 0.1298386

Here is the normalized count:

> counts.normalized["rno-miR-741-3p", ]

365201 C120215 360157 368285 367657 49588 49589 49590 48374
rno-miR-741-3p 2 0 443 1 20 0 0 0 14
48372 47706 48371 49185 49186 49189 71766 71768 71769 72663
rno-miR-741-3p 1 0 27 7 0 6 0 0 4 2
73304
rno-miR-741-3p 0

Since the same count data is used, the only difference is the size of the count data matrix, why do I get very different results? Can anybody help to explain this? Thanks

Jason

edgeR • 1.7k views

ADD COMMENT • link updated 6.8 years ago by Aaron Lun ★ 28k • written 6.8 years ago by jw24 • 0

score 0 · Answer 1 · 2017-07-26

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 22 minutes ago

The city by the bay

edgeR works by sharing information between genes for dispersion estimation. This improves performance by stabilizing the estimates in the presence of limited replication. Obviously, if you subset the matrix before suppling it to edgeR, it won't be able to do that properly. This is probably why you end up with larger p-values. The FDR adjustment is also dependent on the p-values of other genes, so subsetting will also affect that.

Incidentally, edgeR expects raw counts, rather than "normalized counts". If you want to incorporate normalization information into the DE analysis, use offsets (via scaleOffsets) or normalization factors (via assignment to y$samples$norm.factors) instead.

ADD COMMENT • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

that makes sense! But what if you were trying to tweak the number of columns (samples in the above example)? Working off of Jason's example, what if one were to subset the number of samples to just keep the "treatment" to see what DGE genes were up/down regulated in that particular subset, would you advise against that too?

ADD REPLY • link 6.8 years ago drikaul • 0

0

Entering edit mode

Changing the samples will probably have as much effect (if not more) than changing the number of genes, given that you're now directly changing the residual d.f. in the model fit. Unless I had good reasons (e.g., if some of the samples were clearly of low quality), I would fit a model to all samples together, and then set up contrasts to test specific groups.

ADD REPLY • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks Aaron! That explains why. Now here is a related question: piRNAs make a large fraction of total small RNAs. Should they be included in the analysis? Otherwise miRNAs and tRNAs are still a subset of the total small RNAs.

ADD REPLY • link 6.8 years ago jw24 • 0

0

Entering edit mode

There are relevant two questions here, mostly unrelated to the EB shrinkage between genes. The first question is whether the biases for piRNAs are the same as those for tRNAs/miRNAs. The second is whether most piRNAs/tRNAs/miRNAs are not differentially expressed. If both of these conditions hold, then you can compute normalization factors from the entire matrix via calcNormFactors. If the first assumption does not hold, you will have to compute normalization factors for each class of genes, and construct an offset matrix (not a trivial process - basically, the offset matrix is of the same dimensions as the count matrix, with the values of each row equal to the product of the library sizes and the normalization factors for the corresponding gene). If the second assumption does not hold for any class of features, then you will not be able to use TMM normalization for those features. Instead, you will need to use spike-ins, though these have their own problems.

ADD REPLY • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Aaron, you mentioned the edgeR takes raw counts. What if I used spike-ins in my experiment and I want to normalize by spike-ins or house-keeping genes?

ADD REPLY • link 6.8 years ago jw24 • 0

0

Entering edit mode

Aaron, you mentioned to use offsets or normalization factors in your reply. So can you give me an example on how to do it using pre-computed normalization factors (such as from spike-in or house-keeping gene counts) in order to get DE genes? Thanks!

ADD REPLY • link 6.8 years ago jw24 • 0

0

Entering edit mode

If you want to normalize by spike-in transcripts, you will need to compute spike-in offsets. Do so by subsetting the count matrix to contain only the rows corresponding to spike-in transcripts, make a DGEList from the subsetted matrix, and run getOffset to obtain an offset vector. These offsets should be the log-transformed total count across spike-in transcripts in each library, and can be assigned to the DGEList of the endogenous genes with scaleOffset.

However, be aware that the use of spike-ins can be tricky for bulk RNA-seq, as their behaviour can vary depending on whether the same amount of spike-in is added per tube/per million cells/per microgram of RNA (with single-cell applications, we just add the same amount per cell). Check out http://dx.doi.org/10.1038/nbt.2931 for an example of funny spike-in behaviour.

ADD REPLY • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

You mean like these:

y <- DGEList(counts=spike.in.count.matrix);

offsets <- getOffset(y);

yy <- DEGList(counts=my.smallRNA.counts, group=my.group);
yyy <- scaleOffset(yy, offsets);
design <- model.matrix(~group);
fit <- glmQLFit(yyy,design)
qlf <- glmQLFTest(fit,coef=2)

ADD REPLY • link 6.8 years ago jw24 • 0

0

Entering edit mode

Yes, that's right.

ADD REPLY • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Well, using the code example above, I got very different result as compared with the upperquartile or TMM normalization method. What could be wrong? Alternatively, how to use the calcNormFactors method? Looks to me that scaleOffset(y, offset) does not modify y, it returns a vector of numbers. Then how to really make the scaling happen? What function to use?

ADD REPLY • link 6.8 years ago jw24 • 0

0

Entering edit mode

Yes, spike-in normalization will frequently return different results from TMM normalization, see the paper in the link. scaleOffset will return a DGEList with a value in $offset, make sure y is a DGEList.

ADD REPLY • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

I did use y as a DGEList. However, the results using spike-in is not only different, but do not make sense to me. Can you send me a work flow as how to exactly use spike-in for getting the DE genes?

ADD REPLY • link 6.7 years ago jw24 • 0

0

Entering edit mode

This code works for me, producing a DGEList with offsets.

y <- matrix(rnbinom(40,size=1,mu=100),10,4)
offset <- rnorm(4)
d <- DGEList(y)
d <- scaleOffset(d, offset)

If it's not working for you, you'll have to provide a minimum example that replicates your problem.

As for why the spike-in results do not make sense; this is not uncommon, read the paper in the link.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks. I also have a miRNA that we want to use as a house-keeping gene. I see there is a option to assign "method='none'" to the "calcNormFactors" function. Does this make sense to you if I normalize my raw counts by this house-keeping gene outside of the edgeR and supply this normalized count matrix to calcNormFactors with method = 'none'?

ADD REPLY • link 6.7 years ago jw24 • 0

0

Entering edit mode

Normalization based on a single house-keeping gene seems crazy to me. The biological and technical variability in the counts for this gene will be present in the normalization factors, which will interfere with the inferences for all other genes (offsets are treated as known, so their uncertainty is not modelled). I suspect you would need at least 50-100 house-keeping features to avoid this problem - this poses its own problems, as you need to pick the genes very carefully.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k