Hi, I'm puzzled by getting different differentially expressed small RNA gene lists when using a normalized count matrix versus a subset of it. For example, I have a small RNA count matrix consisting of 410 tRNA species and 753 miRNA species from ten samples (5 treatment, 5 control). If I use the entire data matrix, for one miRNA I get:
                   logFC   logCPM        F       PValue          FDR
rno-miR-741-3p -6.612987 3.204682 56.10205 7.507579e-14 4.521434e-11
However, if I use a subset of this data matrix (only the miRNAs), I get totally different results:
                   logFC   logCPM        F      PValue       FDR
rno-miR-741-3p -4.224581 4.180834 10.22008 0.004310712 0.1298386
Here are the normalized counts:
> counts.normalized["rno-miR-741-3p", ]
365201 C120215 360157 368285 367657 49588 49589 49590 48374
rno-miR-741-3p 2 0 443 1 20 0 0 0 14
48372 47706 48371 49185 49186 49189 71766 71768 71769 72663
rno-miR-741-3p 1 0 27 7 0 6 0 0 4 2
73304
rno-miR-741-3p 0
Since the same count data are used and the only difference is the size of the count matrix, why do I get such different results? Can anybody help explain this? Thanks
Jason
That makes sense! But what if you were trying to tweak the number of columns (samples, in the above example)? Working off Jason's example, if one were to subset the samples to keep only the "treatment" group, to see which DE genes were up/down-regulated in that particular subset, would you advise against that too?
Changing the samples will probably have as much of an effect (if not more) as changing the number of genes, given that you're now directly changing the residual d.f. in the model fit. Unless I had good reasons (e.g., if some of the samples were clearly of low quality), I would fit a model to all samples together, and then set up contrasts to test specific groups.
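For instance, a minimal sketch of that approach with the edgeR quasi-likelihood pipeline might look like this (counts and group here are hypothetical stand-ins for your full raw count matrix and sample groups):

library(edgeR)

# keep all ten samples in one DGEList
group <- factor(c(rep("control", 5), rep("treatment", 5)))
y <- DGEList(counts = counts, group = group)   # 'counts' = full raw count matrix
y <- calcNormFactors(y)

# one model fitted to all samples together
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# test a specific comparison with a contrast rather than by subsetting samples
con <- makeContrasts(treatment - control, levels = design)
qlf <- glmQLFTest(fit, contrast = con)
topTags(qlf)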
Thanks Aaron! That explains why. Now here is a related question: piRNAs make up a large fraction of total small RNAs. Should they be included in the analysis? Otherwise, miRNAs and tRNAs are still only a subset of the total small RNAs.
There are two relevant questions here, mostly unrelated to the EB shrinkage between genes. The first question is whether the biases for piRNAs are the same as those for tRNAs/miRNAs. The second is whether most piRNAs/tRNAs/miRNAs are not differentially expressed. If both of these conditions hold, then you can compute normalization factors from the entire matrix via calcNormFactors.
If the first assumption does not hold, you will have to compute normalization factors for each class of genes and construct an offset matrix (not a trivial process - basically, the offset matrix is of the same dimensions as the count matrix, with the values of each row equal to the product of the library sizes and the normalization factors for the corresponding gene). If the second assumption does not hold for any class of features, then you will not be able to use TMM normalization for those features. Instead, you will need to use spike-ins, though these have their own problems.
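In case it helps, here is a rough sketch of that offset-matrix construction (counts and rna.class are hypothetical objects: a raw count matrix and a vector labelling each row as miRNA/tRNA/piRNA; the offsets are kept on the log scale):

library(edgeR)

lib.size <- colSums(counts)
offsets <- matrix(0, nrow(counts), ncol(counts), dimnames = dimnames(counts))

# TMM factors computed within each feature class, then expanded row-wise
for (cl in unique(rna.class)) {
    keep <- rna.class == cl
    nf <- calcNormFactors(counts[keep, , drop = FALSE], lib.size = lib.size)
    offsets[keep, ] <- matrix(log(lib.size * nf), sum(keep), ncol(counts), byrow = TRUE)
}

y <- DGEList(counts = counts)
y <- scaleOffset(y, offsets)   # downstream model fitting should pick these offsets up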
Aaron, you mentioned that edgeR takes raw counts. What if I used spike-ins in my experiment and I want to normalize by spike-ins or house-keeping genes?
Aaron, you mentioned using offsets or normalization factors in your reply. So can you give me an example of how to do this using pre-computed normalization factors (such as from spike-in or house-keeping gene counts) in order to get the DE genes? Thanks!
If you want to normalize by spike-in transcripts, you will need to compute spike-in offsets. Do so by subsetting the count matrix to contain only the rows corresponding to spike-in transcripts, make a DGEList from the subsetted matrix, and run getOffset to obtain an offset vector. These offsets should be the log-transformed total count across spike-in transcripts in each library, and can be assigned to the DGEList of the endogenous genes with scaleOffset.
However, be aware that the use of spike-ins can be tricky for bulk RNA-seq, as their behaviour can vary depending on whether the same amount of spike-in is added per tube/per million cells/per microgram of RNA (with single-cell applications, we just add the same amount per cell). Check out http://dx.doi.org/10.1038/nbt.2931 for an example of funny spike-in behaviour.
You mean like this:
# compute offsets from the spike-in counts
y <- DGEList(counts = spike.in.count.matrix)
offsets <- getOffset(y)

# apply the spike-in offsets to the endogenous small RNA counts
yy <- DGEList(counts = my.smallRNA.counts, group = my.group)
yyy <- scaleOffset(yy, offsets)

# estimate dispersions, fit the model and test the treatment effect
design <- model.matrix(~my.group)
yyy <- estimateDisp(yyy, design)
fit <- glmQLFit(yyy, design)
qlf <- glmQLFTest(fit, coef = 2)
Yes, that's right.
Well, using the code example above, I got very different results compared with the upper-quartile or TMM normalization methods. What could be wrong? Alternatively, how do I use the calcNormFactors method? It looks to me like scaleOffset(y, offset) does not modify y; it returns a vector of numbers. So how do I actually make the scaling happen? What function should I use?
Yes, spike-in normalization will frequently return different results from TMM normalization; see the paper in the link.
scaleOffset will return a DGEList with a value in $offset - make sure y is a DGEList.
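For example, continuing from your code above (with edgeR loaded and the same yy and offsets objects), something like this should show what comes back:

yyy <- scaleOffset(yy, offsets)
class(yyy)    # "DGEList" - a new object is returned; yy itself is not modified in place
yyy$offset    # the stored offsets that the downstream GLM fitting should use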
I did use y as a DGEList. However, the results using spike-ins are not only different, they also do not make sense to me. Can you send me a workflow for exactly how to use spike-ins to get the DE genes?
This code works for me, producing a DGEList with offsets. If it's not working for you, you'll have to provide a minimal example that replicates your problem.
As for why the spike-in results do not make sense: this is not uncommon; read the paper in the link.
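For what it's worth, a self-contained toy run along these lines (entirely simulated counts with arbitrary sizes, just to check the mechanics) should also produce a DGEList with a populated offset matrix:

library(edgeR)
set.seed(1)
spikes <- matrix(rpois(10 * 6, lambda = 100), nrow = 10)             # 10 spike-ins, 6 libraries
genes  <- matrix(rnbinom(500 * 6, mu = 20, size = 10), nrow = 500)   # 500 endogenous features
grp    <- factor(rep(c("control", "treatment"), each = 3))

offsets <- getOffset(DGEList(counts = spikes))   # log spike-in totals per library
y <- scaleOffset(DGEList(counts = genes, group = grp), offsets)
class(y)        # "DGEList"
dim(y$offset)   # 500 x 6, one offset per feature and library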
Thanks. I also have a miRNA that we want to use as a house-keeping gene. I see there is an option to set method='none' in the calcNormFactors function. Does it make sense to you if I normalize my raw counts by this house-keeping gene outside of edgeR and then supply the normalized count matrix to calcNormFactors with method='none'?
Normalization based on a single house-keeping gene seems crazy to me. The biological and technical variability in the counts for this gene will be present in the normalization factors, which will interfere with the inferences for all other genes (offsets are treated as known, so their uncertainty is not modelled). I suspect you would need at least 50-100 house-keeping features to avoid this problem - this poses its own problems, as you need to pick the genes very carefully.
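If one did have a reasonably large set of house-keeping features, a hedged sketch of the offset-based route - analogous to the spike-in code above, with hk.counts, counts and group as hypothetical objects - might be:

library(edgeR)

hk <- DGEList(counts = hk.counts)   # counts for ~50-100 house-keeping features
hk.offsets <- getOffset(hk)         # log total house-keeping count per library

y <- DGEList(counts = counts, group = group)
y <- scaleOffset(y, hk.offsets)     # used in place of TMM/upper-quartile factors

design <- model.matrix(~group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = 2)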