Detect global differences in miRNA expression between tumor and normal using spike-ins
Entering edit mode
R ▴ 40
Last seen 14 months ago

I have sequences 96 samples, 48 from tumor and 48 from normal tissue. I want to check if the miRNAs in the tumor samples are globally down-regulated compared to normal samples. I was thinking of comparing the median expression of all the tumor libraries to the median of all the normal libraries. 

What kind of normalization do I need to do before I compare the two libraries? We do have spike-in RNA as controls as well. I was thinking taking the norm.factors from the spike-ins and lib.sizes from the miRNAs. Should I cpm-normalize the data?

Here is my pipeline:

#! /usr/bin env R


s <- read.delim("exprs.txt", sep="\t", check.names=FALSE)

df.mdp <- read.delim("miRNAs_expressed_all_samples_ALL.csv", sep="\t", check.names=FALSE, stringsAsFactors=FALSE)
counts <- as.matrix(df.mdp[,grep("\\d\\d\\d", colnames(df.mdp))])

sids <- as.character(read.table("config_braf.csv", stringsAsFactors=FALSE)[,3])
colnames(counts) <- sids
colnames(counts.N) <- sids
s <- s[,sids]

group <- sids
group[grep("\\dR", sids)] <- "Normal"
group[grep("\\dG", sids)] <- "Tumor"


group <- factor(group)
patient <- factor(sapply(str_split(sids, 'R|G'), function(x) x[[1]]))

d <- edgeR::DGEList(counts = round(counts), group=group)
d.s <- edgeR::DGEList(counts = s)

d$samples$lib.size <- colSums(d$counts) # recalculate size factors

## add spike-ins
d$counts <- rbind(d$counts, d.s$counts)

## use lib sizes from mirna and norm factors from spike-ins

method <- "TMM" # c("TMM","RLE","upperquartile","none")
method <- "RLE"

y <- calcNormFactors(d, method=method)
y.s <- calcNormFactors(d.s, method=method)
plot(y$samples$lib.size, y.s$samples$lib.size)
y$samples$lib.size <- y.s$samples$lib.size
y2 <- estimateDisp(y, trend.method="loess")

des <- model.matrix(~patient + group)
fit <- glmFit(y2, des)
lrt <- edgeR::glmLRT(fit)
tab <- topTags(lrt, n=Inf)@.Data[[1]]

edger limma mirna sequencing • 1.6k views
Entering edit mode
Last seen 2.1 years ago


You can use any of the methods that people use for mRNA. I use more limma:voom or DESeq2 because in general work better in my case with outliers. 

You can check the dispersion figures and MA plot of these packages, and if they look good, you don't need anything special for normalization. If something is weird you can add as a covariate in the model the log2 expression of your spike-in. As well, I guess you sequences all that samples in different RUNs? If you have data from different RUNs, I would add that as a covariate in the model as well.

What I always do is to remove very low expressed miRNAs. Like need to have > 5 counts in more than 40% of the data or similar threshold.

An example with DESeq2 assuming you have a data.frame (colData) with a column named condition and another named run (in case you have this batch effect):


dds <- DESeqDataSetFromMatrix(countData= s, colData= colData, design=~ run + condition)

dds <- DESeq(dds)

dds <- results(dds)


hope this help

Entering edit mode
Aaron Lun ★ 27k
Last seen 7 hours ago
The city by the bay

First, a bit of background. Most of the scaling methods for RNA-seq normalization, including TMM normalization, assume that most genes are not DE between samples. Any systematic differences in expression are identified as biases and are removed. This is not ideal if global changes are interesting. For example, any global decrease in miRNA expression in tumor relative to control will be treated as a bias and eliminated by normalization.

Now, this is where spike-ins come in. The quantity of spike-in RNA is (theoretically) the same between samples, so any differences in spike-in coverage can only represent bias. Scaling to eliminate this difference can remove biases without making the assumption of a majority of non-DE genes. Note that the practical performance of spike-ins is not without controversy; check out Risso et al. (2014) where they say spike-ins aren't very reliable.

As for your specific problem; you need to be very careful when using non-standard normalization factors in edgeR. These factors are computed in conjunction with the library sizes, given that the product of the two (i.e., the effective library size) is the relevant value for GLM fitting. It is not appropriate to use the normalization factors from the spike-ins with the library sizes from the miRNAs. Rather, you should use both sizes and factors from the spike-in counts:

y <- calcNormFactors(d, method=method) # following on from your code
y.s <- calcNormFactors(d.s, method=method)
y$samples$lib.size <- y.s$samples$lib.size
y$samples$norm.factors <- y.s$samples$norm.factors

Don't worry about the fact that the library sizes don't reflect the sum of counts for your miRNAs. The important thing is the effective library size, which only makes sense when you supply the same library size as that used to compute the normalization factors in the first place. Similarly, don't bother calculating CPMs, because the library sizes here have no absolute meaning with respect to the miRNA counts (i.e., you can't interpret the absolute size of the CPMs). For DE analyses, all that matters are the relative differences between samples.

P.S. I second Lorena's comment about filtering.


Edit: Alternatively, you can set the lib.size for both d and d.s to the sum of miRNA and spike-in library sizes, prior to running calcNormFactors. If the library sizes are the same for both DGEList objects, you can swap normalization factors between them without too much concern. This also gives the library size some meaning (i.e., it now represents the total number of reads sequenced across both miRNAs and spike-ins) so that the CPM values can be interpreted on an absolute scale.


Login before adding your answer.

Traffic: 242 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6