I have sequences 96 samples, 48 from tumor and 48 from normal tissue. I want to check if the miRNAs in the tumor samples are globally down-regulated compared to normal samples. I was thinking of comparing the median expression of all the tumor libraries to the median of all the normal libraries.
What kind of normalization do I need to do before I compare the two libraries? We do have spike-in RNA as controls as well. I was thinking taking the norm.factors from the spike-ins and lib.sizes from the miRNAs. Should I cpm-normalize the data?
Here is my pipeline:
#! /usr/bin env R
library(edgeR)
library(stringr)
s <- read.delim("exprs.txt", sep="\t", check.names=FALSE)
df.mdp <- read.delim("miRNAs_expressed_all_samples_ALL.csv", sep="\t", check.names=FALSE, stringsAsFactors=FALSE)
counts <- as.matrix(df.mdp[,grep("\\d\\d\\d", colnames(df.mdp))])
sids <- as.character(read.table("config_braf.csv", stringsAsFactors=FALSE)[,3])
colnames(counts) <- sids
colnames(counts.N) <- sids
s <- s[,sids]
group <- sids
group[grep("\\dR", sids)] <- "Normal"
group[grep("\\dG", sids)] <- "Tumor"
group <- factor(group)
patient <- factor(sapply(str_split(sids, 'R|G'), function(x) x[[1]]))
d <- edgeR::DGEList(counts = round(counts), group=group)
d.s <- edgeR::DGEList(counts = s)
d$samples$lib.size <- colSums(d$counts) # recalculate size factors
## add spike-ins
d$counts <- rbind(d$counts, d.s$counts)
## use lib sizes from mirna and norm factors from spike-ins
method <- "TMM" # c("TMM","RLE","upperquartile","none")
method <- "RLE"
y <- calcNormFactors(d, method=method)
y.s <- calcNormFactors(d.s, method=method)
plot(y$samples$lib.size, y.s$samples$lib.size)
y$samples$lib.size <- y.s$samples$lib.size
y2 <- estimateDisp(y, trend.method="loess")
des <- model.matrix(~patient + group)
fit <- glmFit(y2, des)
lrt <- edgeR::glmLRT(fit)
tab <- topTags(lrt, n=Inf)@.Data[[1]]