Question: CPM output from edgeR for a downstream application
0
4.4 years ago by
ycding10
United States
ycding10 wrote:

Dear edgeR users,

I am not an experienced R user.  I ran the edgeR for the TCGA RNAseq  data using raw count from the Rsubread featureCounts  and the TCGA miRNAseq data using raw count from TCGA level3 data to identify differential expressed genes, see the codes below;  now I want to examine miRNA and mRNA expresion interaction, Should I use the Count per Million (CPM) from edgeR as input data for the interaction prediction?

thank you very much!!

Yuanchun Ding

---------------------------------------------------------------------------------------------------------------------------------

library(edgeR)
library(limma)
library(splines)

#create treatment factor with old group as reference group and then create batchid factor
Treat <- factor(group$agegroup) Treat <- relevel(Treat, ref="old") batch <- factor(group$batchid)

#filter genes to include genes with at least two counts per million in at least three samples
keep=rowSums(cpm(counts) >2) >= 3
counts_final=counts[keep,]
table(keep)

#create a DGEList and apply TMM normalization
y=DGEList(counts=counts_final, group=Treat)
y=calcNormFactors(y)

#to create design matrix using batch as a block factor,so compare young to old within each batch
design=model.matrix(~batch + Treat)
rownames(design)=colnames(y)

# to estimate dispersion and plot BCV
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

plotBCV(y)

# to detrmine differential expresssed genes; by default, the test is for the last
# coefficient in the desgin matrix, which is the young to old treatment effect.
fit=glmFit(y, design)
lrt=glmLRT(fit)
DEGs_all=topTags(lrt,n=15223)
write.csv(DEGs_all, "DEGs_all_052715.csv")

#subset genes with FDR < 0.05 and abs(logFC)>1
DEGs_FDR05 = subset(as.data.frame(DEGs_all), FDR < 0.05)
DEGs_FDR05_logFC1=subset(DEGs_FDR05, abs(logFC)>1)
genes_FDR05=rownames(DEGs_FDR05)

write.csv(DEGs_FDR05_logFC1, "DEGs_FDR05_FC2.csv")

#summarize total nubmer of differential expressed genes at 5% FDR
dt=decideTestsDGE(lrt)
summary(dt)

#to pick out DEGs
isDE=as.logical(dt)
DEnames=rownames(y)[isDE]

# to plot all the logFCs against average count size, highlighting DEGs
plotSmear(lrt, de.tags=DEnames)
abline(h=c(-1,1), col="blue")

# to export individual counts-per-million for all 15223 genes
top15223=rownames(topTags(lrt, n=15223))
CPM_15223genes=cpm(y)[top15223,]
write.csv(CPM_15223genes_trans, "CPM_15223genes.csv")

--------------------------------------------------------------------------------------------------------------------------------------

edger cpm • 2.2k views
modified 4.4 years ago • written 4.4 years ago by ycding10
Answer: CPM output from edgeR for a downstream application
0
4.4 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

If you're trying to determine if differential expression of miRNAs is related to differential expression of mRNAs, it would make more sense to compare the relevant log-fold changes across your DE comparison of interest. For example, you'd expect that upregulation of a miRNA would have a repressive effect on its target mRNAs.

As to the specifics; that depends on how you intend to predict an interaction between miRNAs and RNAs. If you already have a table of miRNA-mRNA relationships, then you can treat all genes regulated by a given miRNA as a gene set. You could then use roast to examine the change in mRNA expression across the set for your given DE comparison. Alternatively, you can just intersect the DE miRNAs with the DE mRNAs to identify likely miRNA-target relationships.

If you don't have this table, then the problem becomes harder, because you're trying to predict targets from the sign of the log-fold change alone. I can't imagine that this would be particularly informative because you're not using the mRNA/miRNA sequence to obtain potential binding sites. I guess there would be other software to do this for you, but the sensibility of feeding in log-fold changes or CPMs would depend on the specific program.

Answer: CPM output from edgeR for a downstream application
0
4.4 years ago by
ycding10
United States
ycding10 wrote:

Hi Aaron,

I am going to use the integrative analysis method developed by Dr. Iterson etal published in Nuclear Acid Research( integerative analysis of microarray and mRNA expression);  yes, the integrative idea is similar to what you suggested, first use target prediction tools to generate a list of predicted mRNA targets for a miRNA and then use a gene set test to detect consistent effect of the miRNA on the joint expression of multiple targets.  in the paper, microarray data were used; for microarray data, after background correction, log2 transformaiton, quantile normalization and batch effect removal by combat, then the data are ready for testing miRNA and mRNA interaction.  however, for RNAseq data, I am not sure CPMs or logCPM from edgeR are good input data for running the interaction as edgeR manual indicates that for most part, edgeR does not produce any quantity that could be called a "normalized count".   from edgeR, if 200 mRNA DEGs and 100 miRNA DEGs were generated, then I guess I can just examine the interaction between those DEGs;  if CPMs are not suitable, what kind of output file I should generated from edgeR or other programs to perform the integrative analysis?

If you can get hold of the predicted target list for each miRNA (e.g., through TargetScan or something), then you can do the entire analysis within edgeR. You can use the roast approach that I've described, or you can identify putative miRNA-mRNA relationships where both the miRNA and mRNA are DE in their respective analyses (presumably in opposite directions).

If you still want to use the external method, then it seems that taking the log-CPMs would be the best strategy. This allows you to (roughly) treat the log-counts like microarray values, as with the voom method. However, note that voom also calculates precision weights based on the empirical mean-variance relationship. This provides better performance as it accounts for the variance of the log-counts during modelling. If the external method supports weighting in the linear model, I would recommend also feeding the precision weights to it.