You're right, the very large logFC values in your analysis are not at all normal. They have arisen because you have input raw RNA-seq counts to lmFit
instead of normalized log-intensity values.
It is not correct to input a matrix of RNA-seq read counts directly to lmFit
. You have to first transform the counts to logCPM values using either the voom
function or the cpm
function. See Chapter 15 "RNA-seq Data" of the limma User's Guide.
There so many issues with your code that it would be better and quicker if I gave you a correct sequence code sequence rather than trying to comment on lots of individual things. The correct analysis is actually simpler than your code so far.
Read in data:
samples <- read.delim("25d.txt", stringsAsFactors=FALSE)
countf <- file.path(folder, paste0(samples$SAMPLE_ID, '_ReadsPerGene.counts'))
dge <- readDGE(countf, group=samples$Treatment)
noint <- rownames(dge) %in% c("N_multimapping","N_noFeature","N_ambiguous")
dge <- dge[!noint,]
Filter and convert to logCPM:
keep <- filterByExpr(dge)
dge <- dge[keep,,keep.lib.sizes=FALSE]
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=TRUE, prior.count=2)
Random effect correlation:
design <- model.matrix(~Treatment, data=samples)
dupC <- duplicateCorrelation(logCPM, design, block=samples$Mother)
dupC$consensus
Differential expression analysis:
fit <- lmFit(logCPM, design, block=samples$Mother, correlation=dupC$consensus)
fit <- eBayes(fit, trend=TRUE, robust=TRUE)
topTable(fit, coef=2)
What is the experimental context? What is the structure of the data? What limma code did you use? You need to add more detail to your post, read the posting guide.
In my project, we performed a RNAseq of pig embryos, whose mothers were supplemented (ARG) or not with arginine (CONT) on diet. We used the follow Limma code in R: