Question

Foldchange values Limma

1

Entering edit mode

susana.amaral.teixeira ▴ 20

@susanaamaralteixeira-19915

Last seen 6.4 years ago

Hi, everyone. I have performed my in Limma package because there are nested effect interaction (random effect) in statistical model. Looking to my results the log foldchange seems to be wrong to me. Some genes have too high or too low LogFC values such as -14997.66667 with p-adjust 0.02518022, for example. We don't have experience with this package, we would like to know if this is normal.

Thanks.

limma RNAseq foldchange • 1.8k views

ADD COMMENT • link 6.4 years ago • updated 6.3 years ago susana.amaral.teixeira ▴ 20

1

Entering edit mode

What is the experimental context? What is the structure of the data? What limma code did you use? You need to add more detail to your post, read the posting guide.

ADD REPLY • link 6.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

In my project, we performed a RNAseq of pig embryos, whose mothers were supplemented (ARG) or not with arginine (CONT) on diet. We used the follow Limma code in R:

samples <- read.table("25d.txt",header=T, as.is=T)
#samples: 
#SAMPLE_ID  Mother  Tissue  Sex Treatment
#P8175-25DA-E1  P-8175  embryos M   ARG
#P8191-25DA-E5  P-8191  embryos M   ARG
#P8192-25DC-E1  P-8192  embryos M   CONT
#P8192-25DC-E5  P-8192  embryos  M  CONT
#P8206-25DC-E1  P-8206  embryos  M  CONT
#P8285-25DC-E2  P-8285  embryos M   CONT
#P8285-25DC-E5  P-8285  embryos M   CONT
#P8175-25DA-E2  P-8175  embryos F   ARG
#P8175-25DA-E5  P-8175  embryos F   ARG
#P8191-25DA-E1  P-8191  embryos F   ARG
#P8191-25DA-E2  P-8191  embryos F   ARG
#P8192-25DC-E2  P-8192  embryos F   CONT
#P8193-25DA-E1  P-8193  embryos F   ARG
#P8193-25DA-E2  P-8193  embryos F   ARG
#P8193-25DA-E5  P-8193  embryos F   ARG
#P8206-25DC-E2  P-8206  embryos  F  CONT
#P8206-25DC-E5  P-8206  embryos F   CONT
#P8285-25DC-E1  P-8285  embryos F   CONT

countf <- file.path(folder, paste0(samples$SAMPLE_ID, '_ReadsPerGene.counts'))

counts = readDGE(countf)$counts #Edge function to load the counts data

# 2. Filter weakly expressed and noninformative (e.g., non-aligned) features:
noint = rownames(counts) %in% c("N_multimapping","N_noFeature","N_ambiguous")
LogCPM <- cpm(counts, log=TRUE, prior.count=3)

write.csv(LogCPM, "log_cpm_25d.csv", quote=FALSE, row.names = TRUE)
mean(colSums(counts[!noint,])/colSums(counts))

# In edgeR, it is recommended to remove features without at least 1 read per million in n of the samples, where n is the size of the smallest group of replicates, 

keep = rowSums(LogCPM >1) >=3 #& !noint
counts = counts[keep,]

sample_id = apply(samples[,c("SAMPLE_ID", "Treatment")], 
 1, function(x) paste(na.exclude(x), collapse = ".")) 
condition = apply(samples[,c("Tissue", "Treatment")], 
 1, function(x) paste(na.exclude(x), collapse = ".")) 

dge.list <- DGEList(counts=counts)

dge.norm <- calcNormFactors(dge.list)

samples <- data.frame(SAMPLE_ID = samples$SAMPLE_ID, Mother = as.factor(samples$Mother), Tissue = as.factor(samples$Tissue), Sex = as.factor(samples$Sex), Treatment = as.factor(samples$Treatment))

design3 <- model.matrix(~0+ Treatment + Treatment:Mother , samples)
colnames(design3) <- c("TreatmentARG", "TreatmentCONT" ,  
 "TreatmentARG.P.8191", "TreatmentCONT.P.8191", "TreatmentARG.P.8192", "TreatmentCONT.P.8192",
 "TreatmentARG.P.8193", "TreatmentCONT.P.8193", "TreatmentARG.P.8206", "TreatmentCONT.P.8206", 
 "TreatmentARG.P.8285", "TreatmentCONT.P.8285")

keep.filtered <- filterByExpr(dge.list, design3)
dge <- dge.norm[keep.filtered,keep.lib.sizes=FALSE]

dge <- as.matrix(dge)

dupC <- duplicateCorrelation(object = dge, design = design3, block = samples$Mother)

fit <- lmFit(dge, design3, correlation = dupC$consensus)

cm <- makeContrasts(Treatment=TreatmentARG-TreatmentCONT,levels = design3)

fit2 <- contrasts.fit(fit, cm)
fit2.bayes <- eBayes(fit2)

tab3 <- topTable(fit2.bayes, coef = 'Treatment')

                           logFC        AveExpr     t       P.Value     adj.P.Val   B
 #ENSSSCG00000018065    -14997.66667    16273.0556  -8.162678   1.84E-06    0.02518022  -4.336885   
 #ENSSSCG00000007849    137 206.4444    6.520178    1.98E-05    0.11455024  -4.361206
 #ENSSSCG00000010817    48.66667    100.5       6.36721     2.52E-05    0.11455024  -4.364157
 #ENSSSCG00000034356    64.33333    20      6.071297    4.03E-05    0.13766863  -4.37029
 #ENSSSCG00000002620    68.66667    152.1111    5.67904     7.67E-05    0.19861386  -4.379368
 #ENSSSCG00000007914    110 237.0556    5.60247     8.72E-05    0.19861386  -4.381278
 #ENSSSCG00000018080    -2619.33333 1400.6667   -5.438972   1.15E-04    0.22444833  -4.385518
 #ENSSSCG00000039715    48.33333    112.3889    5.002999    2.45E-04    0.41790153  -4.397986
 #ENSSSCG00000003660    437.33333   1527        4.799328    3.51E-04    0.50011588  -4.40444
 #ENSSSCG00000028646    353.33333   822.1111    4.776304    3.66E-04    0.50011588  -4.405197

ADD REPLY • link updated 6.4 years ago by Gordon Smyth 52k • written 6.4 years ago by susana.amaral.teixeira ▴ 20

score 3 · Answer 1 · 2019-02-25

You're right, the very large logFC values in your analysis are not at all normal. They have arisen because you have input raw RNA-seq counts to lmFit instead of normalized log-intensity values.

It is not correct to input a matrix of RNA-seq read counts directly to lmFit. You have to first transform the counts to logCPM values using either the voom function or the cpm function. See Chapter 15 "RNA-seq Data" of the limma User's Guide.

There so many issues with your code that it would be better and quicker if I gave you a correct sequence code sequence rather than trying to comment on lots of individual things. The correct analysis is actually simpler than your code so far.

Read in data:

samples <- read.delim("25d.txt", stringsAsFactors=FALSE)
countf <- file.path(folder, paste0(samples$SAMPLE_ID, '_ReadsPerGene.counts'))
dge <-   readDGE(countf, group=samples$Treatment)
noint <- rownames(dge) %in% c("N_multimapping","N_noFeature","N_ambiguous")
dge <- dge[!noint,]

Filter and convert to logCPM:

keep <- filterByExpr(dge)
dge <- dge[keep,,keep.lib.sizes=FALSE]
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=TRUE, prior.count=2)

Random effect correlation:

design <- model.matrix(~Treatment, data=samples)
dupC <- duplicateCorrelation(logCPM, design, block=samples$Mother)
dupC$consensus

Differential expression analysis:

fit <- lmFit(logCPM, design, block=samples$Mother, correlation=dupC$consensus)
fit <- eBayes(fit, trend=TRUE, robust=TRUE)
topTable(fit, coef=2)

score 0 · Answer 2 · 2019-03-15

0

Entering edit mode

susana.amaral.teixeira ▴ 20

@susanaamaralteixeira-19915

Last seen 6.4 years ago

Thanks a lot for your attention and tips. I did your changes that you suggested and folds changes were corrects. Your support is very importante in our analises, main when used the programas for complex models. I'm very grateful!

ADD COMMENT • link 6.3 years ago susana.amaral.teixeira ▴ 20