Search
Question: Strange distribution of logFC
0
gravatar for Claire
5 days ago by
Claire0
Claire0 wrote:

Dear all,

I have RNAseq data of two groups. After doing differential gene expression using edgeR and DESeq2(the results are similar). I found the distribution of logFC is a little strange, having two peaks in the distribution.

My guess would be that most genes are equally enriched in both goups, i.e. the mode of the logFC distribution (probably Gaussian-like) is around 0.

Perhaps someone can tell you why this is so.

Code can be seen bolow:

P1<-read.table("P1.STAR-counts-stranded.txt",header = TRUE)
row.names(P1) <- P1[,1]
P2<-read.table("P2.STAR-counts-stranded.txt",header = TRUE)
row.names(P2) <- P2[,1]
H3<-read.table("H3.STAR-counts-stranded.txt",header = TRUE)
row.names(H3) <- H3[,1]
H4<-read.table("H4.STAR-counts-stranded.txt",header = TRUE)
row.names(H4) <- H4[,1]
LIST<-list(P1,P2,H3,H4)
COL<-unique(unlist(lapply(LIST, colnames)))
ROW<-unique(unlist(lapply(LIST, rownames)))

TOTAL<-matrix(data=NA,nrow=length(ROW), ncol = length(COL),dimnames = list(ROW,COL))

for (ls in LIST){
  TOTAL[rownames(ls),colnames(ls)]<-as.matrix(ls)
}
write.csv(TOTAL,file="/home/Claire/Desktop/data/STAR_6/TOTAL-PH-STAR-counts-stranded.csv")

#############################edgeR#####################################
setwd("/home/Claire/Desktop/data/STAR_6/")
library(edgeR)
HPcount<-read.csv("TOTAL-PH-STAR-counts-stranded.csv",header = TRUE,sep = ",")
HPcounts<-HPcount[,8:11]
samplenames<-c("P1","P2","H3","H4")
HPgeneid<-HPcount[,2]
rownames(HPcounts)<-HPgeneid
colnames(HPcounts) <- samplenames
HPgenes<-HPcount[,2:7]
group <- as.factor(c("P", "P", "H","H"))
DGEList <- DGEList(counts=HPcounts, group=group,genes=HPgenes)
DGEList$samples
keep <- rowSums(cpm(DGEList)>0.5) >= 2
DGEList_keep <- DGEList[keep , keep.lib.sizes=FALSE]
DGEList_keep_norm <- calcNormFactors(DGEList_keep)
#design matrix
design<-model.matrix(~0+group)
colnames(design)<-levels(group)
PvsH<-makeContrasts(P-H, levels = design)

#Estimating the dispersion
y<-estimateDisp(DGEList_keep_norm,design)
fit<-glmFit(y,design)
#to perform likelihood ration tests
lrt<-glmLRT(fit,contrast = PvsH)
lrt
###############################hist and volcanno plot#########
volcanoData <- cbind(lrt$table$logFC, -log10(lrt$table$PValue))
colnames(volcanoData) <- c("logFC", "negLogPval")
head(volcanoData)

plot(volcanoData, pch=19)

hist(lrt$table$logFC,breaks=100)
ADD COMMENTlink modified 4 days ago by Aaron Lun17k • written 5 days ago by Claire0

Where is the code you used to obtain the log-fold changes and to generate the plots?

ADD REPLYlink written 5 days ago by Aaron Lun17k

Thank you for your reminding. The codes have been added in the question.

ADD REPLYlink modified 5 days ago • written 5 days ago by Claire0

Are you sure this is right? You construct design twice but you have no command to construct PvsH.

ADD REPLYlink written 5 days ago by Aaron Lun17k

Yes, I contructed PvsH, sorry about the negligence. This time it is right

ADD REPLYlink modified 5 days ago • written 5 days ago by Claire0

Well, it can't be right, because you have:

design<-model.matrix(~0+group)
PvsH<-makeContrasts(P-H, levels = design)
colnames(design)<-levels(group)

... which doesn't make sense, because the column names from model.matrix should be groupP and groupH before you rename them. So the call to makeContrasts at this location cannot be correct.

Please try running your own code in a fresh session, i.e., with no saved workspaces. Given all these errors, it is difficult to trust that the code you have shown is actually generating the output that you claim it does.

ADD REPLYlink modified 5 days ago • written 5 days ago by Aaron Lun17k

I wanted to simplify the codes, which made it worse, sorry about that... This is the code I rerun with no saved workspaces, without any changes. It creates the exact the same outputs.

ADD REPLYlink modified 5 days ago • written 5 days ago by Claire0
1
gravatar for Aaron Lun
4 days ago by
Aaron Lun17k
Cambridge, United Kingdom
Aaron Lun17k wrote:

You have a very strange way of constructing your CSV file; I will just assume that HPcounts actually contains the gene-wise read counts, and not some strange coercion of gene names into integers.

If this is the case, then your log-fold change histogram simply indicates that you have lots of DE genes in both directions. The surprising aspect is not the fact that you have two modes - after all, there has to be a mode somewhere for the up- and down-regulated genes - but that there are enough DE genes on either side for the modes to be visible in the distribution of log-fold changes from all genes.

Having lots of DE genes may compromise normalization, because we need to assume that most genes are not DE in order to correct for composition biases. More than 30% of genes being DE in both directions will start to cause issues with calcNormFactors. However, this shouldn't affect the significance of the top DE genes, where the log-fold changes should be so large that any inaccuracy should be negligible. 

ADD COMMENTlink modified 4 days ago • written 4 days ago by Aaron Lun17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 142 users visited in the last hour