Question: Strange distribution of logFC
0
23 months ago by
Jack0
Jack0 wrote:

Dear all,

I have RNAseq data of two groups. After doing differential gene expression using edgeR and DESeq2(the results are similar). I found the distribution of logFC is a little strange, having two peaks in the distribution.

My guess would be that most genes are equally enriched in both goups, i.e. the mode of the logFC distribution (probably Gaussian-like) is around 0.

Perhaps someone can tell you why this is so.

Code can be seen bolow:

P1<-read.table("P1.STAR-counts-stranded.txt",header = TRUE)
row.names(P1) <- P1[,1]
row.names(P2) <- P2[,1]
row.names(H3) <- H3[,1]
row.names(H4) <- H4[,1]
LIST<-list(P1,P2,H3,H4)
COL<-unique(unlist(lapply(LIST, colnames)))
ROW<-unique(unlist(lapply(LIST, rownames)))

TOTAL<-matrix(data=NA,nrow=length(ROW), ncol = length(COL),dimnames = list(ROW,COL))

for (ls in LIST){
TOTAL[rownames(ls),colnames(ls)]<-as.matrix(ls)
}
write.csv(TOTAL,file="/home/Claire/Desktop/data/STAR_6/TOTAL-PH-STAR-counts-stranded.csv")

#############################edgeR#####################################
setwd("/home/Claire/Desktop/data/STAR_6/")
library(edgeR)
HPcounts<-HPcount[,8:11]
samplenames<-c("P1","P2","H3","H4")
HPgeneid<-HPcount[,2]
rownames(HPcounts)<-HPgeneid
colnames(HPcounts) <- samplenames
HPgenes<-HPcount[,2:7]
group <- as.factor(c("P", "P", "H","H"))
DGEList <- DGEList(counts=HPcounts, group=group,genes=HPgenes)
DGEList$samples keep <- rowSums(cpm(DGEList)>0.5) >= 2 DGEList_keep <- DGEList[keep , keep.lib.sizes=FALSE] DGEList_keep_norm <- calcNormFactors(DGEList_keep) #design matrix design<-model.matrix(~0+group) colnames(design)<-levels(group) PvsH<-makeContrasts(P-H, levels = design) #Estimating the dispersion y<-estimateDisp(DGEList_keep_norm,design) fit<-glmFit(y,design) #to perform likelihood ration tests lrt<-glmLRT(fit,contrast = PvsH) lrt ###############################hist and volcanno plot######### volcanoData <- cbind(lrt$table$logFC, -log10(lrt$table$PValue)) colnames(volcanoData) <- c("logFC", "negLogPval") head(volcanoData) plot(volcanoData, pch=19) hist(lrt$table\$logFC,breaks=100)
modified 23 months ago by Aaron Lun25k • written 23 months ago by Jack0

Where is the code you used to obtain the log-fold changes and to generate the plots?

Thank you for your reminding. The codes have been added in the question.

Are you sure this is right? You construct design twice but you have no command to construct PvsH.

Yes, I contructed PvsH, sorry about the negligence. This time it is right

Well, it can't be right, because you have:

design<-model.matrix(~0+group)
PvsH<-makeContrasts(P-H, levels = design)
colnames(design)<-levels(group)

... which doesn't make sense, because the column names from model.matrix should be groupP and groupH before you rename them. So the call to makeContrasts at this location cannot be correct.

Please try running your own code in a fresh session, i.e., with no saved workspaces. Given all these errors, it is difficult to trust that the code you have shown is actually generating the output that you claim it does.

I wanted to simplify the codes, which made it worse, sorry about that... This is the code I rerun with no saved workspaces, without any changes. It creates the exact the same outputs.

1
23 months ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

You have a very strange way of constructing your CSV file; I will just assume that HPcounts actually contains the gene-wise read counts, and not some strange coercion of gene names into integers.

If this is the case, then your log-fold change histogram simply indicates that you have lots of DE genes in both directions. The surprising aspect is not the fact that you have two modes - after all, there has to be a mode somewhere for the up- and down-regulated genes - but that there are enough DE genes on either side for the modes to be visible in the distribution of log-fold changes from all genes.

Having lots of DE genes may compromise normalization, because we need to assume that most genes are not DE in order to correct for composition biases. More than 30% of genes being DE in both directions will start to cause issues with calcNormFactors. However, this shouldn't affect the significance of the top DE genes, where the log-fold changes should be so large that any inaccuracy should be negligible.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.