Majority upregulated genes in edgeR
Entering edit mode
Last seen 4.7 years ago

Hi, I am doing a paired analysis between 24 pairs of Normal and Diseased samples, using edgeR. I take the entries with FDR < 0.05. Out of a total of around 57700 genes (including the multiple isoforms), I have obtained around 11500 genes within my selected FDR. However, among these, I find that about 8000 are up-regulated in diseased condition. This shows that my data is biased towards a particular set. Furthermore, when looking at the log(fc), the upregulated genes have a max value of 7, while in case of downregulated genes, there are only a couple with values 4, and the majority around 2 and below.

I would also like to state here, that the genes that I have found differential (both up and down) do indeed seem to be correct, as there are journals in which the top differential genes (in both categories) have been reported.

I have noticed similar results when doing the analysis between various sets. In each case, my upregulated genes / transcripts/miRNAs dominate.

I would like some clarification as to whether I am doing something wrong or what the reason possibly could be for this.

I am providing the script I have run below, as well as few lines of my input file: (I have considered the genes which have a minimum of 4 reads, in either all normal or all diseased samples)

Input file:

gene_name   gene_type   length  R1_R1_71N   R2_R1_71D   R3_R1_79N   R4_R1_79D   R5_R1_85N   R6_R1_85D ...
A1BG    protein_coding  8321    113 106 96  71  97  73
A1BG-AS1    antisense   7432    115 106 95  85  90  79


raw_counts <- read.table(file.choose(),header = TRUE, sep = "\t") 
group <- factor(c("56","56","43","43","62","62","134","134","135","135","58","58","67","67","59","59","48","48","159","159","127","127","76","76" 24 pairs)) 
condition <- c("N","D","N","D","N","D","N","D","N","D","N","D","N","D","N","D","N","D","N","D","N","D","N","D" 24 pairs)
y <- DGEList(counts = raw_counts[,4:51], genes = raw_counts[,1:3], remove.zeros = TRUE)
y_norm <- calcNormFactors(y)
keep <-rowSums(y$counts)>= 96
y_norm <- y_norm[keep, ,keep.lib.sizes=FALSE]
df <- data.frame(Sample = colnames(y_norm),group,condition)
design <- model.matrix(~group + condition)
rownames(design) = colnames(y_norm)
y_norm <- estimateDisp(y_norm,design,robust=TRUE)
fit <- glmQLFit(y_norm, design)
lrt <- glmQLFTest(fit)
result <- topTags(lrt, n=197684, adjust.method="BH","none", p.value=1)
rpkm <- rpkm(y_norm)
edger differential expression genes fold change • 1.2k views
Entering edit mode

What makes you say that there are more genes up-regulated in the disease state than down-regulated? You don't show any results that would suggest that.

It's certainly not true that edgeR tends to give positive logFCs regardless of what is tested. The signs of the logFCs will in fact just reverse if you change the order of the levels of the condition factor.

Entering edit mode

What I meant was that from the results, out of all the genes with FDR < 0.05, majority genes had a negative value of log(fc), which in this case means upregulated in disease.

Very few had positive values of log(fc). (Downregulated)

I have provided the number of total differential genes, and the no of upregulated and downregulated genes, at the start of my question.

Unfortunately I cannot disclose the actual results here.

Entering edit mode
Aaron Lun ★ 28k
Last seen 7 hours ago
The city by the bay

The imbalance you describe isn't really an issue. 57700 genes (after filtering), 11500 DE, 8000 up-regulated? That's still less than 20% DE overall, with a ~2:1 up- to down-regulated ratio. Seems perfectly feasible to me, especially when dealing with irregular disease biology. Contrast this to ChIP-seq where I would be disappointed with anything less than a 19:1 ratio of up- to down-regulated regions when comparing a ChIP sample to its negative control.

Having said that, I notice a few issues in your code that may contribute to the imbalance. For starters, you ran calcNormFactors() before you filtered. If you have lots of low-abundance genes, this can introduce discreteness that interferes with accurate trimming. In addition, you set keep.lib.sizes=FALSE in the subsetting operation, which means that the library sizes in y are no longer the same library sizes that were used to compute the normalization factors. This is problematic as the factors only make sense in the context of the library sizes used during their calculation; your resulting log-fold changes will probably be shifted. The correct approach would be to filter first (with or without keep.lib.sizes=FALSE) and then run calcNormFactors().

Entering edit mode

Thanks Aaron for your suggestions.


Login before adding your answer.

Traffic: 333 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6