Question

Filter genes that have >= 1 cpm and are expressed in 80% of the case in the group (case=Yes, control=No)

0

Entering edit mode

fr8712ca-s • 0

@38b1deaf

Last seen 1 day ago

Sweden

Hi, I am a master student in biomedicine at Lund University in Sweden. I am using edgeR for differential expression of genes. I am a beginner in the use of edgeR.


setwd("C:/Users/Analysis")
library("edgeR")
library(tidyr)

data <- read.csv("C:/Users/Analysis/Data/Case_Control_Phenotype_Features.csv")

group <- as.factor(data$T2D) #group include individuals with T2D (YES) and non-diabetics (NO)
age <-as.numeric(data$Age_exact)
sex <-as.factor(data$Sex)
BMI <-as.numeric(gsub(",",".",data$BMI))
cd8t <- as.numeric(data$CD8T)
cd4t <- as.numeric(data$CD4T)
nk <- as.numeric(data$NK)
bcell <- as.numeric(data$Bcell)
mono <- as.numeric(data$Mono)
gran <- as.numeric(data$Gran)

# gene counts is a csv file obtained from FeatureCounts: first column contains the ENSids (from row 2), column two has in row 1 the sample ID and the rest are the counts for each ENSid. 
x <- read.csv("C:/Users/Analysis_CC/Data/case_control_gene_counts.csv")  # number of case 91 and number of control 66
head(x)
length(x)  #154
length(group) #153
y <- DGEList(counts=x,group=group) # Setting first column of `counts` as gene annotation, I got this message when I run the script. Doing that I have the same lenght 153, for both x and y.
head(y)

# count per millions: like or bigger than 1 and 73 is the 80% of the case 
keep  <- rowSums(cpm(y)>=1) >= 73  
table(keep)

#Filtration
y <- y[keep, , keep.lib.sizes=FALSE]
head(y)

#Recalculate library sizes
y$samples$lib.size <- colSums(y$counts)
y$samples

# Normalization 
y <- calcNormFactors(y)
y$samples


# Adjusting the cpm after the recalculation of the library size and the normalization
c<-cpm(y)
write.csv(c,file="C:/Users/Analysis/Results/.../testedgeRtmmoutputN_par_1_73.csv")

#To get logcpm
logcpm <- cpm(y, prior.count=2, log=TRUE)
write.csv(logcpm,file="C:/Users/Analysis/Results/.../logcpm_par.csv")

#Estimate the dispersion to model the biological and technical variability in your data
design <- model.matrix(~group+sex+age+BMI+cd8t+cd4t+nk+bcell+mono+gran)
y <- estimateDisp(y, design)

# Perform exact testing for differential expression (how do I differenziate case from control)
et <- exactTest(y)

# Differential Expression Analysis: 
write.table(topTags(et, 65000L), file = "C:/Users/Analysis/Results/outputfile.csv")
results <- read.table("C:/Users/Analysis/Results/outputfile.csv")

results$FDR<-p.adjust(results$PValue,"fdr")

write.table(results$FDR, "C:/Users/Analysis/Results/outputfile_FDR.csv")

I think that what I wrote in the line: keep <- rowSums(cpm(y)>=1) >= 73
is not what I want to do. I want to filtarte the genes that has at least 1 cpm. I also what to keep 80% of the case (in the group T2D the case have the value YES and are 91 samples ) to use in the test exactTest(y). Control are 66 samples.

My question is: how the case is identified in the group? (with a number like case 1 and control 0?) Are the case are assigned? How can I do in my code, to filter only 80% of the case to compare with the control when I perform the exactTest?

I hope it is clear what I am asking. Best regards Francesca

RBioinf • 342 views

ADD COMMENT • link updated 9 weeks ago by ATpoint ★ 4.0k • written 9 weeks ago by fr8712ca-s • 0

score 1 · Answer 1 · 2024-02-22

Especially as a beginner you should be reading and following the user guide which contains current best practices for prefiltering and DE analysis. The exact test is outdated for several years, glmQLFTest is the current standard to use, see https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf