**0**wrote:

I am trying to find differential gene expression between diseased and normal individuals. I have multiple factors I need to correct for, including sex, race, and age, but once I correct for them, over 90% of my genes are significantly differentially expressed, even after FDR correction. I am sure I am doing something wrong but I can't figure out what it is. I'm using raw count RNA-seq data that looks like this:

```
P19B648.BM_36_544. S109B355.BM_36_296 S151B648.BM_36_491
ENSG00000198888 6232 1200 4325
ENSG00000198763 4559 1185 3897
ENSG00000198804 157147 25107 61023
ENSG00000198712 24954 3051 13748
ENSG00000228253 508 80 306
```

and a separate file with the patient information that looks like this:

```
disease factored_disease AOD factored_AOD sex factored_sex race factored_race
S151B648.BM_36_437 1 1 74 12 F 1 H 3
S110B355.BM_36_310 1 1 78 16 F 1 W 4
S110B355.BM_36_355 1 1 86 24 F 1 W 4
S151B648.BM_36_491 2 2 77 15 F 1 W 4
```

my code is the following, where expression data refers to the first table I've posted and patient*clinical*match refers to the second table:

```
> DGE.expression <- DGEList(expression_data)
>
> # Convert to count per million (CPM)
> DGE.expression.cpm <- cpm(DGE.expression)
>
> #Take log of CPM
> DGE.expression.lcpm <- cpm(DGE.expression, log = T)
>
> # filter out low count genes by keeping only genes with cpm > 1 in more than 20% of samples
> keep.exprs <- rowSums(DGE.expression.cpm >= 1) >= (0.2 * ncol(DGE.expression))
>
> # apply filter
> DGE.expression <- DGE.expression[keep.exprs,, keep.lib.sizes=FALSE]
> nrow(DGE.expression$counts) # number of total gene tested
[1] 15415
>
> # TMM scale normalization
> DGE.expression <- calcNormFactors(DGE.expression, method = "TMM")
>
> sample_data <- patient_clinical_match[rownames(patient_clinical_match) %in% colnames(expression_data),]
> dim(sample_data)
[1] 19 9
>
> #Grab factored values for limma
> sample_data_necessary <- sample_data%>%dplyr::select(factored_disease, factored_AOD, factored_race, factored_sex)
>
> DGE.expression$samples <- merge(DGE.expression$samples, sample_data_necessary, by="row.names")
>
> design <- model.matrix(~0+as.factor(DGE.expression$samples$factored_disease) + DGE.expression$samples$factored_AOD + DGE.expression$samples$factored_sex + DGE.expression$samples$factored_race) # create the model
>
> colnames(design) <- c("Normal", "Disease", "age", "sex", "race")
>
> contr.matrix <- makeContrasts(
+ disease_normal = Disease - Normal,
+ levels = colnames(design))
> contr.matrix
Contrasts
Levels disease_normal
Normal -1
Disease 1
age 0
sex 0
race 0
>
> v <- voom(DGE.expression, design, plot=TRUE) # voom transform the data
>
> fit <- lmFit(v, design) #fit the model
>
> fit <- eBayes(fit) # compute p-values
>
> x=topTable(fit, number = nrow(fit))
> nrow(fit)
[1] 15415
> sum(x$adj.P.Val>0.05)
[1] 973
```

**25k**• written 8 months ago by ravelarvargas •

**0**