Question

My linear regression model is giving me too many differentially expressed genes

0

Entering edit mode

ravelarvargas • 0

@ravelarvargas-17116

Last seen 6.6 years ago

I am trying to find differential gene expression between diseased and normal individuals. I have multiple factors I need to correct for, including sex, race, and age, but once I correct for them, over 90% of my genes are significantly differentially expressed, even after FDR correction. I am sure I am doing something wrong but I can't figure out what it is. I'm using raw count RNA-seq data that looks like this:

                      P19B648.BM_36_544.  S109B355.BM_36_296  S151B648.BM_36_491
ENSG00000198888              6232               1200               4325
ENSG00000198763              4559               1185               3897
ENSG00000198804            157147              25107              61023
ENSG00000198712             24954               3051              13748
ENSG00000228253               508                 80                306

and a separate file with the patient information that looks like this:

                     disease factored_disease AOD factored_AOD sex factored_sex race factored_race
S151B648.BM_36_437         1                1  74           12   F            1    H             3
S110B355.BM_36_310         1                1  78           16   F            1    W             4
S110B355.BM_36_355         1                1  86           24   F            1    W             4
S151B648.BM_36_491         2                2  77           15   F            1    W             4

my code is the following, where expression data refers to the first table I've posted and patientclinicalmatch refers to the second table:

> DGE.expression <- DGEList(expression_data)
> 
> # Convert to count per million (CPM)
> DGE.expression.cpm <- cpm(DGE.expression)
> 
> #Take log of CPM
> DGE.expression.lcpm <- cpm(DGE.expression, log = T)
> 
> # filter out low count genes by keeping only genes with cpm > 1 in more than 20% of samples
> keep.exprs <- rowSums(DGE.expression.cpm >= 1) >= (0.2 * ncol(DGE.expression))
> 
> # apply filter
> DGE.expression <- DGE.expression[keep.exprs,, keep.lib.sizes=FALSE]
> nrow(DGE.expression$counts)     # number of total gene tested
[1] 15415
> 
> # TMM scale normalization
> DGE.expression <- calcNormFactors(DGE.expression, method = "TMM")
> 
> sample_data <- patient_clinical_match[rownames(patient_clinical_match) %in% colnames(expression_data),]
> dim(sample_data)
[1] 19  9
> 
> #Grab factored values for limma
> sample_data_necessary <- sample_data%>%dplyr::select(factored_disease, factored_AOD, factored_race, factored_sex)
> 
> DGE.expression$samples <- merge(DGE.expression$samples, sample_data_necessary, by="row.names")
> 
> design <- model.matrix(~0+as.factor(DGE.expression$samples$factored_disease) + DGE.expression$samples$factored_AOD + DGE.expression$samples$factored_sex + DGE.expression$samples$factored_race) # create the model
> 
> colnames(design) <- c("Normal", "Disease", "age",  "sex", "race")
> 
> contr.matrix <- makeContrasts(
+   disease_normal = Disease - Normal, 
+   levels = colnames(design))
> contr.matrix
         Contrasts
Levels    disease_normal
  Normal              -1
  Disease              1
  age                  0
  sex                  0
  race                 0
> 
> v <- voom(DGE.expression, design, plot=TRUE) # voom transform the data
> 
> fit <- lmFit(v, design) #fit the model
> 
> fit <- eBayes(fit) # compute p-values
> 
> x=topTable(fit, number = nrow(fit))
> nrow(fit)
[1] 15415
> sum(x$adj.P.Val>0.05)
[1] 973

limma rna-seq differential gene expression edger linear regression • 1.8k views

ADD COMMENT • link updated 7.0 years ago by Aaron Lun ★ 29k • written 7.0 years ago by ravelarvargas • 0

score 1 · Answer 1 · 2019-02-01

1

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 7 minutes ago

The city by the bay

Well, you've got to use your contrast matrix somewhere. You're missing:

# after lmFit()
fit2 <- contrasts.fit(fit, contrast=contr.matrix)
# ... then run eBayes() on fit2.

Right now you're performing a DE analysis with coef=NULL in topTable(), which basically tests the null hypothesis that all non-intercept coefficients are equal to zero. In your case, this is equivalent to saying that the log-expression is zero in all samples under the null! Clearly this is a silly null hypothesis, and it is little wonder that it is rejected in almost all of your genes.

ADD COMMENT • link 7.0 years ago Aaron Lun ★ 29k

0

Entering edit mode

thanks for the suggestion! I've tried the following:

fit2 <- contrasts.fit(fit, contrast=contr.matrix)
fit2 <- eBayes(fit2) # compute p-values
topTable(fit2, number = nrow(fit2))

Unfortunately this leads my data to having no significant differentially expressed genes, which I don't think is correct either. I think some aspect of my model must still be wrong somewhere

ADD REPLY • link 7.0 years ago ravelarvargas • 0

0

Entering edit mode

Many people say that, mostly out of wishful thinking. But sometimes there just aren't any DE genes. Or at least, their effect sizes are not large enough to detect with your sample size - 19 patients is pretty low for human studies. If you think there should be DE genes, you should be able to back that up with some specific examples of genes that are known to be DE in the system being studied (and are not DE in your experiment). Usually, further examination of these genes can reveal further issues, e.g., batch effects or failed/misassigned samples.

I am also bemused about your use of factored_race. It seems you are treating it as a continuous covariate (based on the fact that there is only one column in the design matrix for race but there are at least four levels, based on the patient information file) - this would be wrong. Conversely, your use of factored_AOD seems to involve converting a continuous covariate into ranks, which I doubt is what you intended. I could say the same thing about disease and sex but these are less problematic as there are only two levels here anyway.

The only other suggestion would be to try robust=TRUE in the eBayes() call. This may help with improving power if you have some genes that are very highly variable.

ADD REPLY • link 7.0 years ago Aaron Lun ★ 29k

0

Entering edit mode

I've just given a subset of the data, it's actually 149 patients, I partitioned just 20 patients because I wanted a small example to post on bioconductor. In terms of why I think there should be DE genes it's because this RNA-seq data is a followup to a previous publication of microarray data for the same disease where they did find differentially-expressed genes, so I don't understand why there wouldn't be any here. Even when I take away all covariates and just partition my data into diseased and not diseased I am still not getting differentially expressed genes, and in fact my p-value gets closer to 1 as I remove covariates.

ADD REPLY • link 7.0 years ago ravelarvargas • 0

0

Entering edit mode

I don't understand why there wouldn't be any here.

Welcome to the reproducibility crisis. More seriously, take the top ~20 DE genes from their publication and examine their behaviour in your data. Are the log-fold changes completely different (i.e., close to zero) in your data? If so, this is not a problem of fine-tuning the model; your underlying data simply does not support DE. Visual plotting of these genes may also indicate other problems, e.g., sample misassignments, batch effects (if you construct a MDS plot using only those genes).

my p-value gets closer to 1 as I remove covariates

Not surprising, as you'll get inflated variance estimates if you don't account for other factors of variation. There may also be hidden factors that could be discovered with SVA or RUV to reduce the variance estimates and improve power. This is possible if you have enough samples, and 149 is better than 19.

ADD REPLY • link 7.0 years ago Aaron Lun ★ 29k