I am trying to find differential gene expression between diseased and normal individuals. I have multiple factors I need to correct for, including sex, race, and age, but once I correct for them, over 90% of my genes are significantly differentially expressed, even after FDR correction. I am sure I am doing something wrong but I can't figure out what it is. I'm using raw count RNA-seq data that looks like this:
P19B648.BM_36_544. S109B355.BM_36_296 S151B648.BM_36_491
ENSG00000198888 6232 1200 4325
ENSG00000198763 4559 1185 3897
ENSG00000198804 157147 25107 61023
ENSG00000198712 24954 3051 13748
ENSG00000228253 508 80 306
and a separate file with the patient information that looks like this:
disease factored_disease AOD factored_AOD sex factored_sex race factored_race
S151B648.BM_36_437 1 1 74 12 F 1 H 3
S110B355.BM_36_310 1 1 78 16 F 1 W 4
S110B355.BM_36_355 1 1 86 24 F 1 W 4
S151B648.BM_36_491 2 2 77 15 F 1 W 4
my code is the following, where expression data refers to the first table I've posted and patientclinicalmatch refers to the second table:
> DGE.expression <- DGEList(expression_data)
>
> # Convert to count per million (CPM)
> DGE.expression.cpm <- cpm(DGE.expression)
>
> #Take log of CPM
> DGE.expression.lcpm <- cpm(DGE.expression, log = T)
>
> # filter out low count genes by keeping only genes with cpm > 1 in more than 20% of samples
> keep.exprs <- rowSums(DGE.expression.cpm >= 1) >= (0.2 * ncol(DGE.expression))
>
> # apply filter
> DGE.expression <- DGE.expression[keep.exprs,, keep.lib.sizes=FALSE]
> nrow(DGE.expression$counts) # number of total gene tested
[1] 15415
>
> # TMM scale normalization
> DGE.expression <- calcNormFactors(DGE.expression, method = "TMM")
>
> sample_data <- patient_clinical_match[rownames(patient_clinical_match) %in% colnames(expression_data),]
> dim(sample_data)
[1] 19 9
>
> #Grab factored values for limma
> sample_data_necessary <- sample_data%>%dplyr::select(factored_disease, factored_AOD, factored_race, factored_sex)
>
> DGE.expression$samples <- merge(DGE.expression$samples, sample_data_necessary, by="row.names")
>
> design <- model.matrix(~0+as.factor(DGE.expression$samples$factored_disease) + DGE.expression$samples$factored_AOD + DGE.expression$samples$factored_sex + DGE.expression$samples$factored_race) # create the model
>
> colnames(design) <- c("Normal", "Disease", "age", "sex", "race")
>
> contr.matrix <- makeContrasts(
+ disease_normal = Disease - Normal,
+ levels = colnames(design))
> contr.matrix
Contrasts
Levels disease_normal
Normal -1
Disease 1
age 0
sex 0
race 0
>
> v <- voom(DGE.expression, design, plot=TRUE) # voom transform the data
>
> fit <- lmFit(v, design) #fit the model
>
> fit <- eBayes(fit) # compute p-values
>
> x=topTable(fit, number = nrow(fit))
> nrow(fit)
[1] 15415
> sum(x$adj.P.Val>0.05)
[1] 973
thanks for the suggestion! I've tried the following:
Unfortunately this leads my data to having no significant differentially expressed genes, which I don't think is correct either. I think some aspect of my model must still be wrong somewhere
Many people say that, mostly out of wishful thinking. But sometimes there just aren't any DE genes. Or at least, their effect sizes are not large enough to detect with your sample size - 19 patients is pretty low for human studies. If you think there should be DE genes, you should be able to back that up with some specific examples of genes that are known to be DE in the system being studied (and are not DE in your experiment). Usually, further examination of these genes can reveal further issues, e.g., batch effects or failed/misassigned samples.
I am also bemused about your use of
factored_race
. It seems you are treating it as a continuous covariate (based on the fact that there is only one column in the design matrix forrace
but there are at least four levels, based on the patient information file) - this would be wrong. Conversely, your use offactored_AOD
seems to involve converting a continuous covariate into ranks, which I doubt is what you intended. I could say the same thing about disease and sex but these are less problematic as there are only two levels here anyway.The only other suggestion would be to try
robust=TRUE
in theeBayes()
call. This may help with improving power if you have some genes that are very highly variable.I've just given a subset of the data, it's actually 149 patients, I partitioned just 20 patients because I wanted a small example to post on bioconductor. In terms of why I think there should be DE genes it's because this RNA-seq data is a followup to a previous publication of microarray data for the same disease where they did find differentially-expressed genes, so I don't understand why there wouldn't be any here. Even when I take away all covariates and just partition my data into diseased and not diseased I am still not getting differentially expressed genes, and in fact my p-value gets closer to 1 as I remove covariates.
Welcome to the reproducibility crisis. More seriously, take the top ~20 DE genes from their publication and examine their behaviour in your data. Are the log-fold changes completely different (i.e., close to zero) in your data? If so, this is not a problem of fine-tuning the model; your underlying data simply does not support DE. Visual plotting of these genes may also indicate other problems, e.g., sample misassignments, batch effects (if you construct a MDS plot using only those genes).
Not surprising, as you'll get inflated variance estimates if you don't account for other factors of variation. There may also be hidden factors that could be discovered with SVA or RUV to reduce the variance estimates and improve power. This is possible if you have enough samples, and 149 is better than 19.