edgeR for continuous independent variables
2
0
Entering edit mode
ycding ▴ 10
@ycding-7496
Last seen 12 months ago
United States

Hi edgeR Users,

I have employed edgeR to analyze RNAseq data before. Those data had two groups, one is control and one is treatment. However, now I need to investigate effect of chemical exposure on gene expression, so chemical exposure is a continuous variable, for example, from 30 ng/ml to 1000 ng/ml. I have several questions and hope to get help from you!

1.  for the code, y <- DGEList(counts=x,group=group), I can just use y <- DGEList(counts=x) since no group in my data set, right?

2. there are 320 samples sequenced in four different time since I submitted those samples in four different month. after filtering lowly expressed genes, doing TMM normalization, I will use PCA to check batch effect related to the four different sample submission ; if obvious batch effect observed, I will use combat in sva package to adjust batch effect.  according to my experience, there are some negative values for counted values after batch removal, do I need to add the absolute value of the minimal negative values to the whole data set before running edgeR?

3.  For design matrix, I plan to use:  design <- model.matrix(~ chemical_Concentration + age + race) , both chemical concentration and age are continuous variable, race is a categorical variable.  is this design code correct?

4. then I plan to use robust =TRUE in disp <- estimateDisp(y, design, robust = TRUE) and fit <- glmFit(disp, design, robust = TRUE) as I am analyzing miRNA data, some miRNA would be highly expressed as outliers, so robust =TRUE could minimize the predominant influence from those outlying expression, is my understanding correct?

5. lastly, I would use lrt <- glmLRT(fit, coef = 2) and topTags(lrt, n=Inf, adjust.method = "BH", p.value = 0.05).

in my design, I am interested in the coefficient for chemical concentration (log2 transformed), so coef=2 is correct, right?

6. also I am not very sure about the workflow, so first filtering lowly expressed miRNA, then doing TMM, then adjust batch effect;  or I need to do batch effect before doing filtering and TMM?

Thank you very much in advance!!

Ding

edger • 1.9k views
1
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 1 hour ago
The city by the bay
1. Yes.
2. You should be supplying the raw counts to edgeR, and these should always be positive. If there are unknown factors of variation, sva (or RUVseq, or friends) will report these for inclusion into the design matrix. You should not be using any form of corrected values as input into edgeR functions.
3. That seems sensible, assuming no interaction between concentration and the other two variables.
4. Yes, but use glmQLFit with robust=TRUE instead of glmFit. This uses the QL framework, which provides more accurate type I error control. See the workflow here for an example.
5. Use glmQLFTest instead of glmLRT. Your coefficient choice is correct.
6. Filter, TMM normalize, give normalized expression values (from cpm) to sva or RUVseq or whatever, and if it's necessary, cbind the resulting factors to your design matrix for fitting with edgeR.
0
Entering edit mode

I agree that sva or RUVseq could be used to search for hidden batch effects, but ycding is only asking about a known batch factor (month) in Question 2. To handle a known factor like that, I think just including it in the design matrix would be simple and good.

0
Entering edit mode
@gordon-smyth
Last seen 25 minutes ago
WEHI, Melbourne, Australia

Question 2: It is not correct or necessary to use the sva package here. Not correct because sva is not designed for use with counts. Not necessary because you know what the batches are already. If you find that there is a batch effect, just include batch as a factor in the design matrix, then edgeR will make the correction as part of the linear model fit.

You might re-visit what you have done for similar analyses in the past, because your remark about getting negative counts is worrying. You should not be applying any batch correction to the counts and you should never have negative counts.

0
Entering edit mode

Hi Dr. Smyth, thank you very much for your helpful reply.  I got it that batch variables should be included in the design matrix.  the RNAseq data I analyzed before were generated from cell line treatment studies, including 10 to 30 samples (control and treatment with duplicates), small data sets, so sequencing done in one batch, no batch effect issue. however, I analyzed microarray data a few years ago, they were a few to several hundreds tumor samples, then batch effects were adjusted using sva before running limma, I have impression that negative values might be generated in batch removed data, so I guess negative values might be an issue if doing batch correction for RNAseq counts after TMM normalization.

0
Entering edit mode

It is not correct to batch correct expression data before using limma. As always, the batch effects or surrogate variables need to be included in the linear model rather than "removed" ahead of time. See for example: remove microarray batch effects using Limma

The problem isn't with negative values though. Microarray expression values are on the log-scale so negative values are quite normal and not a concern.

0
Entering edit mode

I wanted to generate scatter plots or boxplots to show differential expression between different comparing groups, so I thought I need to remove the batch effect first before generating those plots.

I read the link remove microarray batch effects using Limma,  in the link, no other treatment factors were included in the batch removal process.  in the following batch removal using combat in sva package, I have included the two factors that I am interested, agegroup and tissue, in the batch removal, so avoiding removal of features I am interested. also, for each batch, samples are randomized, so avoiding confounding effect related to the two factors I am interested in.

I will redo the following analysis by including the batch factor in the design matrix.  before doing that, can you do me a great favor to take a look at my code below?  we measured expression for paired tumor and normal samples for young group (< 45 years) and old group (>70 years), respectively; and want to identify genes differentially expressed between young and old.

#3.  to perform batch removal by combat program in sva package
library(sva)

batch = sample98$sample_batch trts = factor(paste(sample98$agegroup, sample98$tissuetype, sep=".")) mod = model.matrix(~trts) combat1 = ComBat(dat=array98normalized, batch=batch, mod=mod, par.prior=TRUE) #export batchremoved data #write.csv (combat1, "array98_quantilenormalized_batchremoved-by-combat.csv") #to filter off genes with little variation using std values and genes with hybridization problem using mean values # to remove genes with std at bottom 25% or mean value at bottom 25%. combat1=as.data.frame(combat1) std=rowSds(combat1) mean=rowMeans(combat1) summary(std) summary(mean) combat1=cbind(combat1, std, mean) combat1_3Q=combat1[which(combat1$std > summary(std)[2] & combat1$mean > summary(mean)[2]), -c(99,100)] #write.csv(combat1_3Q, "normalized_batchremoved_3Q_06172015.csv") combat1_3Q_trans <- t(combat1_3Q) #write.csv(combat1_3Q_trans, "normalized_batchremoved_3Q_trans_06172015.csv") #4. to use duplicationCorrelation Function in limma, treat pair id (surgicalno) as a random effect. treatments=factor(paste(sample98$agegroup, sample98$tissuetype, sep=".")) design=model.matrix(~0 +treatments) colnames(design) = levels(treatments) corfit=duplicateCorrelation(combat1_3Q, design, block=sample98$surgicalno)
corfit$consensus #then this inter-subject correlation is input into the linear model fit fit=lmFit(combat1_3Q, design, block=sample98$surgicalno, correlation=corfit\$consensus)

#5.  to make contrast and find DEGs for all comparisons
contrast=makeContrasts (
oldtumor_to_oldnormal = old.tumor - old.normal,
youngtumor_to_youngnormal = young.tumor - young.normal,
youngnormal_oldnormal=young.normal - old.normal,
youngtumor_oldtumor =young.tumor - old.tumor,
agegroup_difference=(young.tumor - young.normal)- (old.tumor -old.normal),
levels=design)

# then to compute these contrasts and moderated t test
fit2=contrasts.fit(fit, contrast)
fit2=eBayes(fit2)

#5.1 DEGs_youngtumor_to_youngnormal
DEGs_youngtumor_to_youngnormal =topTable(fit2, coef="youngtumor_to_youngnormal", n=14022)

0
Entering edit mode

I don't have time to go through your code in detail, but it isn't correct to filter by standard deviation. You should only filter by mean.