Advice regarding RNAseq data with no replicates, best approach to accomplish fold-change
2
0
Entering edit mode
@stefany-solano-12399
Last seen 4.1 years ago

Hello, I'm very sorry for posting again a question regarding RNA-seq with no replicates. I've read most of the posts and still don't know what's the best way to proceed. My experimental design was: two industrial fermenters, one as control, the other as the treatment and we took samples every 24 hours for a total of 4 samples. Due to a fail on the fermenter I have replicates only for the first time points. So My first approach was to use the last two time points as one (replicates). I've used edge R, and make contrast for each pairwise comparison I want to test. My BCV is really high (2.06) and I've used glmQLFit. So I got the FC for each of the contrasts. But my main problem is that the replicates are very very different from each other so my supervisor suggest to redo the analysis by individual fermenter run (which means without replicates). Does this seems sensible? If so, I've been checking edgeR on how to proceed but as soon as I get to calculate the dispersion it fails, I get this error: "Warning message:

In estimateGLMCommonDisp.default(y = y$counts, design = design, : No residual df: setting dispersion to NA". My code: count_cols <- c("Gene","GLU24H","GLU48H","GLU72H","GLU.FA24H","GLU.FA48H","GLU.FA72H") x<-read.delim('~/Desktop/PhDProject/3rd_Year_RESULTS/Counts/counts_run1.csv',skip=0, sep=",", check.names=FALSE, colClasses='character', na.strings=c()) x[,count_cols] <- apply(x[,count_cols], 2, function(v) as.numeric(v)) counts <- x[, count_cols] keepMin <- apply(counts, 1, max) >= 1 keepCpm <- rowSums(cpm(counts)> 1) >= 1 keep <- keepMin & keepCpm x <- x[keep,] counts <- counts[keep,] counts group<-factor(c("GLU24H","GLU48H","GLU72H","GLU.FA24H","GLU.FA48H","GLU.FA72H")) y <- DGEList(counts=counts, group = group) y <- calcNormFactors(y, method="TMM") y$samples
genes<-as.numeric(x$GENE) typeof(genes) PGRA_rpkm<-rpkm(counts, log=FALSE, gene.length = genes) PGRA_rpkm2<-cbind(x$Gene,PGRA_rpkm)
plotMD(cpm(PGRA_rpkm, log=TRUE), column=6) > abline(h=0, col="red", lty=2, lwd=2)

design <- model.matrix(~0+group)
y <- estimateGLMCommonDisp(y,design)

Any help will be really useful as I'm currently stuck at this.

rnaseq edger DGE no replicates • 1.0k views
4
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 1 hour ago
The city by the bay

You don't need replicates for every time point to estimate the dispersion. If you have replicates at one time point, edgeR will use those to estimate the dispersion, i.e., you only need non-zero residual degrees of freedom in the entire model. For each gene, we model the counts using a negative binomial distribution that has the same dispersion for every sample, so it doesn't matter where the replication occurs.

But anyway, I'll move onto the data set you've presented. As it's currently set up, you don't have any residual d.f. because you have 6 coefficients to estimate from 6 samples. This means that there's no information left to estimate the dispersion after fitting of the GLM, resulting in the error from estimateGLMCommonDisp. There are two solutions; the first is to follow the advice in Section 2.11 of the user's guide, and just fit the GLM with a "sensible" dispersion value for the purpose of computing log-fold changes for your desired contrasts:

fit <- glmFit(d, design, dispersion=0.05)
res <- glmLRT(fit, contrast=makeContrasts(groupGLU.FA24H-groupGLU24H, levels=design))


You could also sacrifice some residual d.f. in your model to obtain that sensible estimate:

time <- factor(rep(c("24", "48", "72"), 2))
fermenter <- factor(rep(c("GLU", "GLU.FA"), each=3))
design0 <- model.matrix(~time+fermenter)
# ... estimate dispersion with design0.


I wouldn't put much faith in the p-values, though. The dispersion estimates will be somewhat inflated as design0 assumes that the two fermenters are mostly similar in their effect on expression over time. More importantly, testing with dispersion estimates obtained with a different model will only work with glmLRT, not glmQLFTest; and you won't be able to model gene-specific dispersion values, which will reduce the accuracy of your results. This is the cost of not having replicates - see point 3 in Section 2.11.

The second strategy is to assume linearity in the responses, which allows you to pull out some residual d.f. from your design:

time.as.numeric <- rep(c(24, 48, 72), 2)
designL <- model.matrix(~0+fermenter+time.as.numeric:fermenter)
# ... run through edgeR with designL


You can then test for differences in the gradient of the time response, which tells you if there's a difference in the time effect. Testing for differences in the intercept also tell you if there's already a difference at zero hours. If you had more samples, you could weaken the assumption of linearity by fitting a fermenter-specific spline instead, which will model a smooth curve with respect to time.

0
Entering edit mode
@stefany-solano-12399
Last seen 4.1 years ago

Hello Aaron, thank you very much!!! Ok following your nice explanation I'll proceed with the second strategy. Although I have some questions (I'm new to R and edgeR so I've learned basically using this package and looking up on internet how to do stuff, this said my understanding of how to code/interpret some things is not very good). Hence, my next questions are,

1)If I proceed with edgeR using designL, do you recommend then to use glmLRT in order to get the FC values for each of the contrast I'm looking for? (which basically are a pairwise comparison between fermenters and time).

2)I do have more samples but come from another a fermentation ran 6 months after. Biologically speaking I spect some difference... my question is, statistically speaking do you think I can use those samples on the same model to get a better fit or should I work with them separately? this has been one of my starting questions but I ended up in circles. Also I use them all together (fermentation 1 and fermentation 2) to calculate the rpkms in edgeR, does this seems sensible to you?

Stefany

0
Entering edit mode

If you want to compare specific time points with designL, you can do something like:

fermenterGLU_time.as.numeric*X + fermenterGLU -
(fermenterGLU.FA_time.as.numeric*X + fermenterGLU.FA)

... in makeContrasts (note that I've replaced the colon with an underscore). This basically asks whether the value of the line at time X is the same between fermenters. Personally, though, I don't find this very interesting; it would be more meaningful to test for differences in the gradients to determine if the effect of time is different between fermenters. For example, it is possible for the gradients to go in the opposite direction w.r.t. time, but if the lines cross at X, you would never detect DE at X.

Regarding the extra data, if you want to compare it to the two fermenter runs in your current data set, you'll have to put it in the same model. There will be clear batch effects because the two data sets were generated at different times, but if you're willing to assume that the batch effect is additive, then you can still safely compare the gradients between fermenters. However, you won't be able to directly compare expression values between fermenters, because this is confounded with the batch.

0
Entering edit mode

Ok, just to be sure I'm on the same page as you, you mean to compare between time and condition isn't it? that's what you mean by gradients? to determine wether, let's say, at 24H there's a difference between fermenter Glu and fermenter Glu.FA. Is that what you mean? That' what I'm looking for as well, clearly I didn't express myself right I'm sorry about that. Because I've already done that using the two fermentation runs and for all the comparisons I did I got no DEG, none!! this could be due, as you just said, the nature of the batch effect on the samples. That's why we decided to go back and use the data by run of fermentation and repeat the analysis.

So, getting back to my questions, I'm not sure what you mean by the gradients I'm sorry Aaron, could you explain to me again please? I do like your suggestion. Getting this data was really expensive and due to the failure our experimental design has suffered a lot...plus my lack of experience had make difficult the analysis.

0
Entering edit mode

No, we're not on the same page. I'm referring to the rate of change in expression over time (i.e., the fermenter-specific gradient of the linear regression). You can easily test whether the gradient is the same between fermenters, just do:

fermenterGLU_time.as.numeric - fermenterGLU.FA_time.as.numeric

... in makeContrasts. However, if you want to test for differences in expression between fermenters at specific time points, you'll have to use the code in my first comment above. Of course, if you have batch effects between fermenters, there's not much point in doing such a comparison, as differences between fermenters would be confounded with the batch.

I'm not sure what you've actually done to try to find DEGs, but failing to find any is not particularly strange; it happens all the time if you don't have enough power in your design, or there are low-quality samples, or there's just no DE in your system.

Finally, if you're having trouble, I would suggest consulting a local statistician or bioinformatician. The support site is good for dealing with specific problems, but more general questions require someone who understands the biology of the system, and that's hard to do over the Internet. Presumably you didn't sequence a whole bunch of stuff without having someone to analyze it.

0
Entering edit mode

Thank you very much Aaron, I'll try your suggestions now. The quality of the sequences are good and I've double check the pipeline I've used to mapped and to count the transcripts... I think the problem is the experimental design... anyway. Thank you very much! I've never asked anything before on this site and was kind of scared to do so... so very grateful with your help!!!

0
Entering edit mode

Hi Aaron, I have another question. So regarding the model you suggested, thank you it does lower the BCV and the average log expression looks much much better. The thing is I'm a bit confused about how to write the contrast (I know you gave me the code but I definitely most be doing something wrong as I get an error. My code below:

> GLU.FA24vrsGLU24<-makeContrasts(fermenterGLU.FA_time.as.numeric*X + fermenterGLU.FA-(fermenterGLU_time.as.numeric*24 + fermenterGLU ), levels = designL)
Error in makeContrasts(fermenterGLU.FA_time.as.numeric * X + fermenterGLU.FA -  :
The levels must by syntactically valid names in R, see help(make.names).  Non-valid names: fermenterGLU:time.as.numeric,fermenterGLU.FA:time.as.numeric

I've also tried:

> GLU.FA-GLU_wholeset<-makeContrasts(fermenterGLU_time.as.numeric - fermenterGLU.FA_time.as.numeric, levels=designL)
Error in makeContrasts(fermenterGLU_time.as.numeric - fermenterGLU.FA_time.as.numeric,  :
The levels must by syntactically valid names in R, see help(make.names).  Non-valid names: fermenterGLU:time.as.numeric,fermenterGLU.FA:time.as.numeric

I'm trying to get a comparison between each fermenter and the corresponding time point and between fermenters, can you spot what's wrong on the code? what am I missing or doing wrong?

Stefany

0
Entering edit mode

You need to replace the colons in the column names of design with something else, as colons are special characters when writing formulae. I've assumed that colons are replaced by underscores; check out gsub.