Question: Help analyzing RNA-seq data from FPKM values with multiple time points
0
6 months ago by
rsrzkk0
rsrzkk0 wrote:

Hello,

I was given a file that contains only FPKM values from an RNA-seq experience for genes that have been measured at different time points. I read Best DEG tool for datasets with FPKM counts? and Differential expression of RNA-seq data using limma and voom() posts about performing a differential gene expression analysis when only having FPKM values. However, as not being an expert in using the limma package, I would like to know how can I analyze such data knowing that we have a different time point please.

Here is a couple of lines from my file:

tracking_id    Total_FPKM    1hour        3hour        6hour        ctrl        gene_id        gene            Log2(FPKM)
XLOC_000003    0.0912937    0.0542082    0.070862    0.0912937    0.0348845    XLOC_000003    LOC100132062    -3.4533409
XLOC_000005    0.4593880    0.3969370    0.413311    0.4582030    0.4593880    XLOC_000005    -                -1.1222149
XLOC_000009    3.3084000    3.2661400    3.308400    3.2586100    3.1813200    XLOC_000009    LOC643837        1.7261337


So, the first step is to create the design. (The guide on 9.6.1 explains how to perform analysis on different time points and so I tried to adapt it):

filename <- read.table("filename.csv", header = T, sep = "\t", stringsAsFactors = F , check.names = F, encoding = "UTF-8")
colnames_log2fpkm <- c("Log2(1hour)", "Log2(3hour)", "Log2(6hour)", "Log2(ctrl)")
log2fpkm <- c("Log2(1hour)"=log2(filename[,3]), "Log2(3hour)"=log2(filename[,4]), "Log2(6hour)"=log2(filename[,5]), "Log2(ctrl)"=log2(filename[,6]))
filename <- cbind(filename, setNames( lapply(colnames_log2fpkm, function(x) x=log2fpkm), colnames_log2fpkm) )
filename <- filename[!is.infinite(rowSums(filename[,c(9,10,11,12)])),]

targetframe <- data.frame("Column"=c("filename[,12]", "filename[,10]", "filename[,11]", "filename[,12]"), "Targets"=c("ctrl", "1hour", "3hour", "6hour"))
lev <- c("ctrl", "1hour", "3hour", "6hour")
f <- factor(targetframe$Targets, levels=lev) design <- model.matrix(~0+f) colnames(design) <- lev  > design ctrl 1hour 3hour 6hour 1 1 0 0 0 2 0 1 0 0 3 0 0 1 0 4 0 0 0 1 attr(,"assign") [1] 1 1 1 1 attr(,"contrasts") attr(,"contrasts")$f
[1] "contr.treatment"


At this point, I am not sure how to proceed... so any help would be appreciated.

sessioninfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] limma_3.38.3

loaded via a namespace (and not attached):
[1] BiocManager_1.30.4 compiler_3.5.1     htmltools_0.3.6    tools_3.5.1        yaml_2.2.0
[6] Rcpp_1.0.0         rmarkdown_1.11     knitr_1.21         digest_0.6.18      xfun_0.4
[11] evaluate_0.12


rnaseq limma • 299 views
modified 6 months ago by Aaron Lun24k • written 6 months ago by rsrzkk0
Answer: Help analyzing RNA-seq data from FPKM values with multiple time points
2
6 months ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

You don't have replicates, so you can't use the design from 9.6.1. Assuming control is zero-time, the best you can do is to fit a spline with 2 degrees of freedom (see 9.6.2). From then on, it's a case of running eBayes and dropping the spline coefficients in topTable. Personally I would just assume a linear relationship with time to (i) get more residual d.f. for variance estimation and (ii) to improve the power of the test.

Dear Aaron,

I will come back later with the results that I have obtained.

Thanks

Hello again @Aaron Lun,

I tried following your suggested answer but I am getting an error when using the eBayes function:

No residual degrees of freedom in linear model fits


which means that my data don't have any replicates. What can I do in this case ? This is my code:

{r}

filename <- read.table("filename.csv", header = T, sep = "\t", stringsAsFactors = F , check.names = F, encoding = "UTF-8")
filename["Log2(ctrl)"] <- log2(filename$ctrl) filename["Log2(1hour)"] <- log2(filename$1hour)
filename["Log2(3hour)"] <- log2(filename$3hour) filename["Log2(6hour)"] <- log2(filename$6hour)
filename <- filename[!is.infinite(rowSums(filename[,c(9,10,11,12)])),]
rownames(filename) <- filename[,1]
filename <- filename[,c(9,10,11,12)]

targetframe <- data.frame("Column"=c("filename[,1]", "filename[,2]", "filename[,3]", "filename[,4]"), "Group"=c("ctrl", "1hour", "3hour", "6hour"), "Time"=c(0, 1, 3, 6))
lev <- c("ctrl", "1hour", "3hour", "6hour")

X <- ns(targetframe$Time, df=2) Group <- factor(targetframe$Group, levels = lev)
design <- model.matrix(~Group*X)
fit <- lmFit(filename, design)
fit <- eBayes(fit)



And when I execute lmFit function, it gives me this:

{r}

Coefficients not estimable: X1 X2 Group1hour:X1 Group3hour:X1 Group6hour:X1 Group1hour:X2 Group3hour:X2 Group6hour:X2
Warning message:
Partial NA coefficients for 22184 probe(s)



You would do well to understand the code before you try to use it. The scenario in 9.6.2 is not the same as your experimental design, it is not a copy-and-paste situation. Clue: X and Group are redundant.

Hello @Aarun Lun,

Thank you for your last message and sorry for the late reply as I am busy working on other projects. I was able to get past my issue by simply doing:

X <- ns(targetframe$Time, df=2) #Group <- factor(targetframe$Group, levels = lev)

design <- model.matrix(~X)


So, after running eBayes, I obtain the fit table which has lines as following:

    coefficients..Intercept.    coefficients.X1    coefficients.X2    df.residual    sigma    stdev.unscaled..Intercept.    stdev.unscaled.X1    stdev.unscaled.X2    Amean    s2.post    t..Intercept.    t.X1    t.X2    df.total    p.value..Intercept.    p.value.X1    p.value.X2    lods..Intercept.    lods.X1    lods.X2    F    F.p.value

XLOC_000003    -4.764523    1.9740934    0.85518825    1    0.17541590    0.8992115    2.467548    1.24896    -4.079700    0.02188396    -35.81745    5.4080313    4.628605    2.273476    0.0003537779    0.024415293    0.03399609    1.6022364    -3.499540    -3.942351    1029.6917    0.0004267146


And there are lots of columns. You mentioned I should drop the spline coefficient (columns 1 to 3) using topTable with the coeff argument. But at that point, I am really unsure about what to do next. Also, shouldn't I focus only on the p-values ? Why should I drop the spline coefficients ? Sorry for the trouble and thank you in advance.

1

One question at a time.

1. Yes. The log-fold changes corresponding to the spline coefficients are very difficult to interpret.
2. Because the null hypothesis is that there is no trend, which requires you to drop all spline coefficients to form the null model (which in this case is a design matrix containing just the intercept).

@Aaron Lun

I was requested to do a heatmap of the log2FC for the DEG. I know it is not meaningful at all but they need to have some kind of lead. They have a list of genes that seems interesting to them and they want to verify them experimentally. So, my question is:

1. If I understood you correctly, the spline coefficients corresponds to log-fold changes but which column?  Is it coefficients..Intercept ? or coefficients.X1 ? or coefficients.X2 ?

Thank you again and Merry Christmas.

1

No spline coefficient corresponds to a simple log-fold change. What do you expect? If the trend is parabolic, what kind of log-fold change would you be able to define to describe it?

If you must have a log-fold change, refit the model with a linear time relationship and report the log-fold change per day with the p-value from the model fitted with the spline. This allows you to have your cake (in terms of a p-value that accounts for non-linear trends) and eat it too (in terms of having a single metric that describes, roughly, the average direction and magnitude of the change). If you want more detail, you will need to examine the trend directly, e.g., by plotting the log-expression versus time for genes of interest.

Hello again Aaron Lun,

Thank you for your last comment. I think I got a glimpse of what you are trying to say. Putting it differently, I will use the fitted model obtained previously by taking out the P-value column because it conserves the general trend of my spline, right ?

But I want to get something clear first. Did you mean to report the log-fold change per hour instead of per day ?

Kindly, I ask you to help me out on this one seeing how more knowledgeable you are. I have no experience in using Limma, let alone all it using these types of data with only FPKM. I would learn a lot from you to be honest.

As you already figured out, I am expecting to have a dataframe with gene names, log2FC and their p-values of the control vs treatment at the end of my analysis. The ctrl is at time 0 of the treatment and the rest are at different timing like I mentioned in my original post.

Can you please guide me on how to achieve such results ?

Thank you a lot.

1
1. Yes.
2. Yes, per hour.
3. Use sort.by="none" in topTable in analyses performed with both models; take the p-value and FDR from the spline model; and use them to replace the same fields in the linear model. This gives you the desired table of results, as long as you remember that the p-value and log-fold change are derived from different models.

This is what I was able to write:

fit <- eBayes(fit)
toptableFit <- topTable(fit = fit, sort.by = "none", number = Inf, coef = 1) # corresponds to 1hour ?
toptableFit2 <- topTable(fit = fit, sort.by = "none", number = Inf, coef = 2) # corresponds to 3hour ?
toptableFit3 <- topTable(fit = fit, sort.by = "none", number = Inf, coef = 3) # corresponds to 6hour ?

colnames(toptableFit)


[1] "logFC", "AveExpr", "t", "P.Value", "adj.P.Val", "B"

LinearFit <- as.data.frame(fit[,1]) # corresponds to 1hour ?
LinearFit2 <- as.data.frame(fit[,2]) # corresponds to 3hour ?
LinearFit3 <- as.data.frame(fit[,3]) # corresponds to 6hour ?

colnames(LinearFit)


[1] "coefficients", "df.residual", "sigma", "stdev.unscaled", "Amean", "s2.post", "t", "df.total",, "p.value",, "lods", "F", "F.p.value"

Now, I need to replace the columns from the spline model in the linear model but before that, I noticed that the Log2FC does not change so the only values that need to be replaced are the pval and the FDR:

# for 1hour
LinearFit[,9] <- toptableFit[,4]
LinearFit[,12] <- toptableFit[,5]


Is this the correct way ?

1

Your code is so wrong that it's difficult to describe all the errors.

So, here's the correct code instead:

library(limma)
y <- matrix(rnorm(40000), ncol=4) # made up data.
time.points <- c(0, 1, 3, 6) # numeric, NOT FACTOR.

# First fit with splines.
X <- splines::ns(time.points, 2)
designS <- model.matrix(~X)
fitS <- lmFit(y, designS)
fitS <- eBayes(fitS)
resS <- topTable(fitS, coef=2:3, sort.by="none", n=Inf)

# Second fit with a linear trend.
designL <- model.matrix(~time.points)
fitL <- lmFit(y, designL)
fitL <- eBayes(fitL)
resL <- topTable(fitL, coef=2, sort.by="none", n=Inf)

# Creating final result table.
res <- data.frame(logFC=resL$logFC, AveExpr=resL$AveExpr,
P.Value=resS$P.Value, adj.P.Val=resS$adj.P.Val,
row.names=rownames(resS))

Thank you Aaron Lun...

I guess I really still have a long way to go. Sorry for the messed up code. Today, I was able to learn new things thanks to you. This solved my issue.

Best regards,

John