Question: Help analyzing RNA-seq data from FPKM values with multiple time points
0
gravatar for rsrzkk
5 weeks ago by
rsrzkk0
rsrzkk0 wrote:

Hello,

I was given a file that contains only FPKM values from an RNA-seq experience for genes that have been measured at different time points. I read Best DEG tool for datasets with FPKM counts? and Differential expression of RNA-seq data using limma and voom() posts about performing a differential gene expression analysis when only having FPKM values. However, as not being an expert in using the limma package, I would like to know how can I analyze such data knowing that we have a different time point please.

Here is a couple of lines from my file:

tracking_id    Total_FPKM    1hour        3hour        6hour        ctrl        gene_id        gene            Log2(FPKM)
XLOC_000003    0.0912937    0.0542082    0.070862    0.0912937    0.0348845    XLOC_000003    LOC100132062    -3.4533409
XLOC_000005    0.4593880    0.3969370    0.413311    0.4582030    0.4593880    XLOC_000005    -                -1.1222149
XLOC_000009    3.3084000    3.2661400    3.308400    3.2586100    3.1813200    XLOC_000009    LOC643837        1.7261337

So, the first step is to create the design. (The guide on 9.6.1 explains how to perform analysis on different time points and so I tried to adapt it):

filename <- read.table("filename.csv", header = T, sep = "\t", stringsAsFactors = F , check.names = F, encoding = "UTF-8")
colnames_log2fpkm <- c("Log2(1hour)", "Log2(3hour)", "Log2(6hour)", "Log2(ctrl)")
log2fpkm <- c("Log2(1hour)"=log2(filename[,3]), "Log2(3hour)"=log2(filename[,4]), "Log2(6hour)"=log2(filename[,5]), "Log2(ctrl)"=log2(filename[,6]))
filename <- cbind(filename, setNames( lapply(colnames_log2fpkm, function(x) x=log2fpkm), colnames_log2fpkm) )
filename <- filename[!is.infinite(rowSums(filename[,c(9,10,11,12)])),]

targetframe <- data.frame("Column"=c("filename[,12]", "filename[,10]", "filename[,11]", "filename[,12]"), "Targets"=c("ctrl", "1hour", "3hour", "6hour"))
lev <- c("ctrl", "1hour", "3hour", "6hour")
f <- factor(targetframe$Targets, levels=lev)
design <- model.matrix(~0+f)
colnames(design) <- lev

 

> design
  ctrl 1hour 3hour 6hour
1    1     0     0     0
2    0     1     0     0
3    0     0     1     0
4    0     0     0     1
attr(,"assign")
[1] 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`f`
[1] "contr.treatment"

At this point, I am not sure how to proceed... so any help would be appreciated.

 

sessioninfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] limma_3.38.3

loaded via a namespace (and not attached):
 [1] BiocManager_1.30.4 compiler_3.5.1     htmltools_0.3.6    tools_3.5.1        yaml_2.2.0        
 [6] Rcpp_1.0.0         rmarkdown_1.11     knitr_1.21         digest_0.6.18      xfun_0.4          
[11] evaluate_0.12

Thanks in advance.

 

limma rnaseq • 166 views
ADD COMMENTlink modified 5 weeks ago by Aaron Lun22k • written 5 weeks ago by rsrzkk0
Answer: Help analyzing RNA-seq data from FPKM values with multiple time points
2
gravatar for Aaron Lun
5 weeks ago by
Aaron Lun22k
Cambridge, United Kingdom
Aaron Lun22k wrote:

You don't have replicates, so you can't use the design from 9.6.1. Assuming control is zero-time, the best you can do is to fit a spline with 2 degrees of freedom (see 9.6.2). From then on, it's a case of running eBayes and dropping the spline coefficients in topTable. Personally I would just assume a linear relationship with time to (i) get more residual d.f. for variance estimation and (ii) to improve the power of the test.

ADD COMMENTlink written 5 weeks ago by Aaron Lun22k

Dear Aaron,

Thanks for your kind answer. I will try to follow your directives. Also, could you please tell me if my design is properly formatted ?

I will come back later with the results that I have obtained.

Thanks

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by rsrzkk0

Hello again @Aaron Lun,

I tried following your suggested answer but I am getting an error when using the eBayes function:

No residual degrees of freedom in linear model fits

which means that my data don't have any replicates. What can I do in this case ? This is my code:

```{r}

filename <- read.table("filename.csv", header = T, sep = "\t", stringsAsFactors = F , check.names = F, encoding = "UTF-8")
filename["Log2(ctrl)"] <- log2(filename$ctrl)
filename["Log2(1hour)"] <- log2(filename$`1hour`)
filename["Log2(3hour)"] <- log2(filename$`3hour`)
filename["Log2(6hour)"] <- log2(filename$`6hour`)
filename <- filename[!is.infinite(rowSums(filename[,c(9,10,11,12)])),]
rownames(filename) <- filename[,1]
filename <- filename[,c(9,10,11,12)]

targetframe <- data.frame("Column"=c("filename[,1]", "filename[,2]", "filename[,3]", "filename[,4]"), "Group"=c("ctrl", "1hour", "3hour", "6hour"), "Time"=c(0, 1, 3, 6))
lev <- c("ctrl", "1hour", "3hour", "6hour")

X <- ns(targetframe$Time, df=2)
Group <- factor(targetframe$Group, levels = lev)
design <- model.matrix(~Group*X)
fit <- lmFit(filename, design)
fit <- eBayes(fit)

```

And when I execute lmFit function, it gives me this:

```{r}

Coefficients not estimable: X1 X2 Group1hour:X1 Group3hour:X1 Group6hour:X1 Group1hour:X2 Group3hour:X2 Group6hour:X2 
Warning message:
Partial NA coefficients for 22184 probe(s)

```

Thanks in advance

ADD REPLYlink written 4 weeks ago by rsrzkk0

You would do well to understand the code before you try to use it. The scenario in 9.6.2 is not the same as your experimental design, it is not a copy-and-paste situation. Clue: X and Group are redundant. 

ADD REPLYlink written 4 weeks ago by Aaron Lun22k

Hello @Aarun Lun,

Thank you for your last message and sorry for the late reply as I am busy working on other projects. I was able to get past my issue by simply doing:

X <- ns(targetframe$Time, df=2)

#Group <- factor(targetframe$Group, levels = lev)

design <- model.matrix(~X)

So, after running eBayes, I obtain the fit table which has lines as following:

    coefficients..Intercept.    coefficients.X1    coefficients.X2    df.residual    sigma    stdev.unscaled..Intercept.    stdev.unscaled.X1    stdev.unscaled.X2    Amean    s2.post    t..Intercept.    t.X1    t.X2    df.total    p.value..Intercept.    p.value.X1    p.value.X2    lods..Intercept.    lods.X1    lods.X2    F    F.p.value


XLOC_000003    -4.764523    1.9740934    0.85518825    1    0.17541590    0.8992115    2.467548    1.24896    -4.079700    0.02188396    -35.81745    5.4080313    4.628605    2.273476    0.0003537779    0.024415293    0.03399609    1.6022364    -3.499540    -3.942351    1029.6917    0.0004267146

And there are lots of columns. You mentioned I should drop the spline coefficient (columns 1 to 3) using topTable with the coeff argument. But at that point, I am really unsure about what to do next. Also, shouldn't I focus only on the p-values ? Why should I drop the spline coefficients ? Sorry for the trouble and thank you in advance.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by rsrzkk0
1

One question at a time. 

  1. Yes. The log-fold changes corresponding to the spline coefficients are very difficult to interpret.
  2. Because the null hypothesis is that there is no trend, which requires you to drop all spline coefficients to form the null model (which in this case is a design matrix containing just the intercept).
ADD REPLYlink written 27 days ago by Aaron Lun22k

@Aaron Lun

I was requested to do a heatmap of the log2FC for the DEG. I know it is not meaningful at all but they need to have some kind of lead. They have a list of genes that seems interesting to them and they want to verify them experimentally. So, my question is:

1. If I understood you correctly, the spline coefficients corresponds to log-fold changes but which column?  Is it coefficients..Intercept ? or coefficients.X1 ? or coefficients.X2 ?

Thank you again and Merry Christmas.

ADD REPLYlink written 26 days ago by rsrzkk0
1

No spline coefficient corresponds to a simple log-fold change. What do you expect? If the trend is parabolic, what kind of log-fold change would you be able to define to describe it?

If you must have a log-fold change, refit the model with a linear time relationship and report the log-fold change per day with the p-value from the model fitted with the spline. This allows you to have your cake (in terms of a p-value that accounts for non-linear trends) and eat it too (in terms of having a single metric that describes, roughly, the average direction and magnitude of the change). If you want more detail, you will need to examine the trend directly, e.g., by plotting the log-expression versus time for genes of interest.

ADD REPLYlink modified 26 days ago • written 26 days ago by Aaron Lun22k

Hello again Aaron Lun,

Thank you for your last comment. I think I got a glimpse of what you are trying to say. Putting it differently, I will use the fitted model obtained previously by taking out the P-value column because it conserves the general trend of my spline, right ?

But I want to get something clear first. Did you mean to report the log-fold change per hour instead of per day ?

Kindly, I ask you to help me out on this one seeing how more knowledgeable you are. I have no experience in using Limma, let alone all it using these types of data with only FPKM. I would learn a lot from you to be honest.

As you already figured out, I am expecting to have a dataframe with gene names, log2FC and their p-values of the control vs treatment at the end of my analysis. The ctrl is at time 0 of the treatment and the rest are at different timing like I mentioned in my original post.

Can you please guide me on how to achieve such results ?

Thank you a lot.

ADD REPLYlink modified 21 days ago • written 21 days ago by rsrzkk0
1
  1. Yes.
  2. Yes, per hour.
  3. Use sort.by="none" in topTable in analyses performed with both models; take the p-value and FDR from the spline model; and use them to replace the same fields in the linear model. This gives you the desired table of results, as long as you remember that the p-value and log-fold change are derived from different models.
ADD REPLYlink written 21 days ago by Aaron Lun22k

Thank you for your reply.

This is what I was able to write:

 

fit <- eBayes(fit)
toptableFit <- topTable(fit = fit, sort.by = "none", number = Inf, coef = 1) # corresponds to 1hour ?
toptableFit2 <- topTable(fit = fit, sort.by = "none", number = Inf, coef = 2) # corresponds to 3hour ?
toptableFit3 <- topTable(fit = fit, sort.by = "none", number = Inf, coef = 3) # corresponds to 6hour ?

colnames(toptableFit)

[1] "logFC", "AveExpr", "t", "P.Value", "adj.P.Val", "B"

LinearFit <- as.data.frame(fit[,1]) # corresponds to 1hour ?
LinearFit2 <- as.data.frame(fit[,2]) # corresponds to 3hour ?
LinearFit3 <- as.data.frame(fit[,3]) # corresponds to 6hour ?

colnames(LinearFit)

[1] "coefficients", "df.residual", "sigma", "stdev.unscaled", "Amean", "s2.post", "t", "df.total",, "p.value",, "lods", "F", "F.p.value"

Now, I need to replace the columns from the spline model in the linear model but before that, I noticed that the Log2FC does not change so the only values that need to be replaced are the pval and the FDR:

# for 1hour
LinearFit[,9] <- toptableFit[,4]
LinearFit[,12] <- toptableFit[,5]

Is this the correct way ?

Thank you in advance.

ADD REPLYlink modified 13 days ago • written 13 days ago by rsrzkk0
1

Your code is so wrong that it's difficult to describe all the errors.

So, here's the correct code instead:

library(limma)
y <- matrix(rnorm(40000), ncol=4) # made up data.
time.points <- c(0, 1, 3, 6) # numeric, NOT FACTOR.

# First fit with splines.
X <- splines::ns(time.points, 2)
designS <- model.matrix(~X)
fitS <- lmFit(y, designS)
fitS <- eBayes(fitS)
resS <- topTable(fitS, coef=2:3, sort.by="none", n=Inf)

# Second fit with a linear trend.
designL <- model.matrix(~time.points)
fitL <- lmFit(y, designL)
fitL <- eBayes(fitL)
resL <- topTable(fitL, coef=2, sort.by="none", n=Inf)

# Creating final result table.
res <- data.frame(logFC=resL$logFC, AveExpr=resL$AveExpr,
    P.Value=resS$P.Value, adj.P.Val=resS$adj.P.Val,
    row.names=rownames(resS))
ADD REPLYlink modified 12 days ago • written 12 days ago by Aaron Lun22k

Thank you Aaron Lun...

I guess I really still have a long way to go. Sorry for the messed up code. Today, I was able to learn new things thanks to you. This solved my issue.

Best regards,

John

ADD REPLYlink written 12 days ago by rsrzkk0

Bump ... I need help on this please

ADD REPLYlink written 27 days ago by rsrzkk0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 117 users visited in the last hour