Question

GLM model on time course analysis and removing the outliers in Deseq2

1

Entering edit mode

rahel14350 ▴ 10

@rahel14350-8472

Last seen 6.9 years ago

United States

Dear Michael,

Please accept my apologize in advance for such a lengthy question. I have a data set of 200 RNAseq samples from an experiment on two treatments(trtR/trtNR) and the sampling is done at 7 different time-points. I have different number of replicates: for trtR two replicates and for trtNR 5 replicates for each time-points (some time number of replicate is more).

I want to employ a generalized linear model that allows me to quantify the genes that are differentially expressed at each time point and also across all time points. Then a likelihood ratio test to identify differentially expressed genes. I learned from this question https://www.biostars.org/p/133304/ that if I want to have linear affect of time on gene expression, I need to introduce time as a numeric covariate rather than a factor. Therefore, here is my model:

>Time=as.numeric(sampleTable3$Timepoint)
>Trt_type=sampleTable3$Treatment_type

>coldata <- data.frame(Time, Treatment_type)
>rownames(coldata) <- colnames(txi$counts)

>ddsTxi_time <- DESeqDataSetFromTximport(txi, colData=coldata, design=~ Time + Trt_type:Time + Trt_type)

>colData(ddsTxi_time)$Time<-factor(colData(ddsTxi_time)$Time, levels=c("0", "4", "24", "72", "336", "1008","2352"))

>ddsTxi_time= ddsTxi_time[rowSums(counts(ddsTxi_time))>1,]
>dds_time <- DESeq(ddsTxi_time, test="LRT", reduced = ~ 1)
>res_time <- results(dds_time)

>colData(dds_time)
DataFrame with 63 rows and 3 columns
                         Time2 Treatment_Response2 replaceable
                     <numeric>        <factor>   <logical>
F02242_NR_72h_23        72              NR       FALSE
F02245_NR_14w_23      2352              NR       FALSE
...                    ...             ...        ...
F02382_R_14w_29       2352               R        TRUE
F02384_R_72h_29         72               R       FALSE

>resultsNames(dds_time)
[1] "Intercept"                         "Time"                              
[3] "Trt_typeR_vs_NR"             "Time.Trt_type.R"

mcols(res_time5)$description
[1] "mean of normalized counts for all samples"                                          
[2] "log2 fold change (MLE): Time.Trt_type.R"                             
[3] "standard error: Time.Trt_type.R"                                     
[4] "LRT statistic: '~ Time + Trt_type:Time + Trt_type' vs '~ 1'"
[5] "LRT p-value: '~ Time + Trt_type:Time + Trt_type' vs '~ 1'"  
[6] "BH adjusted p-values"

1: Is this model correct or should I introduce treatment type as a numeric covariate as well?
2: The L2FC is coming from Time.Treatment typeR wich it is the l2fc from the genes with lowest p-value over time effect in treatmentR. Correct?

For having the effect of time on Trt_type.NR (l2fc and p-value), I did:

>results(dds_time, contrast=list(c("Time","Trt_typeR_vs_NR"))

is this correct? which this one is just giving me the l2fc in this level and not the p-values. How can I get the genes that are uniquely differentially expressed in trtNR with their corresponding p-value (or should I run the analysis two times separately for each treatment and then check the shared/unique DEGs).

The second part of my question is about removing the outlier, as you see below, I have many outliers.

>summary(res_time)
out of 42823 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 373, 0.87%
LFC < 0 (down)   : 376, 0.88%
outliers [1]     : 112, 0.26%
low counts [2]   : 11622, 27%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Should I remove this outliers from my result before further analysis?

I could not access ?results because of some memory issue. Can I also define the "cooksCutoff" from this command?

maxCooks <- apply(assays(dds_time4)[["cooks"]], 1, max).

I am using the last update of DEseq2.

Many thanks in advance for your reply.

Kind regards,

Rahel

deseq2 GLM timecourse • 3.6k views

ADD COMMENT • link updated 8.2 years ago by Michael Love 43k • written 9.1 years ago by rahel14350 ▴ 10

0

Entering edit mode

For looking at differentially expressed at each time point I'd suggest reverting to looking at incorporating the factor version of time. That way you'll be able to extract the genes that are differential at specific time-points, with the use of a contrast.

For the across all time points question, a valid approach would be to use the numeric version of of time (I can't think of a valid scenario where you'd want treatment type to be numeric, but it will make no difference, I suspect). In that case, the genes significant for the "Trt_typeR_vs_NR" contrast will roughly correspond to genes that have the same time-profile, but R is consistently higher (or lower) than NR. "Time.Trt_type.R" will look find genes that, colloquially, have different gradients in the different conditions.

I think DESeq2 deals with outliers for you, so I don't think they're a cause for concern.

There are warning signs in your excerpt of code (converting a factor directly to a numeric, rather than via an intermediate character; using LRT comparing a saturated model against a constant model) that makes me think you should probably consult a local statistician to cover some of the subtleties of the analysis, as the design is sufficiently complicated and the power of DESeq2 is such that you may need to check the results you get correspond to the questions you're wanting to ask.

ADD REPLY • link 9.1 years ago Gavin Kelly ▴ 690

0

Entering edit mode

Dear Gavin,

Many thanks for your reply and explanation. I know if I use time as factor is more flexible but it is not any more linear ...

ADD REPLY • link 9.1 years ago rahel14350 ▴ 10

0

Entering edit mode

It's still linear, in the sense of linear model, but it's not looking for a straight-line trend - to get the "at each timepoint" differential genes, you need to use the factor approach, and to get the "across all timepoints", you'll (probably) need the numeric approach. So if you want to know which genes are differential at 24hrs (irrespective of other timepoints), then you'd be best off with a factor-based approach. If you want to know which: genes are responding in a linear manner across all timepoints; which genes have different time-gradients across the timecourse; or which genes are differential at specific timepoints interpolating inbetween your measured points (not recommended!), then you'll need a numeric variable for time.

ADD REPLY • link 9.1 years ago Gavin Kelly ▴ 690

0

Entering edit mode

Many thanks for your explanation. I decided to do factor based analysis and check the output for each time point separately ...

ADD REPLY • link 9.1 years ago rahel14350 ▴ 10

score 2 · Answer 1 · 2017-01-04

The first thing to note is that you don't want to perform an LRT of your full design vs ~1. This is testing if any variables in the design help to explain differences in counts across samples. This is likely not what you want to test. This made sense for the person on Biostars because they were comparing ~time vs ~1, which is testing for whether there are any differences across time, but you have a more complex experimental design.

You may need to consult with a statistician who can help you to formulate the designs and contrasts you want to perform, or you can follow the time series example in our workflow:

www.bioconductor.org/help/workflows/rnaseqGene/#time

"If I want to have linear affect of time on gene expression, I need to introduce time as a numeric covariate rather than a factor"

Are you certain that gene expression will be increasing or decreasing in constant multiplicative amounts with units of time? This means you cannot have saturation effects or U-shaped changes in gene expression over time.

Including time as a factor (as you have it in the above code) is more flexible, and I would recommend this if you have replicates. If you did not have replicates, but a fine grid of time points, I would recommend modeling with splines. This would require you to collaborate with a statistician who is familiar with spline modeling in R. But again, you can just use the factor as you have replicates, and this is the most flexible model for changes across time points.

It doesn't make sense to include treatment as a numeric. It is certainly a categorical variable and modeled in R with factors. You should probably discuss the modeling with a local statistician, as they can help explain why all these choices make sense.