GLM model on time course analysis and removing the outliers in Deseq2
1
1
Entering edit mode
rahel14350 ▴ 10
@rahel14350-8472
Last seen 5.1 years ago
United States

Dear Michael,


Please accept my apologize in advance for such a lengthy question. I have a data set of 200 RNAseq samples from an experiment on two treatments(trtR/trtNR) and the sampling is done at 7 different time-points. I have different number of replicates: for trtR two replicates and for trtNR 5 replicates for each time-points (some time number of replicate is more).


I want to employ a generalized linear model that allows me to quantify the genes that are differentially expressed at each time point and also across all time points. Then a likelihood ratio test to identify differentially expressed genes. I learned from this question https://www.biostars.org/p/133304/ that if I want to have linear affect of time on gene expression, I need to introduce time as a numeric covariate rather than a factor. Therefore, here is my model:

>Time=as.numeric(sampleTable3$Timepoint)
>Trt_type=sampleTable3$Treatment_type

>coldata <- data.frame(Time, Treatment_type)
>rownames(coldata) <- colnames(txi$counts)

>ddsTxi_time <- DESeqDataSetFromTximport(txi, colData=coldata, design=~ Time + Trt_type:Time + Trt_type)

>colData(ddsTxi_time)$Time<-factor(colData(ddsTxi_time)$Time, levels=c("0", "4", "24", "72", "336", "1008","2352"))

>ddsTxi_time= ddsTxi_time[rowSums(counts(ddsTxi_time))>1,]
>dds_time <- DESeq(ddsTxi_time, test="LRT", reduced = ~ 1)
>res_time <- results(dds_time)

>colData(dds_time)
DataFrame with 63 rows and 3 columns
                         Time2 Treatment_Response2 replaceable
                     <numeric>        <factor>   <logical>
F02242_NR_72h_23        72              NR       FALSE
F02245_NR_14w_23      2352              NR       FALSE
...                    ...             ...        ...
F02382_R_14w_29       2352               R        TRUE
F02384_R_72h_29         72               R       FALSE

>resultsNames(dds_time)
[1] "Intercept"                         "Time"                              
[3] "Trt_typeR_vs_NR"             "Time.Trt_type.R"

mcols(res_time5)$description
[1] "mean of normalized counts for all samples"                                          
[2] "log2 fold change (MLE): Time.Trt_type.R"                             
[3] "standard error: Time.Trt_type.R"                                     
[4] "LRT statistic: '~ Time + Trt_type:Time + Trt_type' vs '~ 1'"
[5] "LRT p-value: '~ Time + Trt_type:Time + Trt_type' vs '~ 1'"  
[6] "BH adjusted p-values"                                                           

1: Is this model correct or should I introduce treatment type as a numeric covariate as well?
2: The L2FC is coming from Time.Treatment typeR wich it is the l2fc from the genes with lowest p-value over time effect in treatmentR. Correct?

For having the effect of time on Trt_type.NR (l2fc and p-value), I did:

>results(dds_time, contrast=list(c("Time","Trt_typeR_vs_NR"))

is this correct? which this one is just giving me the l2fc in this level and not the p-values. How can I get the genes that are uniquely differentially expressed in trtNR with their corresponding p-value (or should I run the analysis two times separately for each treatment and then check the shared/unique DEGs).

The second part of my question is about removing the outlier, as you see below, I have many outliers.

>summary(res_time)
out of 42823 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 373, 0.87%
LFC < 0 (down)   : 376, 0.88%
outliers [1]     : 112, 0.26%
low counts [2]   : 11622, 27%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Should I remove this outliers from my result before further analysis?

I could not access ?results because of some memory issue. Can I also define the "cooksCutoff" from this command?

maxCooks <- apply(assays(dds_time4)[["cooks"]], 1, max).

I am using the last update of DEseq2.

Many thanks in advance for your reply.

Kind regards,

Rahel

deseq2 GLM timecourse • 2.6k views
ADD COMMENT
0
Entering edit mode

For looking at  differentially expressed at each time point I'd suggest reverting to looking at incorporating the factor version of time. That way you'll be able to extract the genes that are differential at specific time-points, with the use of a contrast.

For the across all time points question, a valid approach would be to use the numeric version of of time (I can't think of a valid scenario where you'd want treatment type to be numeric, but it will make no difference, I suspect). In that case, the genes significant for the "Trt_typeR_vs_NR"  contrast will roughly correspond to genes that have the same time-profile, but R is consistently higher (or lower) than NR. "Time.Trt_type.R" will look find genes that, colloquially, have different gradients in the different conditions.

I think DESeq2 deals with outliers for you, so I don't think they're a cause for concern.

There are  warning signs in your excerpt of code (converting a factor directly to a numeric, rather than via an intermediate character; using LRT comparing a saturated model against a constant model) that makes me think you should probably consult a local statistician to cover some of the subtleties of the analysis, as the design is sufficiently complicated and the power of DESeq2 is such that you may need to check the results you get correspond to the questions you're wanting to ask.

ADD REPLY
0
Entering edit mode

Dear Gavin, 

Many thanks for your reply and explanation. I know if I use time as factor is more flexible but it is not any more linear ... 

ADD REPLY
0
Entering edit mode

It's still linear, in the sense of linear model, but it's not looking for a straight-line trend - to get the "at each timepoint" differential genes, you need to use the factor approach, and to get the "across all timepoints", you'll (probably) need the numeric approach.  So if you want to know which genes are differential at 24hrs (irrespective of other timepoints), then you'd be best off with a factor-based approach.  If you want to know which: genes are responding in a linear manner across all timepoints; which genes have different time-gradients across the timecourse; or which genes are differential at specific timepoints interpolating inbetween your measured points (not recommended!), then you'll need a numeric variable for time.

ADD REPLY
0
Entering edit mode

Many thanks for your explanation. I decided to do factor based analysis and check the output for each time point separately ...

ADD REPLY
2
Entering edit mode
@mikelove
Last seen 1 hour ago
United States

The first thing to note is that you don't want to perform an LRT of your full design vs ~1. This is testing if any variables in the design help to explain differences in counts across samples. This is likely not what you want to test. This made sense for the person on Biostars because they were comparing ~time vs ~1, which is testing for whether there are any differences across time, but you have a more complex experimental design.

You may need to consult with a statistician who can help you to formulate the designs and contrasts you want to perform, or you can follow the time series example in our workflow:

www.bioconductor.org/help/workflows/rnaseqGene/#time

"If I want to have linear affect of time on gene expression, I need to introduce time as a numeric covariate rather than a factor"

Are you certain that gene expression will be increasing or decreasing in constant multiplicative amounts with units of time? This means you cannot have saturation effects or U-shaped changes in gene expression over time.

Including time as a factor (as you have it in the above code) is more flexible, and I would recommend this if you have replicates. If you did not have replicates, but a fine grid of time points, I would recommend modeling with splines. This would require you to collaborate with a statistician who is familiar with spline modeling in R. But again, you can just use the factor as you have replicates, and this is the most flexible model for changes across time points.

It doesn't make sense to include treatment as a numeric. It is certainly a categorical variable and modeled in R with factors. You should probably discuss the modeling with a local statistician, as they can help explain why all these choices make sense.

 

ADD COMMENT
0
Entering edit mode

Dear Michael,

Many thanks for your time and reply. I understood and I will try to find a statistician around to help me further with a linear design of changes across time.

Then, If I want to continue with DeSeq2, I go for Time as a factor and this design:

DESeqDataSetFromTximport(txi, colData=coldata, design=~ Time + Trt_type:Time + Trt_type)

DESeq(ddsTC, test="LRT", reduced = ~ time + Trt_type)

To check how time is effected the gene expression under different treatment type. is that correct?

Many thanks again, Rahel

ADD REPLY
1
Entering edit mode

That looks correct, to test for any differences in gene expression over time attributed to treatment, accounting for a potential constant fold change difference present at time=0.

(Usually we recommend to put the interaction term at the end of the design formula, in case you were to perform a Wald test, and not specify which coefficient to test, results() would correctly guess the last term in the design formula.)

ADD REPLY

Login before adding your answer.

Traffic: 949 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6