#### The support.bioconductor.org editor has been updated to markdown! Please see more info at: Tutorial: Updated Support Site Editor

Question: data preparation for timecourse
0
4.3 years ago by
b.curran0
New Zealand
b.curran0 wrote:

I'm using the timecourse package in bioconductor. I have been obtaining size factors for each of the samples from DESeq's sizeFactors() before log transforming the raw data and using this in mb.long() from timecourse.

Should I be doing this or is timecourse a package which handles the sample size normalization by itself?  I ask because I recently re-ran some of my samples forgetting to use the sizeFactor() adjustment and the top genes are much better at reflecting trends seen in other analyses. Which makes me curious.

Cheers

Ben.

timecourse • 749 views
modified 4.3 years ago by Dario Strbenac1.4k • written 4.3 years ago by b.curran0
0
4.3 years ago by
United States
James W. MacDonald49k wrote:

How are you using the size factors with timecourse? I am not sure that there is a direct way to use RNA-Seq data with this package, and would imagine that you are violating assumptions.

In general, if you don't have enough replication for the CLT to kick in, you have to either model RNA-Seq data using a negative binomial as reference distribution (using DESeq2, edgeR, baySeq, etc) or estimate observation-level weights that you can use to account for heteroscedasticity if you want to use regular linear modeling (voom() in the limma package). Simply adjusting for library depth and log transforming the data is not likely to be sufficient.

Is there a reason that you don't want to use DESeq2 directly for your analysis?

I am indeed using DESeq2 for my analysis.

I have 57 samples of RNASeq data from two treatments spread over 10 time points (only got 3 replicates for control treatment at time zero :-/). When I run DeSeq2 pairwise, longitudinally along the inoculated samples, there is a bunch of interesting changes around the 72hr mark. Using timecourse is meant to be another way of looking at the data to visualise what's happening. My primary means of identification of genes of interest though is DESeq2.

I was using the size factors from DESeq to adjust for library size before log transforming the data for timecourse. Something like:

sizeFactors <- sizeFactors(dds)

M3 <- as.matrix(counts[1:39039,])

M3<-scale(M3, center=FALSE, scale=sizeFactors)

M3<-log(M3+1)

It became interesting when I ran it without the normalisation step, just with log transforming the data, I found an overlap of about 30 or so genes, many of them pathogen response related proteins and transcription factors. There was also a considerable overlap (~100) of genes identified by timecourse and betr (fdr->.01, probability >= 0.99) as being interesting. Of course, this means nothing if I've got the input for timecourse wrong (garbage in, garbage out). It was just one of those moments that make you stop and go "oooo, look at that, that's interesting".

So if I transform my raw counts with voom, then log transform, that would be sufficient for using RNASeq data in timecourse?

Slight detour: heteroscdasticity in RNASeq data. My understanding of it is that saying that the variation of each individual genes is different at each timepoint? If that is the case, and I correct for it, how do I know that I'm correcting for something that is due to library size - i.e. a small number of genes skewing the read counts/leaving less room for everything else, or if it's an actual biological ... thing?

Basically, if I compare 3 replicates at a time point A with 3 replicates at time point B, the variance is always going to be different, even if the conditions are the same. In an ideal world they wouldn't be, if all environmental conditions were the same, the variation would also be the same. Plants don't behave like that though. So how do I tell if it's a technical thing I'm correcting for or if it's just plants being plants.

Unless of course, I've got my understanding of heteroscedasticity all wrong.

Cheers

Ben.

hi Ben,

"Using timecourse is meant to be another way of looking at the data to visualise what's happening."

A few questions: Can you explain what biological questions you are interested in answering, e.g. what kind of genes you are interested in finding? DESeq2 (as well as other packages mentioned in this thread) can support testing a variety of hypotheses. Can you show table(dds$time, dds$condition)? Are there any other covariates other than condition and time?

Primarily pathogen response genes. And I have been finding them with DESeq and EdgeR. The primary attraction of Timecourse was that I was thinking it would be good for finding genes D.E over the timecourse as a whole rather than all the pairwise comparisons I had done with DESeq. And the nifty expression profile pictures it spat out. Given that it's not really appropriate, though ...

No other covariates no.

table(dds$condition) Control Psa 15 15 table(dds$time)

120  24  48  72  96
6   6   6   6   6 

Cheers

Ben.

I think you are misunderstanding heteroscedasticity. The problem with RNA-Seq data is that the variability of the measure is different when you have really low counts, as compared to when you have lots of counts. Since most statistics assume that the two groups being compared have the same variability, if that assumption isn't met you could think there was a difference between the mean expression when in fact the difference was due to changes in variability.

All the methods mentioned here (DESeq2, edgeR, voom, and EBSeqHMM) are based on the idea that the count data are 'over dispersed' and have non-constant variation, and account for that as best they can. Unfortunately, timecourse doesn't (as far as I can tell) use observational-level weights, so simply using voom() first won't help (voom log transforms the data but then also computes observation-level weights that can then be passed into e.g., lm.fit, to account for the heteroscedasticity. I don't believe timecourse does that).

Long story short, timecourse is almost surely not the package for you to use, because you are violating assumptions, which will result in incorrect results.

Thank you for that though, that explains it quite clearly. In as much as that is a concept that I had grasped when reading the DESeq paper. I think some of the other links regarding heteroscedasticity somewhat confused me.

0
4.3 years ago by
Dario Strbenac1.4k
Australia
Dario Strbenac1.4k wrote:

EBSeqHMM is designed for this analysis. Timecourse is not a suitable software package.