Search
Question: Design and independent filtering of patient RNA-seq data
0
3.3 years ago by
mat.lesche40
Germany
mat.lesche40 wrote:

Hello,

I'm dealing with data sets which come from patient data and I need some advice for the formula design to do a gene expression analysis. I don't want to compare the data sets between each other!

The data sets consist of two groups, sample which have a disease and samples which don't have it. It looks like this:

Sample Status
1 healthy
2 healthy
3 healthy
4 healthy
5 healthy
6 sick
7 sick
8 sick
9 sick
10 sick

What I would do know is the following design

dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ Status)

But I'm quite sure this is not the right way to go, because I'm dealing with human samples and they do not only differ by Status. They have other variances. A PCA confirms this as well. There is not real clustering between healthy and sick. The samples are spread all other the coordinate system and sick and healther cluster within each other. What I do see is a separation between samples on the x-axis. There is a gap between the samples and I have a cluster of healthy/sick samples, then the gap and again healthy/sick cluster. I assume it might be a batch effect. So here are my question.

1. Is it worth to account for unknown batches? Like it is described here: http://www.bioconductor.org/help/workflows/rnaseqGene/#batch

I'm not a big fan trying to see something which might not be there which is why I have my doubts about such procedures.

2. What kind of design can I do? And can I somehow account for samples differences? A design like Sample + Status doesn't work. Or should I do likelihood ratio test with DESeq2?

I have additional information for some of the data sets, like age of the patients, bmi index, disease status but I'm not sure how to work these into a design formula. For example how would I build I factor for Age (where would be the group separation age 25 or 40 or 60...)

One data set is from amplified samples and when I run DESeq2 with the design which I mentioned above, I'll get 1 differentially expressed gene at a FDR of 0.1. What I see is that a lot of genes get filtered due to independent filtering:

out of 52719 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 1, 0.0019%
LFC < 0 (down)   : 0, 0%
outliers [1]     : 0, 0%
low counts [2]   : 49725, 94%
(mean count < 1012.9)

which seems quite high. If I turn independentFiltering off I get 0 genes. What I'd like to know is why the threshold is so high.

Thx

Mathias

modified 3.3 years ago by James W. MacDonald47k • written 3.3 years ago by mat.lesche40
3
3.3 years ago by
United States
James W. MacDonald47k wrote:

As for question #1, you have a PCA plot that indicates either very little differences, or possibly true biological differences obscured by some uncontrolled technical variability (or a combination thereof). Given that you have evidence for only one gene, wouldn't you want to see if sva helps? Is there a downside here that I am not seeing?

I am also not sure why you have doubts about such procedures. Is there something particular about sva that you don't like?

0
3.3 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson6.8k wrote:

I believe that the independent filtering chooses the threshold that maximizes the number of genes that pass the 10% FDR threshold. As you have noticed, this doesn't work well when that maximum is one. It sounds to me like there is no detectable signal for the healthy vs sick comparison.

Yes, Ryan is correct. And in fact, we are planning to tweak the filtering process this devel cycle: in the case where there are only e.g. single digit gains in #(adj p < alpha), to avoid aggressive filtering (which would give here 0 adjusted p-values < alpha).

hey, thx for the answer. any comment on question number 2? Or is there nothing I can do than just compare sick vs healthy

I wouldn't go exploring phenotypic covariates to include to reduce the variance, because this kind of a procedure isn't really safe from a statistical point of view. The main conclusion is that, if there are differences, you need more sample size to detect them, or you need to account for technical variation (either with batch correction or improvements to protocol).

Alternatively, if you do EDA and figure out that, for example, there are age-based differences, you could add a factor variable of age categories to the model, to control for this. But I would caution against looking at all possible covariate (unless you do a final analysis on new samples), because you will end up overfitting to these 10 samples.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Michael Love19k

Hey Michael,

thx for the answer and I have to apology for the sample overview. It was an example. Just to be precise, I have about 40 samples per experiment and half of them are from healthy patients, the others from sick patients.

But what do you mean with EDA?

It is just if you look at a PCA of patient data and you don't see a clustering at all between the two conditions, it makes you doubt that the disease has some bigger impact and will be hard to trace with a gene expression analysis.

Yes, by EDA I mean, for example, PCA to look for major differences between samples. By changing the color/shape of points, you can see the contribution across genes of different covariates to differences in counts.

Hi Michael,

I understand what you mean. this is what the pca looks like at the moment: http://www.biotec.tu-dresden.de/sharing/t/upload41ovSC/pca.pdf Right now I have only the information sick and healthy and I'm waiting to get more information. I don't think I can't do more at the moment. The only thing I could try to find out with sva would be the gap on the x-axis between the 8 samples on the right and the rest.

If apply sva, I only followed rnaseqGene workflow on bioconductor, then the pca looks like this:

If I interpret this correctly, then SV1 shows the batch effect.

thanks

mathias

ADD REPLYlink modified 3.2 years ago • written 3.3 years ago by mat.lesche40

The Principal Component Analysis is important for EDA and quality control - but I wouldn't necessarily give up on the analysis if the disease and controls don't show a clear separation in a PCA plot. It's feasible that the difference only affects a small number of genes (which wouldn't dominate the PCA). Looking for DE genes in a supervised manner (i.e. DESeq2) could find them. And as said above, you might (but no guarantee) gain power by including batch covariates, ideally explicitly known ones, perhaps also ones fit from, e.g., PCA.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Wolfgang Huber13k

I'm sorry to ask this but what do you mean with covariates. Is it for example if I divide my samples in two groups. One consists of the 8 sample on left-side of the first principle component (see pdf), the other group is the rest and I add it to the design like

~ Batch + Status

Thx

Mathias

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by mat.lesche40

hi Mat,

You shouldn't come up with covariates by eye.

By covariates, we mean getting the actual batch information from the lab (what day the samples were processed, etc), this would be considered a technical covariate, or biological covariate like sex, age, etc.

Or you can use the sva software to estimate the batches in a statistically principled way. See the svaseq() function in the sva package. We have an example of incorporating sva surrogate variables here:

http://master.bioconductor.org/help/workflows/rnaseqGene/#batch

Yes okay, I wouldn't that. But let's say I know the data of sample and I add this as a covariate, then it may still not be visible in a pca?

I followed your guideline and applied sva (see http://www.biotec.tu-dresden.de/sharing/t/uploadwjRMyu/pca.sva.pdf ). The problem is, that I don't really understand the method. I'll have to read up on that. For example how many surrogate variables n.sv variable). It's not really explained in the help page in R either.

There is also one other problem with another data set. This time I'll have about 15,000 DEGs and I know have to filter on lfcThreshold and/or altHypothesis. But before I ask about guidance in this case, I'll have to read the vignette again and check for related bioconductor posts.

thx again!