Question

Effectiveness of normalisation if many DE genes using DESeq2

0

Entering edit mode

Alex So • 0

@alex-so-17105

Last seen 5.8 years ago

European Union

Hello,

I’m working with Deseq2 in order to analyse some RNA-seq generated with a plant species. I have a time series experiment with 8 time point and 4 replicates for each time point.

I just have a question concerning the normalisation method used by Deseq2 and his usability with my data. Our plant has around 40.000 genes, and deseq2 says that around 20.000 of these genes are differentially expressed (using a time series design). Around 10.000 are up-regulated and 10.000 are down-regulated.

I read that the normalisation needs that only a small amount of the genes are differentially expressed. Do you think I can use Deseq2 on these data ? Based of the following topic Normalization in DESeq2 I would say that It’s possible if the outputed MAplot is well centred/balanced on the x axxis, but I’m not sure of this… at all...

Thank you in advance and don’t hesitate if you need more informations.

Alex.

deseq2 • 1.2k views

ADD COMMENT • link updated 6.4 years ago by Michael Love 43k • written 6.4 years ago by Alex So • 0

score 1 · Answer 1 · 2018-08-29

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 12 hours ago

United States

For a time series, it could be that each time point has some genes which are up and down, but not enough to be a problem for normalization. But when you ask which are changing at any time point you get a large number. What code have you run to analyze the time series? How do you get 20,000?

ADD COMMENT • link 6.4 years ago Michael Love 43k

0

Entering edit mode

Hello, thanks for your answer.

I will only be able to give you my code tomorrow, but here are some precisions that may be helpfull.

We are looking at the formation of an organ. The data look Like this :

Rep1 2h

Rep2 2h

Rep3 2h

Rep4 2h

Rep1 4h

Rep2 4h

Rep3 4h

Rep4 4h

...

Rep1 24h

Rep2 24h

Etc...

I also ran the test by making pairwise comparisons. If I take time points far from each other (2h vs 24h for exemple) I get à high amount of genes that are differentially expressed (like 20.000). If I take points close to each other’s (2h vs 4h for example) I get less DE genes (around 5000)

Thanks you for your answer.

Alex.

ADD REPLY • link 6.4 years ago Alex So • 0

0

Entering edit mode

Hello michael, tanks you again for you answer,

Here are the precisions you asked for.

I have two input files :

gene_count_matrix.csv :

gene_id,T000A,T000B,T000C,T000D,T012A,T012B,T012C,T012D,…...,T120D,T132A,T132B,T132C,T132D

gene_number_1,0,0,0,0,0,3,0,…….,0,0,0,3

gene_number_2,601,431,343,532,404,333,340,374,…...,37,174,96,102

etc

file_list.txt

"id","time"

"T000A","000-ref"

"T000B","000-ref"

"T000C","000-ref"

"T000D","000-ref"

"T012A","012"

"T012B","012"

"T012C","012"

"T012D","012"

…….

"T132A","132"

"T132B","132"

"T132C","132"

"T132D","132"

Importing the data

countData <- as.matrix(read.csv("gene_count_matrix.csv", row.names="gene_id"))
colData <- read.csv("file_list.txt", sep=",", row.names=1)
all(rownames(colData) %in% colnames(countData))
(true)
countData <- countData[, rownames(colData)]
all(rownames(colData) == colnames(countData))
(true)

Small analyse for time series

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ time)
dds <- DESeq(dds, test="LRT", reduced=~1)
res <- results(dds)
resOrdered <- res[order(res$pvalue),]
resOrdered <- subset(resOrdered, resOrdered$padj < 0.05)

OR Small analyse with default parameters do pairwise comparisons

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ organ)
dds <- DESeq(dds)
res <- results(dds, contrast=c("time", "000-ref", "012"))
resOrdered <- res[order(res$pvalue),]
resOrdered <- subset(resOrdered, resOrdered$padj < 0.05)

Getting the number of DE genes

NROW(resOrdered)
NROW(as.data.frame(subset(resOrdered, log2FoldChange > 0)))
NROW(as.data.frame(subset(resOrdered, log2FoldChange < 0)))

The number of DE gene will change a lot depending of the “contrast” argument I provide. For example, if I take “000-ref” vs “132” (immature organ vs mature organ), I have 20.000 DE genes, and if I take “000-ref” vs “012” (immature organ vs less immature organ), I have 1,000 DE genes. I was wrong with the number using the time series design, as I get arround 28,000 DE genes.

We don’t have a control or something like this. We are just looking at the formation of an organ.

Was this the precisions that you needed ?

Thanks again

Alex.

ADD REPLY • link 6.4 years ago Alex So • 0

0

Entering edit mode

So normalization is a problem if you have some time steps in which all genes are changing relative to the others, but not a problem for the time steps that have “only” 1000 genes DE. Do you have any kind spike in information to deal with the fact that perhaps all genes change across developmental time in your experiment?

ADD REPLY • link 6.4 years ago Michael Love 43k

0

Entering edit mode

Thanks for you answer.

I tried to filter the data a bit more based on log2FoldChange. For the timecourse design, if I take genes with l2fc > 1 and those with l2fc < 1 I get 5,000 et 6,000 genes respectively, from 28,000 genes considered differentially expressed. There is a lot of genes with a very low l2fc.

These data are “a bit” difficult to analyse. We are not really using the results of the DE analyses for distant time points because during the development of the organ, a lot of genes are “moving”, and we are expecting that. We have tones of pattern with genes with no expression at the beginning and high expression at the end, or with the opposite pattern.

Instead we are very often looking at expression pattern for some specific genes in which we are interested in using the “plotCounts” function, or with heatmap of the expression values transformed with the "rld" function. Do you think we can trust the results of these functions for our data ?

What do you mean exacty by : Do you have any kind spike in information to deal with the fact that perhaps all genes change across developmental time in your experiment ?

(my engish skills are limited, i’m not sure to hunderstand this well)

Thanks again !

Alex.

ADD REPLY • link 6.4 years ago Alex So • 0

0

Entering edit mode

If there are global changes in expression, like all the genes being "up-regulated", then you can't rely on normalization which has no reference point. The standard normalization approaches for RNA-seq need some set of genes that are not changing much across samples (they don't have to be exactly 0, but just small LFCs across samples). You can use software like RUVSeq to identify these genes even, if they are potentially a very small set or there are reasons not to trust the built-in methods of the statistical programs, but they have to exist. It sounds like you have some time points where there are no genes that are even close to not changing. When such experiments are performed, sometimes spike-in RNA is used to provide some kind of reference. DESeq2 can make use of these control or reference set if it exists. Can you consult with others who produced the experiment (if there are others who were involved)?

ADD REPLY • link 6.4 years ago Michael Love 43k

0

Entering edit mode

Hello,

I will try to use RUVSeq. These data have been generated by my team before I arrived, but I think that the only thing we have is the raw reads... So it will be hard to find such control. But if RUVSeq can help it will be good.

Here is the MA-plot of the worst condition (first time point vs last time point). Do you think it's that terrible ? Even in this case I see that some thousand of genes are not moving that much, but I don't know if it's enough.

MA plot DESeq2

Tanks for you help

Alex.

ADD REPLY • link 6.4 years ago Alex So • 0

0

Entering edit mode

Without any kind of reference, you will have to rely on the assumption that the y=0 defined by the normalization process is close to true. It assumes that this distribution (imagine a probability density projected onto the y-axis) is centered on log fold change of 0. If all the genes are up-regulated then you will miss this, and without a reference or an idea of a set of genes that are unchanging, there's not much way around this issue.

ADD REPLY • link 6.4 years ago Michael Love 43k

0

Entering edit mode

Tanks for this answer.

I will look for known housekeeping genes in the literature. If I understand well, It will be possible to improve the normalisation by telling to DESeq2 that "This small list of genes is known to have almost the same level of expression every time" ?

My team have charged me to look for a gene with an almost constant level of expression among all condition. I found one which was very consistent and highly expressed. They are using it for q-pcr and it seems to be OK. This gene is a widely known housekeeping gene, I hope that it's a good sign.

Can I give a list to DESeq2 as a parameter or is it more complicated than that ?

Tanks a lot !

Alex.

ADD REPLY • link 6.4 years ago Alex So • 0

0

Entering edit mode

Yes you would run estimateSizeFactors() before DESeq() and use the controlGenes argument.

ADD REPLY • link 6.4 years ago Michael Love 43k

0

Entering edit mode

Tanks a lot, I'm working on the identification of such genes. I have already found some candidates and I tried running deseq2 by including them. I'm doing this :

countData <- as.matrix(read.csv("gene_count_matrix.csv", row.names="gene_id"))
colData <- read.csv("file_list", sep=",", row.names=1)
all(rownames(colData) %in% colnames(countData))
countData <- countData[, rownames(colData)]
all(rownames(colData) == colnames(countData))
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ organ)
listConstantGenes<-c("gene1","gene2","gene3"...)
dds <- estimateSizeFactors(dds, controlGenes=rownames(countData) %in% listConstantGenes)
dds <- DESeq(dds)
res <- results(dds)
etc.

I hope that it's OK (if the choice of the gene is good of course)

Tanks a lot for your help.

Alex

ADD REPLY • link 6.4 years ago Alex So • 0

0

Entering edit mode

Yes that’s how it should look

ADD REPLY • link 6.4 years ago Michael Love 43k