Question

DESeq2 for Ribosomal profiling analysis

1

Entering edit mode

jxu006 ▴ 10

@jxu006-9977

Last seen 8.8 years ago

Hi Mike,

I am trying to use DESeq2 to do some analysis on differential translation efficiency analysis using Ribosomal profiling and RNAseq datasets. I have two group of samples (control and treatment) with 3 replicates each group. I have both RNAseq and Ribosome profiling datasets. My analysis goal is to identify a gene list which are genes with significantly changed translation efficiency. Translation efficiency=(RPKM of Riboseq)/(RPKM of RNASeq)

I have compiled the read counts file count_table which is something like this:

    GL1.Riboseq    GL2.Riboseq    GL3.Riboseq    PM1.Riboseq    PM2.Riboseq    PM3.Riboseq    GL1.RNAseq    GL2.RNAseq    GL3.RNAseq    PM1.RNAseq    PM2.RNAseq    PM3.RNAseq

LOC100506869    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0    0    0    0    0    0

LOC102724604    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0    0    0    0    0    0

MTVR2    3.00000000    1.00000000    1.00000000    1.00000000    1.00000000    1.00000000    1    1    3    0    1    1

LOC100506860    0.00000000    2.00000000    0.00000000    2.00000000    0.00000000    0.00000000    5    7    8    7    9    10

LOC102724601    0.00000000    0.00000000    0.00000000    1.00000000    0.00000000    0.00000000    0    1    1    5    1    2

ATRX    468.00000000    451.00000000    352.00000000    463.00000000    157.00000000    65.00000000    555    384    495    350    815    389

LOC441204    0.00000000    1.00000000    2.00000000    1.00000000    3.00000000    1.00000000    0    0    0    0    2    0

TCOF1    1369.00000000    1377.00000000    969.00000000    1275.00000000    611.00000000    330.00000000    5680    5988    5520    7483    7086    7913

NSRP1    307.00000000    259.00000000    251.00000000    328.00000000    124.00000000    66.00000000    651    583    574    493    837    511

SPPL3    124.00000000    103.00000000    97.00000000    117.00000000    44.00000000    27.00000000    1099    1252    1203    1349    1256    1260

I also compiled the sample table sample_table which is something like this:

           condition sampleType

control1.Riboseq        control    riboseq

control2.Riboseq        control    riboseq

control3.Riboseq        control    riboseq

treatment1.Riboseq        treatment    riboseq

treatment2.Riboseq        treatment    riboseq

treatment3.Riboseq        treatment    riboseq

control1.RNAseq          control   rnaseq

control2.RNAseq         control     rnaseq

control3.RNAseq         control     rnaseq

treatment1.RNAseq         treatment     rnaseq

treatment2.RNAseq        treatment     rnaseq

treatment3.RNAseq         treatment     rnaseq

My code:

dds <- DESeqDataSetFromMatrix(countData = count_table, colData = sample_table,design = ~sampleType+condition+sampleType:condition)

dds <- DESeq(dds)

res <- results(dds)

resOrdered <- res[order(res$padj),] 

write.delim(as.data.frame(resOrdered),file="diffTE_DESeq2.txt",row.names = T)

The result file diffTE_DESeq2.txt shows:

"baseMean" "log2FoldChange" "lfcSE" "stat" "pvalue" "padj"

"EIF4B" 6332.26385509347 -0.668271435938997 0.0719422396978361 -9.28899960226144 1.55745244593353e-20 1.59950366197374e-16

"RPL7" 18760.8711555484 -0.664037246334803 0.0991791168656775 -6.69533332540294 2.15180579951482e-11 1.10495227805086e-07

"RPS3A" 18548.9009854286 -0.691914863488827 0.105378268654482 -6.56601092733359 5.16810250882598e-11 1.76921375885476e-07

"TPT1" 12666.0153749445 -0.66412280362569 0.104253670496368 -6.3702582409204 1.88710186387301e-10 4.84513403549395e-07

"EEF1B2" 6012.12004960793 -0.533457411938409 0.0859752788772395 -6.2047767556545 5.47746012346194e-10 1.12507030935908e-06

1). Is the analysis correctly done?

2). The log2Foldchange seems to be opposite, which means it shows the log2Foldchange of Control vs Treatment. I want to get the log2Foldchange of Treatment vs Control. How should I revise the code?

3). Also, What does the baseMean mean under this analysis?

Thanks,

deseq2 • 5.6k views

ADD COMMENT • link 8.8 years ago • updated 8.7 years ago jxu006 ▴ 10

score 0 · Answer 1 · 2016-03-23

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 6 days ago

United States

Yes the code is correct.

Regarding the LFC sign, you should read the section of the vignette, "Note on factor levels", which is also mentioned in the workflow (search for "relevel")

vignette("DESeq2")

The meaning of a column of the results table can be looked up by asking for the metadata on the columns:

mcols(res)

ADD COMMENT • link 8.8 years ago Michael Love 43k

0

Entering edit mode

Thanks, Mike. I read the "Note on factor levels". I am still a little bit confused. I did not specify the factor levels in my code. Deseq should compare based on the alphabetical order of the levels. That means DEseq should have compared the control and treatment and produced the log2FC of (treatment vs control). But from the results, I can tell the log2FC is control vs treatment, but not treatment vs control. Am I understanding something wrong? My apologize if this is a minor issue.

ADD REPLY • link 8.8 years ago jxu006 ▴ 10

0

Entering edit mode

Yes, it is "treatment" over "control" from alphabetic order, but it is also using "rnaseq" over "ribosomal", whereas you said you want the opposite (ribo / RNA). You need both ratios to be correct or else the sign will be flipped from what you are expecting.

ADD REPLY • link 8.8 years ago Michael Love 43k

0

Entering edit mode

Thanks a lot, Mike. I got it. I changed the code and now the log2FC shows correct comparison. I used padj 0.05 as the cut off for significant genes. I noticed that the gene list number decreased from 600 to 500 using padj 0.05 cutoff compared with my previous code. Is this normal? After I corrected the comparison order, the gene number also changed. Is this reasonable?

ADD REPLY • link 8.8 years ago jxu006 ▴ 10

0

Entering edit mode

Yes, it is expected that there would be less genes at a more stringent FDR threshold.

The number would not change after changing the order of the levels of the factors. You should make certain that there was nothing else in your code that you changed in between analyses.

ADD REPLY • link 8.8 years ago Michael Love 43k

0

Entering edit mode

The code is changed from:

dds <- DESeqDataSetFromMatrix(countData = count_table, colData = sample_table,design = ~sampleType+condition+sampleType:condition)

dds <- DESeq(dds)

res <- results(dds)

resOrdered <- res[order(res$padj),] 

write.delim(as.data.frame(resOrdered),file="diffTE_DESeq2.txt",row.names = T)

To

dds <- DESeqDataSetFromMatrix(countData = count_table,colData = sample_table,design = ~ sampleType+condition+sampleType:condition)

dds$condition <- factor(dds$condition, levels=c("GL","PM"))

dds$sampleType <- factor(dds$sampleType, levels=c("rnaseq","riboseq"))

dds <- DESeq(dds)

res <- results(dds)

resOrdered <- res[order(res$padj),]

write.table(as.data.frame(resOrdered),file="diffTE_DESeq2.txt",row.names = T)

I did not change anything else. But the gene number decreased from 600 to 500 using the same padj 0.05 criterion. Do you know why?

ADD REPLY • link 8.8 years ago jxu006 ▴ 10

0

Entering edit mode

Are you certain that it is different? Can you show the full code and output:

table(res1$padj < 0.05) 

table(res2$padj < 0.05)

ADD REPLY • link 8.8 years ago Michael Love 43k

0

Entering edit mode

From:

library(DESeq2)

count_table <- read.table("combineCount.txt",sep="\t",header=T,row.names=1)

samplesCondition <- c("GL","GL","GL","PM","PM","PM","GL","GL","GL","PM","PM","PM")

sampleType <- c(rep("riboseq",6),rep("rnaseq",6))

sample_table <- data.frame(sampleName = colnames(count_table),condition=samplesCondition,sampleType=sampleType,row.names=1)

dds <- DESeqDataSetFromMatrix(countData = count_table,colData = sample_table,design = ~ sampleType+condition+sampleType:condition)

dds <- DESeq(dds)

res <- results(dds)

resOrdered <- res[order(res$padj),]

write.delim(as.data.frame(resOrdered),file="diffTE_DESeq2.txt",row.names = T)

To

library(DESeq2)

count_table <- read.table("combineCount.txt",sep="\t",header=T,row.names=1)

samplesCondition <- c("GL","GL","GL","PM","PM","PM","GL","GL","GL","PM","PM","PM")

sampleType <- c(rep("riboseq",6),rep("rnaseq",6))

sample_table <- data.frame(sampleName = colnames(count_table),condition=samplesCondition,sampleType=sampleType,row.names=1)

dds <- DESeqDataSetFromMatrix(countData = count_table,colData = sample_table,design = ~ sampleType+condition+sampleType:condition)

dds$condition <- factor(dds$condition, levels=c("GL","PM"))

dds$sampleType <- factor(dds$sampleType, levels=c("rnaseq","riboseq"))

dds <- DESeq(dds)

res <- results(dds)

resOrdered <- res[order(res$padj),]

write.table(as.data.frame(resOrdered),file="diffTE_DESeq2.txt",row.names = T)

table(res1$padj<0.05)

FALSE TRUE

9688 582

table(res2$padj<0.05)

FALSE TRUE

9777 494

ADD REPLY • link 8.8 years ago jxu006 ▴ 10

0

Entering edit mode

Sorry to keep going back and forth, but the point of my question was to see if I can spot a difference (which maybe snuck in accidentally) between your building of res1 and res2. But you copied some code which doesn't have res1 or res2 defined.

If you could, just run this code exactly as I have it:

dds <- DESeqDataSetFromMatrix(countData = count_table,colData = sample_table,design = ~ sampleType+condition+sampleType:condition)
dds1 <- dds
dds1 <- DESeq(dds1)
res1 <- results(dds1)
dds2 <- dds
dds2$condition <- factor(dds2$condition, levels=c("GL","PM"))
dds2$sampleType <- factor(dds2$sampleType, levels=c("rnaseq","riboseq"))
dds2 <- DESeq(dds2)
res2 <- results(dds2)
table(res1$padj<0.05)
table(res2$padj<0.05)

ADD REPLY • link 8.8 years ago Michael Love 43k

0

Entering edit mode

I ran it and it shows:

> table(res1$padj<0.05)

FALSE TRUE

9688 582

> table(res2$padj<0.05)

FALSE TRUE

9777 494

Do you know what is going on?

ADD REPLY • link 8.8 years ago jxu006 ▴ 10

0

Entering edit mode

No this is strange, can you send me the dds object via email so I can try to debug? save(dds, file="dds.rda")

ADD REPLY • link 8.8 years ago Michael Love 43k

score 0 · Answer 2 · 2016-03-24

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 6 days ago

United States

hi,

I'm adding a new answer, just to have more space to copy my code.

Using the current version of DESeq2, I cannot reproduce the discrepancy you see:

> library(DESeq2)
> packageVersion("DESeq2")
[1] ‘1.10.1’

I need this line of code, in order to work with a dds built under certain older versions of DESeq2:

dds <- updateObject(dds)

Then:

dds1 <- dds
dds1 <- DESeq(dds1)
res1 <- results(dds1)

dds2 <- dds
dds2$condition <- factor(dds2$condition, levels=c("GL","PM"))
dds2$sampleType <- factor(dds2$sampleType, levels=c("rnaseq","riboseq"))
dds2 <- DESeq(dds2)
res2 <- results(dds2)

Gives me:

> table(res1$padj<0.05, res2$padj<0.05)
       
        FALSE TRUE
  FALSE  9999    0
  TRUE      0  444

The stat, pvalue and padj from res1 and res2 fall on y=x if you plot them, and have on the order of 1e-6 numerical difference in values.

You should update to the latest version of R (now 3.2.4), by going downloading from CRAN:

https://cran.r-project.org/

Then install the latest DESeq2 with:

source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")

Note that this will replace your current Bioconductor installation with the latest version, but you should always be performing any new analyses with the latest version of Bioconductor. For example, sometimes I see DESeq2 version 1.0 or 1.2 floating around, but these early releases were from 2013 and are very much out of date now. There has been 3 years of bug fixes and development to converge to the stable version of DESeq2 that is currently available (few algorithmic changes since 1.4, which is the version that was used in the DESeq2 publication in 2014).

ADD COMMENT • link 8.8 years ago Michael Love 43k

0

Entering edit mode

Thanks. I have updated my R to R 3.2.4. Then I did:

source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")

But it installed the DESeq2 1.2.10 version. I tried a couple of times, it only installed the 1.2.10.

Do you know why?

ADD REPLY • link 8.8 years ago jxu006 ▴ 10

0

Entering edit mode

The message is below:

> source("http://bioconductor.org/biocLite.R")

Bioconductor version 2.13 (BiocInstaller 1.12.1), ?biocLite for help

A newer version of Bioconductor is available for this version of R,

?BiocUpgrade for help

> biocLite("DESeq2")

BioC_mirror: http://bioconductor.org

Using Bioconductor version 2.13 (BiocInstaller 1.12.1), R version 3.2.4.

Temporarily using Bioconductor version 2.13

Installing package(s) 'DESeq2'

trying URL 'http://bioconductor.org/packages/2.13/bioc/src/contrib/DESeq2_1.2.10.tar.gz'

Content type 'application/x-gzip' length 1399856 bytes (1.3 MB)

ADD REPLY • link 8.8 years ago jxu006 ▴ 10

0

Entering edit mode

What's your sessionInfo() output?

ADD REPLY • link 8.8 years ago Michael Love 43k

0

Entering edit mode

Sorry, I missed your second post, you should follow the information trail printed after source()

ADD REPLY • link 8.8 years ago Michael Love 43k

0

Entering edit mode

Thanks, Fixed.

For my analysis of identification of differential translational efficiencies genes, do you think it is useful to pre-filtering the low counts genes before running DESeq2?

ADD REPLY • link 8.8 years ago jxu006 ▴ 10

0

Entering edit mode

There have been some very recent discussions on this exact question on the site, if you just look up recent posts tagged DESeq2.

ADD REPLY • link 8.8 years ago Michael Love 43k

0

Entering edit mode

I notice that you posted another post:

DESeq2 testing ratio of ratios (RIP-Seq, CLIP-Seq, ribosomal profiling)

In your post, you said that you used the code：

dds <- DESeqDataSet(se, design= ~ assay + condition + assay:condition)
dds <- DESeq(dds, test="LRT", reduced= ~ assay + condition)
results(dds)

For my code,

dds <- DESeq(dds)

I did not used the parameter test="LRT", recused=~assay+condition.

Is this a problem?

Thanks,

ADD REPLY • link 8.8 years ago jxu006 ▴ 10

0

Entering edit mode

Not a problem. You can use either the Wald or likelihood ratio test here. They are different tests but will give similar results here. It is your choice as analyst which to perform. See the vignette.

ADD REPLY • link 8.8 years ago Michael Love 43k

score 0 · Answer 3 · 2016-04-03

Hi Mike,

I have another question:

For the ribosome profiling and RNAseq analysis I mentioned above, the count table is:

GL1.Riboseq    GL2.Riboseq    GL3.Riboseq    PM1.Riboseq    PM2.Riboseq    PM3.Riboseq    GL1.RNAseq    GL2.RNAseq    GL3.RNAseq    PM1.RNAseq    PM2.RNAseq    PM3.RNAseq

LOC100506869    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0    0    0    0    0    0

LOC102724604    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0    0    0    0    0    0

MTVR2    3.00000000    1.00000000    1.00000000    1.00000000    1.00000000    1.00000000    1    1    3    0    1    1

LOC100506860    0.00000000    2.00000000    0.00000000    2.00000000    0.00000000    0.00000000    5    7    8    7    9    10

LOC102724601    0.00000000    0.00000000    0.00000000    1.00000000    0.00000000    0.00000000    0    1    1    5    1    2

ATRX    468.00000000    451.00000000    352.00000000    463.00000000    157.00000000    65.00000000    555    384    495    350    815    389

LOC441204    0.00000000    1.00000000    2.00000000    1.00000000    3.00000000    1.00000000    0    0    0    0    2    0

TCOF1    1369.00000000    1377.00000000    969.00000000    1275.00000000    611.00000000    330.00000000    5680    5988    5520    7483    7086    7913

NSRP1    307.00000000    259.00000000    251.00000000    328.00000000    124.00000000    66.00000000    651    583    574    493    837    511

SPPL3    124.00000000    103.00000000    97.00000000    117.00000000    44.00000000    27.00000000    1099    1252    1203    1349    1256    1260

As you know, here are the raw counts. For the RNAseq read counts, I used the exon reads counts for each gene.

For the Riboseq reads counts, as you know, most of the ribosome profiling reads are mapped to CDS region (Although there are sometimes some reads that are mapped to the 5UTR region). Should I use the exon read counts or CDS region counts for Riboseq? If we use CDS reads count for Ribsoeq and exon read counts for RNAseq, then the counts will be different between Riboseq and RNAseq. Is this ok to do the differential translation efficiency analysis (Translation efficiency=(RPKM of Riboseq)/(RPKM of RNASeq))?