DESeq2 for Ribosomal profiling analysis
3
1
Entering edit mode
jxu006 ▴ 10
@jxu006-9977
Last seen 5.5 years ago

Hi Mike,

I am trying to use DESeq2 to do some analysis on differential translation efficiency analysis using Ribosomal profiling and RNAseq datasets. I have two group of samples (control and treatment) with 3 replicates each group. I have both RNAseq and Ribosome profiling datasets. My analysis goal is to identify a gene list which are genes with significantly changed translation efficiency. Translation efficiency=(RPKM of Riboseq)/(RPKM of RNASeq)

I have compiled the read counts file count_table which is something like this:

    GL1.Riboseq    GL2.Riboseq    GL3.Riboseq    PM1.Riboseq    PM2.Riboseq    PM3.Riboseq    GL1.RNAseq    GL2.RNAseq    GL3.RNAseq    PM1.RNAseq    PM2.RNAseq    PM3.RNAseq

LOC100506869    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0    0    0    0    0    0

LOC102724604    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0    0    0    0    0    0

MTVR2    3.00000000    1.00000000    1.00000000    1.00000000    1.00000000    1.00000000    1    1    3    0    1    1

LOC100506860    0.00000000    2.00000000    0.00000000    2.00000000    0.00000000    0.00000000    5    7    8    7    9    10

LOC102724601    0.00000000    0.00000000    0.00000000    1.00000000    0.00000000    0.00000000    0    1    1    5    1    2

ATRX    468.00000000    451.00000000    352.00000000    463.00000000    157.00000000    65.00000000    555    384    495    350    815    389

LOC441204    0.00000000    1.00000000    2.00000000    1.00000000    3.00000000    1.00000000    0    0    0    0    2    0

TCOF1    1369.00000000    1377.00000000    969.00000000    1275.00000000    611.00000000    330.00000000    5680    5988    5520    7483    7086    7913

NSRP1    307.00000000    259.00000000    251.00000000    328.00000000    124.00000000    66.00000000    651    583    574    493    837    511

SPPL3    124.00000000    103.00000000    97.00000000    117.00000000    44.00000000    27.00000000    1099    1252    1203    1349    1256    1260



I also compiled the sample table sample_table which is something like this:

           condition sampleType

control1.Riboseq        control    riboseq

control2.Riboseq        control    riboseq

control3.Riboseq        control    riboseq

treatment1.Riboseq        treatment    riboseq

treatment2.Riboseq        treatment    riboseq

treatment3.Riboseq        treatment    riboseq

control1.RNAseq          control   rnaseq

control2.RNAseq         control     rnaseq

control3.RNAseq         control     rnaseq

treatment1.RNAseq         treatment     rnaseq

treatment2.RNAseq        treatment     rnaseq

treatment3.RNAseq         treatment     rnaseq

My code:

dds <- DESeqDataSetFromMatrix(countData = count_table, colData = sample_table,design = ~sampleType+condition+sampleType:condition)

dds <- DESeq(dds)

res <- results(dds)

resOrdered <- res[order(res$padj),] write.delim(as.data.frame(resOrdered),file="diffTE_DESeq2.txt",row.names = T)  The result file diffTE_DESeq2.txt shows: "baseMean" "log2FoldChange" "lfcSE" "stat" "pvalue" "padj" "EIF4B" 6332.26385509347 -0.668271435938997 0.0719422396978361 -9.28899960226144 1.55745244593353e-20 1.59950366197374e-16 "RPL7" 18760.8711555484 -0.664037246334803 0.0991791168656775 -6.69533332540294 2.15180579951482e-11 1.10495227805086e-07 "RPS3A" 18548.9009854286 -0.691914863488827 0.105378268654482 -6.56601092733359 5.16810250882598e-11 1.76921375885476e-07 "TPT1" 12666.0153749445 -0.66412280362569 0.104253670496368 -6.3702582409204 1.88710186387301e-10 4.84513403549395e-07 "EEF1B2" 6012.12004960793 -0.533457411938409 0.0859752788772395 -6.2047767556545 5.47746012346194e-10 1.12507030935908e-06 1). Is the analysis correctly done? 2). The log2Foldchange seems to be opposite, which means it shows the log2Foldchange of Control vs Treatment. I want to get the log2Foldchange of Treatment vs Control. How should I revise the code? 3). Also, What does the baseMean mean under this analysis? Thanks, deseq2 • 3.3k views ADD COMMENT 0 Entering edit mode @mikelove Last seen 1 day ago United States Yes the code is correct. Regarding the LFC sign, you should read the section of the vignette, "Note on factor levels", which is also mentioned in the workflow (search for "relevel") vignette("DESeq2") The meaning of a column of the results table can be looked up by asking for the metadata on the columns: mcols(res) ADD COMMENT 0 Entering edit mode Thanks, Mike. I read the "Note on factor levels". I am still a little bit confused. I did not specify the factor levels in my code. Deseq should compare based on the alphabetical order of the levels. That means DEseq should have compared the control and treatment and produced the log2FC of (treatment vs control). But from the results, I can tell the log2FC is control vs treatment, but not treatment vs control. Am I understanding something wrong? My apologize if this is a minor issue. ADD REPLY 0 Entering edit mode Yes, it is "treatment" over "control" from alphabetic order, but it is also using "rnaseq" over "ribosomal", whereas you said you want the opposite (ribo / RNA). You need both ratios to be correct or else the sign will be flipped from what you are expecting. ADD REPLY 0 Entering edit mode Thanks a lot, Mike. I got it. I changed the code and now the log2FC shows correct comparison. I used padj 0.05 as the cut off for significant genes. I noticed that the gene list number decreased from 600 to 500 using padj 0.05 cutoff compared with my previous code. Is this normal? After I corrected the comparison order, the gene number also changed. Is this reasonable? ADD REPLY 0 Entering edit mode Yes, it is expected that there would be less genes at a more stringent FDR threshold. The number would not change after changing the order of the levels of the factors. You should make certain that there was nothing else in your code that you changed in between analyses. ADD REPLY 0 Entering edit mode The code is changed from: dds <- DESeqDataSetFromMatrix(countData = count_table, colData = sample_table,design = ~sampleType+condition+sampleType:condition) dds <- DESeq(dds) res <- results(dds) resOrdered <- res[order(res$padj),]

write.delim(as.data.frame(resOrdered),file="diffTE_DESeq2.txt",row.names = T)

To

dds <- DESeqDataSetFromMatrix(countData = count_table,colData = sample_table,design = ~ sampleType+condition+sampleType:condition)

dds$condition <- factor(dds$condition, levels=c("GL","PM"))

dds$sampleType <- factor(dds$sampleType, levels=c("rnaseq","riboseq"))

dds <- DESeq(dds)

res <- results(dds)

resOrdered <- res[order(res$padj),] write.table(as.data.frame(resOrdered),file="diffTE_DESeq2.txt",row.names = T) I did not change anything else. But the gene number decreased from 600 to 500 using the same padj 0.05 criterion. Do you know why? ADD REPLY 0 Entering edit mode Are you certain that it is different? Can you show the full code and output: table(res1$padj < 0.05)

table(res2$padj < 0.05)  ADD REPLY 0 Entering edit mode From: library(DESeq2) count_table <- read.table("combineCount.txt",sep="\t",header=T,row.names=1) samplesCondition <- c("GL","GL","GL","PM","PM","PM","GL","GL","GL","PM","PM","PM") sampleType <- c(rep("riboseq",6),rep("rnaseq",6)) sample_table <- data.frame(sampleName = colnames(count_table),condition=samplesCondition,sampleType=sampleType,row.names=1) dds <- DESeqDataSetFromMatrix(countData = count_table,colData = sample_table,design = ~ sampleType+condition+sampleType:condition) dds <- DESeq(dds) res <- results(dds) resOrdered <- res[order(res$padj),]

write.delim(as.data.frame(resOrdered),file="diffTE_DESeq2.txt",row.names = T)

To

library(DESeq2)

samplesCondition <- c("GL","GL","GL","PM","PM","PM","GL","GL","GL","PM","PM","PM")

sampleType <- c(rep("riboseq",6),rep("rnaseq",6))

sample_table <- data.frame(sampleName = colnames(count_table),condition=samplesCondition,sampleType=sampleType,row.names=1)

dds <- DESeqDataSetFromMatrix(countData = count_table,colData = sample_table,design = ~ sampleType+condition+sampleType:condition)

dds$condition <- factor(dds$condition, levels=c("GL","PM"))

dds$sampleType <- factor(dds$sampleType, levels=c("rnaseq","riboseq"))

dds <- DESeq(dds)

res <- results(dds)

resOrdered <- res[order(res$padj),] write.table(as.data.frame(resOrdered),file="diffTE_DESeq2.txt",row.names = T) table(res1$padj<0.05)

FALSE  TRUE

9688   582

table(res2$padj<0.05) FALSE TRUE 9777 494 ADD REPLY 0 Entering edit mode Sorry to keep going back and forth, but the point of my question was to see if I can spot a difference (which maybe snuck in accidentally) between your building of res1 and res2. But you copied some code which doesn't have res1 or res2 defined. If you could, just run this code exactly as I have it: dds <- DESeqDataSetFromMatrix(countData = count_table,colData = sample_table,design = ~ sampleType+condition+sampleType:condition) dds1 <- dds dds1 <- DESeq(dds1) res1 <- results(dds1) dds2 <- dds dds2$condition <- factor(dds2$condition, levels=c("GL","PM")) dds2$sampleType <- factor(dds2$sampleType, levels=c("rnaseq","riboseq")) dds2 <- DESeq(dds2) res2 <- results(dds2) table(res1$padj<0.05)
table(res2$padj<0.05) ADD REPLY 0 Entering edit mode I ran it and it shows: > table(res1$padj<0.05)

FALSE  TRUE

9688   582

> table(res2$padj<0.05) FALSE TRUE 9777 494 Do you know what is going on? ADD REPLY 0 Entering edit mode No this is strange, can you send me the dds object via email so I can try to debug? save(dds, file="dds.rda") ADD REPLY 0 Entering edit mode @mikelove Last seen 1 day ago United States hi, I'm adding a new answer, just to have more space to copy my code. Using the current version of DESeq2, I cannot reproduce the discrepancy you see: > library(DESeq2) > packageVersion("DESeq2") [1] ‘1.10.1’ I need this line of code, in order to work with a dds built under certain older versions of DESeq2: dds <- updateObject(dds) Then: dds1 <- dds dds1 <- DESeq(dds1) res1 <- results(dds1) dds2 <- dds dds2$condition <- factor(dds2$condition, levels=c("GL","PM")) dds2$sampleType <- factor(dds2$sampleType, levels=c("rnaseq","riboseq")) dds2 <- DESeq(dds2) res2 <- results(dds2) Gives me: > table(res1$padj<0.05, res2\$padj<0.05)

FALSE TRUE
FALSE  9999    0
TRUE      0  444


The stat, pvalue and padj from res1 and res2 fall on y=x if you plot them, and have on the order of 1e-6 numerical difference in values.

https://cran.r-project.org/

Then install the latest DESeq2 with:

source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")

Note that this will replace your current Bioconductor installation with the latest version, but you should always be performing any new analyses with the latest version of Bioconductor. For example, sometimes I see DESeq2 version 1.0 or 1.2 floating around, but these early releases were from 2013 and are very much out of date now. There has been 3 years of bug fixes and development to converge to the stable version of DESeq2 that is currently available (few algorithmic changes since 1.4, which is the version that was used in the DESeq2 publication in 2014).

0
Entering edit mode

Thanks. I have updated my R to R 3.2.4. Then I did:

source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")

But it installed the DESeq2 1.2.10 version. I tried a couple of times, it only installed the 1.2.10.

Do you know why?

0
Entering edit mode

The message is below:

> source("http://bioconductor.org/biocLite.R")

Bioconductor version 2.13 (BiocInstaller 1.12.1), ?biocLite for help

A newer version of Bioconductor is available for this version of R,

> biocLite("DESeq2")

BioC_mirror: http://bioconductor.org

Using Bioconductor version 2.13 (BiocInstaller 1.12.1), R version 3.2.4.

Temporarily using Bioconductor version 2.13

Installing package(s) 'DESeq2'

Content type 'application/x-gzip' length 1399856 bytes (1.3 MB)

0
Entering edit mode
0
Entering edit mode
Sorry, I missed your second post, you should follow the information trail printed after source()
0
Entering edit mode

Thanks, Fixed.

For my analysis of identification of differential translational efficiencies genes, do you think it is useful to pre-filtering the low counts genes before running DESeq2?

0
Entering edit mode
There have been some very recent discussions on this exact question on the site, if you just look up recent posts tagged DESeq2.
0
Entering edit mode

I notice that you posted another post:

DESeq2 testing ratio of ratios (RIP-Seq, CLIP-Seq, ribosomal profiling)

In your post, you said that you used the code：

dds <- DESeqDataSet(se, design= ~ assay + condition + assay:condition)
dds <- DESeq(dds, test="LRT", reduced= ~ assay + condition)
results(dds)

For my code,

dds <- DESeq(dds)

I did not used the parameter test="LRT", recused=~assay+condition.

Is this a problem?

Thanks,

0
Entering edit mode
Not a problem. You can use either the Wald or likelihood ratio test here. They are different tests but will give similar results here. It is your choice as analyst which to perform. See the vignette.
0
Entering edit mode
jxu006 ▴ 10
@jxu006-9977
Last seen 5.5 years ago

Hi Mike,

I have another question:

For the ribosome profiling and RNAseq analysis I mentioned above, the count table is:

GL1.Riboseq    GL2.Riboseq    GL3.Riboseq    PM1.Riboseq    PM2.Riboseq    PM3.Riboseq    GL1.RNAseq    GL2.RNAseq    GL3.RNAseq    PM1.RNAseq    PM2.RNAseq    PM3.RNAseq

LOC100506869    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0    0    0    0    0    0

LOC102724604    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0    0    0    0    0    0

MTVR2    3.00000000    1.00000000    1.00000000    1.00000000    1.00000000    1.00000000    1    1    3    0    1    1

LOC100506860    0.00000000    2.00000000    0.00000000    2.00000000    0.00000000    0.00000000    5    7    8    7    9    10

LOC102724601    0.00000000    0.00000000    0.00000000    1.00000000    0.00000000    0.00000000    0    1    1    5    1    2

ATRX    468.00000000    451.00000000    352.00000000    463.00000000    157.00000000    65.00000000    555    384    495    350    815    389

LOC441204    0.00000000    1.00000000    2.00000000    1.00000000    3.00000000    1.00000000    0    0    0    0    2    0

TCOF1    1369.00000000    1377.00000000    969.00000000    1275.00000000    611.00000000    330.00000000    5680    5988    5520    7483    7086    7913

NSRP1    307.00000000    259.00000000    251.00000000    328.00000000    124.00000000    66.00000000    651    583    574    493    837    511

SPPL3    124.00000000    103.00000000    97.00000000    117.00000000    44.00000000    27.00000000    1099    1252    1203    1349    1256    1260

As you know, here are the raw counts. For the RNAseq read counts, I used the exon reads counts for each gene.

For the Riboseq reads counts, as you know, most of the ribosome profiling reads are mapped to CDS region (Although there are sometimes some reads that are mapped to the 5UTR region). Should I use the exon read counts or CDS region counts for Riboseq? If we use CDS reads count for Ribsoeq and exon read counts for RNAseq, then the counts will be different between Riboseq and RNAseq. Is this ok to do the differential translation efficiency analysis (Translation efficiency=(RPKM of Riboseq)/(RPKM of RNASeq))?

0
Entering edit mode

I'm not sure what is the best approach. The assay term will absorb some of the differences.

It's up to you to decide the best approach. Why not read the Riboseq literature, e.g. http://biorxiv.org/content/early/2016/01/11/017111

Remember, with DESeq2 you are performing tests on counts (including size factors, model terms etc to account for technical differences across samples) but definitely not testing on RPKM.