Search
Question: DESeq2 vs Ballgown results
1
gravatar for Christy
4 months ago by
Christy10
United States
Christy10 wrote:

I am using Hisat2 and Stringtie for alignment and assembly of human samples. When I run ballgown on my data, I am getting 29 genes that are significantly differentially expressed, however when I use DESeq2 for the analysis, I get 930 genes that are significantly different (q<0.05). I also compared the top 100 genes sorted by q value for both ballgown and DESeq2, and only 17 genes are the same between the two. 

My question is, why am I getting so few genes showing as significantly different with ballgown? If it were a problem of multiple sampling, shouldn't I have more overlap between DESeq2 and ballgown when they are sorted by q-value? 

In ballgown, I am using a coverage cutoff of 10 to filter my data before running the stat test. I've tried other values and methods for filtering my data in ballgown, and this is giving me the best result so far.

 

Thanks for your help!

 

ADD COMMENTlink modified 3 months ago • written 4 months ago by Christy10

Same problem with my data. Huge difference between DEG reported by Ballgown and DESeq2. I have now shifted from Hisat2>Stringtie>Ballgown to using Hisat2>FeatureCounts>DESeq2. Is this combination good?

ADD REPLYlink written 12 days ago by ag1805x0
1
Ballgown was not really designed for *gene*-level differential expression analysis -- it was written specifically to do *isoform*-level DE. Using DESeq2 with FeatureCounts is a much better-supported operation if your main interests are in gene-level DE. Count-based models like those in DESeq2 are totally appropriate for gene-level DE (whereas Stringtie and Ballgown are tools for when the count-based models *don't* work). So Hisat2 -> FeatureCounts -> DESeq2 would be a great workflow if you don't need isoform-level results. On Thu, Jul 5, 2018 at 12:44 AM ag1805x [bioc] <noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User ag1805x <https: support.bioconductor.org="" u="" 15215=""/> wrote Comment: > DESeq2 vs Ballgown results > <https: support.bioconductor.org="" p="" 107011="" #110717="">: > > Same problem with my data. Huge difference between DEG reported by > Ballgown and DESeq2. I have now shifted from Hisat2>Stringtie>Ballgown to > using Hisat2>FeatureCounts>DESeq2. Is this combination good? > ------------------------------ > > Post tags: ballgown, deseq2 > > You may reply via email or visit > C: DESeq2 vs Ballgown results >
ADD REPLYlink written 11 days ago by Alyssa Frazee190
2
gravatar for Alyssa Frazee
4 months ago by
Alyssa Frazee190
San Francisco, CA, USA
Alyssa Frazee190 wrote:

I am also biased :D and actually in this case I agree with Mike! For *gene*-level analysis, count-based models are probably the way to go. Ballgown and FPKM in general are not really needed or designed for gene-level analysis; their main purpose is *isoform*-level analysis (where count-based modeling approaches are less appropriate). ballgown's estimation of gene-level expression basically tries to "back out" the gene expression based on the FPKMs (cf https://github.com/alyssafrazee/ballgown/blob/master/R/ballgown-expr-methods.R#L177). This is a pretty ad-hoc method for gene expression estimation and it's probably better to just use the counts directly. But consider ballgown for all your transcript-level DE needs!

ADD COMMENTlink written 4 months ago by Alyssa Frazee190

Hi Alyssa,

Even with transcript-level DE analysis using Ballgown, we haven't had much success- many of the transcripts identified as D.E. from ballgown, did not show a clear D.E. when visualizing the raw bam files in IGV. I've described an example in issue #2 here

RNA-seq with hisat2+stringtie+ballgown

I wonder why this is?

Thanks

Nandita

ADD REPLYlink written 7 weeks ago by mnandita0
0
gravatar for Michael Love
4 months ago by
Michael Love18k
United States
Michael Love18k wrote:

Can you plot() the rank() from the two methods and see if they agree in this plot? What are you using to quantify for DESeq2? What counts do you use as input?

ADD COMMENTlink written 4 months ago by Michael Love18k

I am using a gene count matrix generated by stringtie as input for DESeq2. This is generated with the prepDE.py script here https://github.com/gpertea/stringtie/blob/master/prepDE.py.

Can you give me a little more info on how to plot the rank? 

 

Thanks!

ADD REPLYlink written 4 months ago by Christy10

I am suggesting to plot the rank of DESeq2's p-values against the rank of Ballgown's p-values. What you would expect if the methods generally agree on rank is that the small values in the bottom left of the plot would be roughly similar (some disagreement on rank but contained within the top genes). It's ok if the methods disagree on the exact ranking of the genes in the top right of the plot, but you would want some concordance in the bottom left.

ADD REPLYlink written 4 months ago by Michael Love18k

I may be doing something very wrong, because there is very little correlation.

Here is how I made the graph:

 

des <- res_subset_df
bg <- results_genes_subset

des <- des[des$baseMean > 0,]
des <- tibble::rownames_to_column(des, var = 'id')

des <- des[rank(des$pvalue)<5000,]
bg <- bg[rank(bg$pval)<5000,]

des <- des[match(bg$id, des$id, nomatch = 0),]
bg <- bg[match(des$id, bg$id,  nomatch = 0),]

des <- des[order(des$id),]
bg <- bg[order(bg$id),]

plot(rank(des$pvalue), rank(bg$pval), pch = '.',  xlab = "deseq2 rank p",
     ylab = "ballgown rank p",
     main = expression("deseq vs ballgown"))

 

 

To generate the DESeq results I'm doing this:

 

dds <- DESeqDataSetFromMatrix(countData = countData,
            colData = colData, design = ~diagnosis)

#subset to remove unaffected symptomatic

ddssubset <- dds[, colData(dds)$diagnosis == "affected" | colData(dds)$symptoms == "control"]

ddssubset <- DESeq(ddssubset)

res_subset <- results(ddssubset, contrast = c("diagnosis", "affected", "control"))
res_subset <- res_subset[order(res_subset$padj),]
res_subset_df <- data.frame(res_subset)

 

And for ballgown results, this:

bg = ballgown(dataDir = "ballgown", samplePattern = "P", pData=datainfo, 
              meas = "all")

bg_filt <- exprfilter(bg, cutoff = 10, meas = "cov")

#subset to remove unaffected symptomatic

statssubset = subset(bg_filt, "diagnosis != 'symptomatic'", 
                     genomesubset = FALSE)

results_genes_subset = stattest(statssubset, feature="gene",
                         covariate = "diagnosis", getFC = TRUE, 
                         meas = "FPKM")

 

ADD REPLYlink written 4 months ago by Christy10

I see a little concordance but it’s hard to say. DESeq2 models the counts while Ballgown here is modeling the fpkm, so the inputs are a bit different.

ADD REPLYlink written 4 months ago by Michael Love18k

I believe this might be what you are asking for... I've plotted the p-value vs the rank of the p value for my deseq2 and ballgown results. 

 

ADD REPLYlink written 4 months ago by Christy10

Another thing you can do is examine the plotCounts for some of the genes for DESeq2 and see that these seem plausible to you. If they look real but you want to require stronger DE signal you can increase lfcThreshold.

ADD REPLYlink written 4 months ago by Michael Love18k

Ok thanks I’ll try that. But so far, the genes we are getting back with DESeq2 make sense. The main issue is trying to understand why I’m not getting close to the same results with ballgown if what I’m seeing is real. 

ADD REPLYlink written 4 months ago by Christy10

Well I’m biased, but there is an statistical advantage to working on the counts with offsets, instead of the normalized expression (FPKM).

ADD REPLYlink written 4 months ago by Michael Love18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 355 users visited in the last hour