Low number of replicates DESeq

0

Entering edit mode

Federico Gaiti ▴ 130

@federico-gaiti-6419

Last seen 9.6 years ago

Hi all, I am using DESEq for a DGE analysis. I have STRANDED RNA-Seq data for 4 developmental stages with no replicates but I know that to have a more reliable DGE I should have replicates. So I got (from another lab member) UNSTRANDED RNA-Seq data with 3 replicates per stage. So my data situation at the moment is: STAGE 1 stranded STAGE 1.1 unstranded STAGE 1.2 unstranded STAGE 1.3 unstranded STAGE 2 stranded STAGE 2.1 unstranded STAGE 2.2 unstranded STAGE 2.3 unstranded STAGE 3 stranded STAGE 3.1 unstranded STAGE 3.2 unstranded STAGE 3.3 unstranded STAGE 4 stranded STAGE 4.1 unstranded STAGE 4.2 unstranded STAGE 4.3 unstranded Before doing a DGE, I thought to test the correlation between these samples, just to show that similar samples cluster together. If so, I thought to use the unstranded data for my DGE analysis to reach the final number of 4 replicates per stage. I mapped the raw reads to the genome using TOPHAT (v2.0.9) (fr- unstranded for unstranded data and fr-secondstrand for stranded data), used htseq-count (HTSeq 0.5.4p5) to get the raw reads counts for both the data. For the stranded data I used the option -s yes and for the unstranded data I used -s no. I then used DESeq (v1.14.0) to include metadata and for normalization, and I removed the genes that always have a 0 value. I then calcualted the correlation which was really low. I tried to use the option -s reverse for the stranded data and still got really low correlation. So I reran htseq-count on the stranded data selecting the option -s no and in this way I got a very similar number of total counts between the unstranded and stranded data, around 4-5M counts each stage (while both cases before the stranded ones were double in number). I included the metadata > Design condition ADULT ADULT ADULT1 ADULT ADULT2 ADULT ADULT3 ADULT JUV JUV JUV1 JUV JUV2 JUV JUV3 JUV COMP COMP COMP1 COMP COMP2 COMP COMP3 COMP PRECOMP PRECOMP PRECOMP1 PRECOMP PRECOMP2 PRECOMP PRECOMP3 PRECOMP and estimated the new size factors, normalized and calculated the new correlation. Pearson performed pretty well, confirmed by both a PCA and correlogram. So my initial idea was to do a DGE "treating" the stranded data as unstranded, having 4 replicates per stage. Though, I'd still like to figure out a way to use the stranded counts since I am not sure if I am losing some information running htseq-count using -s no on the stranded data. What I had in mind was using unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values. Not sure if I can do that though given one is stranded and the other is not. I would like to hear from you if you have any thoughts about this. Let me know if you need any further details to better understand the issue. Thanks in advance, Federico Federico Gaiti Ph.D. Candidate School of Biological Sciences University of Queensland St Lucia QLD 4072 Australia f.gaiti@uq.edu.au [[alternative HTML version deleted]]

Normalization DESeq Normalization DESeq • 2.4k views

ADD COMMENT • link updated 10.2 years ago by Steve Lianoglou ★ 13k • written 10.2 years ago by Federico Gaiti ▴ 130

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi, Since you are just starting your analysis and are in the world of DESeq, you should probably switch to DESeq2. You mention things about "low correlation" but it's not clear what conditions you are comparing where. Instead of describing your analysis at a high level, showing the code that you used would be more helpful. That having been said, the first thing I would do is to perform the steps outlined in the DESeq2 vignette under the section "Data quality assessment by sample clustering and visualization" to see if your replicate data cluster closely together in meaningful ways using the heatmaps and PCA plots outlined there. HTH, -steve On Mon, Feb 24, 2014 at 3:31 PM, Federico Gaiti <f.gaiti at="" uq.edu.au=""> wrote: > Hi all, > > I am using DESEq for a DGE analysis. > > I have STRANDED RNA-Seq data for 4 developmental stages with no replicates but I know that to have a more reliable DGE I should have replicates. So I got (from another lab member) UNSTRANDED RNA-Seq data with 3 replicates per stage. > > So my data situation at the moment is: > > STAGE 1 stranded > STAGE 1.1 unstranded > STAGE 1.2 unstranded > STAGE 1.3 unstranded > STAGE 2 stranded > STAGE 2.1 unstranded > STAGE 2.2 unstranded > STAGE 2.3 unstranded > STAGE 3 stranded > STAGE 3.1 unstranded > STAGE 3.2 unstranded > STAGE 3.3 unstranded > STAGE 4 stranded > STAGE 4.1 unstranded > STAGE 4.2 unstranded > STAGE 4.3 unstranded > > Before doing a DGE, I thought to test the correlation between these samples, just to show that similar samples "cluster" together. If so, I thought to use the unstranded data for my DGE analysis to reach the final number of 4 replicates per stage. > > I mapped the raw reads to the genome using TOPHAT (v2.0.9) (fr- unstranded for unstranded data and fr-secondstrand for stranded data), used htseq-count (HTSeq 0.5.4p5) to get the raw reads counts for both the data. For the stranded data I used the option -s yes and for the unstranded data I used -s no. I then used DESeq (v1.14.0) to include metadata and for normalization, and I removed the genes that always have a 0 value. I then calcualted the correlation which was really low. > > I tried to use the option -s reverse for the stranded data and still got really low correlation. So I reran htseq-count on the stranded data selecting the option -s no and in this way I got a very similar number of total counts between the unstranded and stranded data, around 4-5M counts each stage (while both cases before the stranded ones were double in number). > > I included the metadata > > >> Design > condition > ADULT ADULT > ADULT1 ADULT > ADULT2 ADULT > ADULT3 ADULT > JUV JUV > JUV1 JUV > JUV2 JUV > JUV3 JUV > COMP COMP > COMP1 COMP > COMP2 COMP > COMP3 COMP > PRECOMP PRECOMP > PRECOMP1 PRECOMP > PRECOMP2 PRECOMP > PRECOMP3 PRECOMP > > and estimated the new size factors, normalized and calculated the new correlation. Pearson performed pretty well, confirmed by both a PCA and correlogram. So my initial idea was to do a DGE "treating" the stranded data as unstranded, having 4 replicates per stage. Though, I'd still like to figure out a way to use the stranded counts since I am not sure if I am losing some information running htseq-count using -s no on the stranded data. > > > What I had in mind was using unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values. Not sure if I can do that though given one is stranded and the other is not. > > > I would like to hear from you if you have any thoughts about this. > > > Let me know if you need any further details to better understand the issue. > > > Thanks in advance, > > Federico > > Federico Gaiti > Ph.D. Candidate > School of Biological Sciences > University of Queensland > St Lucia QLD 4072 > Australia > f.gaiti at uq.edu.au > > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Steve Lianoglou Computational Biologist Genentech

ADD COMMENT • link 10.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Thanks Steve -- I'll have a look at the DESeq2 vignette. Here is what I've done, let me know if it's still unclear samtools view TOPHAT_STRANDED_sorted.bam | python -m HTSeq.scripts.count -s no - gtfFile > stranded_counts_noS.txt head(Counts) 1.1 1.2 1.3 1.4 2.1 2.2 2.3 2.4 3.1 3.2 3.3 3.4 4.1 4.2 4.3 4.4 XXXX 9 0 24 48 30 5 1 1 21 15 8 6 28 28 27 47 XXXX 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 XXXX 16 0 0 0 19 4 0 0 40 1 2 3 78 5 5 7 XXXX 0 8 0 7 5 5 19 7 14 4 4 7 9 4 12 1 ...... and then in R: data<-read.table("Counts",header=TRUE,row.names=1) head(data) Design=data.frame( row.names=colnames(data), condition=c('1','1','1','1','2','2','2','2','3','3','3','3','4','4' ,'4','4')) cds=newCountDataSet(data,Design) cds=estimateSizeFactors(cds) sizeFactors(cds) All_samples_normalized<-counts(cds,normalized=TRUE) table(rowSums(All_samples_normalized)==0) data<-All_samples_normalized[rowSums(All_samples_normalized)>0,] data_subset_matrix<-as.matrix(data) Spearman_normalized<-rcorr(data_subset_matrix,type="spearman") A<-Spearman_normalized$r write.table(A,file="Spearman_normalized_allsamples.txt") Pearson_normalized<-rcorr(data_subset_matrix,type="pearson") B<-Pearson_normalized$r write.table(B,file="Pearson_normalized_allsamples.txt") A_matrix<-as.matrix(A) B_matrix<-as.matrix(B) corrgram(B_matrix,order="PCA") corrgram(A_matrix,order="PCA") pca3=PCA(A_matrix,graph=TRUE) pca3=PCA(B_matrix,graph=TRUE) This gave me a nice correlation but I'd still like to to use the stranded counts for the DGE. The issue is that if I only use stranded data I don't have replicates. If I include the 3 unstranded replicates I need to use the option -s no for the stranded data because otherwise stranded and unstranded do not correlate. So my ideas was to use the unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values. Would it be possible? Or would it be better to stick to the -s no option for the DGE? Thanks Federico ________________________________________ From: mailinglist.honeypot@gmail.com [mailinglist.honeypot@gmail.com] on behalf of Steve Lianoglou [lianoglou.steve@gene.com] Sent: Tuesday, 25 February 2014 10:34 AM To: Federico Gaiti Cc: bioconductor at r-project.org Subject: Re: [BioC] Low number of replicates DESeq Hi, Since you are just starting your analysis and are in the world of DESeq, you should probably switch to DESeq2. You mention things about "low correlation" but it's not clear what conditions you are comparing where. Instead of describing your analysis at a high level, showing the code that you used would be more helpful. That having been said, the first thing I would do is to perform the steps outlined in the DESeq2 vignette under the section "Data quality assessment by sample clustering and visualization" to see if your replicate data cluster closely together in meaningful ways using the heatmaps and PCA plots outlined there. HTH, -steve On Mon, Feb 24, 2014 at 3:31 PM, Federico Gaiti <f.gaiti at="" uq.edu.au=""> wrote: > Hi all, > > I am using DESEq for a DGE analysis. > > I have STRANDED RNA-Seq data for 4 developmental stages with no replicates but I know that to have a more reliable DGE I should have replicates. So I got (from another lab member) UNSTRANDED RNA-Seq data with 3 replicates per stage. > > So my data situation at the moment is: > > STAGE 1 stranded > STAGE 1.1 unstranded > STAGE 1.2 unstranded > STAGE 1.3 unstranded > STAGE 2 stranded > STAGE 2.1 unstranded > STAGE 2.2 unstranded > STAGE 2.3 unstranded > STAGE 3 stranded > STAGE 3.1 unstranded > STAGE 3.2 unstranded > STAGE 3.3 unstranded > STAGE 4 stranded > STAGE 4.1 unstranded > STAGE 4.2 unstranded > STAGE 4.3 unstranded > > Before doing a DGE, I thought to test the correlation between these samples, just to show that similar samples "cluster" together. If so, I thought to use the unstranded data for my DGE analysis to reach the final number of 4 replicates per stage. > > I mapped the raw reads to the genome using TOPHAT (v2.0.9) (fr- unstranded for unstranded data and fr-secondstrand for stranded data), used htseq-count (HTSeq 0.5.4p5) to get the raw reads counts for both the data. For the stranded data I used the option -s yes and for the unstranded data I used -s no. I then used DESeq (v1.14.0) to include metadata and for normalization, and I removed the genes that always have a 0 value. I then calcualted the correlation which was really low. > > I tried to use the option -s reverse for the stranded data and still got really low correlation. So I reran htseq-count on the stranded data selecting the option -s no and in this way I got a very similar number of total counts between the unstranded and stranded data, around 4-5M counts each stage (while both cases before the stranded ones were double in number). > > I included the metadata > > >> Design > condition > ADULT ADULT > ADULT1 ADULT > ADULT2 ADULT > ADULT3 ADULT > JUV JUV > JUV1 JUV > JUV2 JUV > JUV3 JUV > COMP COMP > COMP1 COMP > COMP2 COMP > COMP3 COMP > PRECOMP PRECOMP > PRECOMP1 PRECOMP > PRECOMP2 PRECOMP > PRECOMP3 PRECOMP > > and estimated the new size factors, normalized and calculated the new correlation. Pearson performed pretty well, confirmed by both a PCA and correlogram. So my initial idea was to do a DGE "treating" the stranded data as unstranded, having 4 replicates per stage. Though, I'd still like to figure out a way to use the stranded counts since I am not sure if I am losing some information running htseq-count using -s no on the stranded data. > > > What I had in mind was using unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values. Not sure if I can do that though given one is stranded and the other is not. > > > I would like to hear from you if you have any thoughts about this. > > > Let me know if you need any further details to better understand the issue. > > > Thanks in advance, > > Federico > > Federico Gaiti > Ph.D. Candidate > School of Biological Sciences > University of Queensland > St Lucia QLD 4072 > Australia > f.gaiti at uq.edu.au > > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Steve Lianoglou Computational Biologist Genentech

ADD REPLY • link 10.2 years ago Federico Gaiti ▴ 130

0

Entering edit mode

Hi, On Mon, Feb 24, 2014 at 5:18 PM, Federico Gaiti <f.gaiti at="" uq.edu.au=""> wrote: > Thanks Steve -- I'll have a look at the DESeq2 vignette. > > Here is what I've done, let me know if it's still unclear > > samtools view TOPHAT_STRANDED_sorted.bam | python -m HTSeq.scripts.count -s no - gtfFile > stranded_counts_noS.txt > > head(Counts) > > 1.1 1.2 1.3 1.4 2.1 2.2 2.3 2.4 3.1 3.2 3.3 3.4 4.1 4.2 4.3 4.4 > > XXXX 9 0 24 48 30 5 1 1 21 15 8 6 28 28 27 47 > > XXXX 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > > XXXX 16 0 0 0 19 4 0 0 40 1 2 3 78 5 5 7 > > XXXX 0 8 0 7 5 5 19 7 14 4 4 7 9 4 12 1 > ...... > > and then in R: > > data<-read.table("Counts",header=TRUE,row.names=1) > head(data) > Design=data.frame( > row.names=colnames(data), > condition=c('1','1','1','1','2','2','2','2','3','3','3','3','4',' 4','4','4')) > cds=newCountDataSet(data,Design) > cds=estimateSizeFactors(cds) > sizeFactors(cds) > All_samples_normalized<-counts(cds,normalized=TRUE) > table(rowSums(All_samples_normalized)==0) > data<-All_samples_normalized[rowSums(All_samples_normalized)>0,] > data_subset_matrix<-as.matrix(data) > Spearman_normalized<-rcorr(data_subset_matrix,type="spearman") > A<-Spearman_normalized$r > write.table(A,file="Spearman_normalized_allsamples.txt") > Pearson_normalized<-rcorr(data_subset_matrix,type="pearson") > B<-Pearson_normalized$r > write.table(B,file="Pearson_normalized_allsamples.txt") > A_matrix<-as.matrix(A) > B_matrix<-as.matrix(B) > corrgram(B_matrix,order="PCA") > corrgram(A_matrix,order="PCA") > pca3=PCA(A_matrix,graph=TRUE) > pca3=PCA(B_matrix,graph=TRUE) > > This gave me a nice correlation but I'd still like to to use the stranded counts for the DGE. Thanks for pasting the code now -- without the images or results from your output, it's hard to see what is reasonably correlated or not, but let's continue ... I see you're already doing PCA, and it would be interesting to see how your results compare to what you get by following along with the DESeq2 vignette and shooting your data through one of the variance stabilizing transforms (incidentally, the DESeq vignette also has a similar section, and it would have been worth while to browse that vignette and follow the advice outlined there as well) > The issue is that if I only use stranded data I don't have replicates. If I include the 3 unstranded replicates I need to use the option -s no for the stranded data because otherwise stranded and unstranded do not correlate. Are you saying that if you use the stranded data but do the counting as if it were unstranded, it looks "just fine," but counting the stranded data as stranded data makes it look quite bad? Interesting ....I've actually never used stranded RNAseq data before, but that's a bit surprising ... What if you only consider a subset of genes that are in regions that do not have annotated genes on the opposite strands -- the data for these genes from the stranded and unstranded should be quite similar, no? How do these look? > So my ideas was to use the unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values. But why? If you're just doing "vanilla DGE" and you have data that you feel like you can use (the unstranded data) in triplicate, and a source of singlicate data (the stranded one), why do you want to torture some DGE analysis by cramming the stranded data through it assuming <something> from the unstranded. If you really want to use the stranded data, I'd encode the "library type" as a covariate in the linear model while doing differential expression. Both the DESeq and DESeq2 vignettes have a section on multi-factorial designs where they incorporate data from single and paired-end runs to perform a DGE analysis -- the way you would combine unstranded and stranded counts for a gene would be rather similar. > Would it be possible? Or would it be better to stick to the -s no option for the DGE? To be honest, if I had a set of data that had three replicates per condition, and one set of data that had "singlicate" samples in a condition which came from an apparently wildly different protocol, I'd probably just use the triplicate data and forget about the outlier data if I'm just doing "vanilla DGE". If the stranded data is giving you wildly different results, the simplest explanation might be that something may have gotten screwed up in the library prep -- but you have no way to tell since you have no replicates of those data. As I said, you can include it in your linear model by using a multi-factorial design if you *really* want to use it, but why bother for a vanilla DGE analysis? On the other hand, if you're using the stranded data to do something more than just vanilla DGE -- perhaps you're interested in quantifying anti-sense transcription, then I'd try to do a lot more exploratory analysis to see how I could explain large difference in quantitation that I'm seeing. I'd still be really pissed that I had no replicates, but it wouldn't be the first time a graduate student is stuck with an underpowered dataset that they are tasked to torture, and it sadly won't be the last ... still, in this situation, I'd go the multi-factorial design approach to combine both datasets for any analysis I'm asked to do. In the meantime, I'd also complain to my advisor/collaborator whatever to twist their arm enough to run replicates for the future. HTH, -steve -- Steve Lianoglou Computational Biologist Genentech

ADD REPLY • link 10.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve, Please see my comments below Thanks ________________________________________ >Thanks for pasting the code now -- without the images or results from >your output, it's hard to see what is reasonably correlated or not, >but let's continue ... >I see you're already doing PCA, and it would be interesting to see how >your results compare to what you get by following along with the >DESeq2 vignette and shooting your data through one of the variance >stabilizing transforms (incidentally, the DESeq vignette also has a >similar section, and it would have been worth while to browse that >vignette and follow the advice outlined there as well) This is exactly what I've done. I started doing a simple correlation anylsis and then shot my data first through DESeq and then through DESeq2. You can find atttached some PCA plots and heatmaps when I used the option -s no for the stranded data In DESeq2: CountTable=read.table("DGE.txt",header=TRUE,row.names=1) samples<-data.frame(row.names=c("ADULT","ADULT1","ADULT2","ADULT3","JU V","JUV1","JUV2","JUV3","COMP","COMP1","COMP2","COMP3","PRECOMP","PREC OMP1","PRECOMP2","PRECOMP3"),condition=as.factor(c(rep("ADULT",4),rep( "JUV",4),rep("COMP",4),rep("PRECOMP",4)))) dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=samples,desig n=~condition) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) dds<-DESeq(dds) rld <- rlogTransformation(dds, blind=TRUE) rlogMat <- assay(rld) vstMat <- assay(vsd) vsd <- varianceStabilizingTransformation(dds, blind=TRUE) px <- counts(dds)[,1] / sizeFactors(dds)[1] ord <- order(px) ord <- ord[px[ord] < 150] ord <- ord[seq(1, length(ord), length=50)] last <- ord[length(ord)] vstcol <- c("blue", "black") matplot(px[ord],cbind(assay(vsd)[, 1], log2(px))[ord, ],type="l", lty=1, col=vstcol, xlab="n", ylab="f(n)") legend("bottomright",legend = c(expression("variance stabilizing transformation"),expression(log[2](n/s[1]))),fill=vstcol) library("vsn") par(mfrow=c(1,3)) notAllZero <- (rowSums(counts(dds))>0) meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1),ylim = c(0,2.5)) meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5)) meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5)) library("RColorBrewer") library("gplots") select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100] hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10,6)) heatmap.2(assay(rld)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10, 6)) heatmap.2(assay(vsd)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10, 6)) distsRL <- dist(t(assay(rld))) mat <- as.matrix(distsRL) rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition)) heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) print(plotPCA(rld, intgroup=c("condition"))) In DESeq: data<-read.table("Counts",header=TRUE,row.names=1) Design=data.frame(row.names=colnames(data),condition=c('1','1','1','1' ,'2','2','2','2','3','3','3','3','4','4','4','4')) cds=newCountDataSet(data,Design) cds=estimateSizeFactors(cds) cdsBlind=estimateDispersions(cds,method='blind') vsd=varianceStabilizingTransformation(cdsBlind) library("RColorBrewer") library("gplots") select = order(rowMeans(counts(cdsBlind)), decreasing=TRUE)[1:30] hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100) heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6)) heatmap.2(counts(cdsBlind)[select,], col = hmcol, trace="none", margin=c(10,6)) dists = dist( t( exprs(vsd) ) ) mat = as.matrix( dists ) rownames(mat) = colnames(mat) = with(pData(cdsBlind), paste(condition)) heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) print(plotPCA(vsd, intgroup=c("condition"))) >Are you saying that if you use the stranded data but do the counting >as if it were unstranded, it looks "just fine," but counting the >stranded data as stranded data makes it look quite bad? >Interesting ....I've actually never used stranded RNAseq data before, >but that's a bit surprising ... Exactly. it looks way better than counting the stranded data as stranded data. >What if you only consider a subset of genes that are in regions that >do not have annotated genes on the opposite strands -- the data for >these genes from the stranded and unstranded should be quite similar, >no? How do these look? I'll have a look, good idea! >But why? If you're just doing "vanilla DGE" and you have data that you >feel like you can use (the unstranded data) in triplicate, and a >source of singlicate data (the stranded one), why do you want to >torture some DGE analysis by cramming the stranded data through it >assuming <something> from the unstranded. >If you really want to use the stranded data, I'd encode the "library >type" as a covariate in the linear model while doing differential >expression. Both the DESeq and DESeq2 vignettes have a section on >multi-factorial designs where they incorporate data from single and >paired-end runs to perform a DGE analysis -- the way you would combine >unstranded and stranded counts for a gene would be rather similar. I followed the multifactorial design section in both DESeq and DESeq2 vignette and then checked the quality by sample clustering and visualization. In DESeq: CountTable=read.table("DGE.txt",header=TRUE,row.names=1) head(CountTable) Design=data.frame(row.names=colnames(CountTable),condition=c("ADULT"," ADULT","ADULT","ADULT","JUV","JUV","JUV","JUV","COMP","COMP","COMP","C OMP","PRECOMP","PRECOMP","PRECOMP","PRECOMP"),libType=c("stranded","un stranded","unstranded","unstranded","stranded","unstranded","unstrande d","unstranded","stranded","unstranded","unstranded","unstranded","str anded","unstranded","unstranded","unstranded")) cdsFull=newCountDataSet(CountTable,Design) head(cdsFull) cdsFull=estimateSizeFactors(cdsFull) sizeFactors(cdsFull) cdsFull=estimateDispersions(cdsFull) plotDispEsts(cdsFull) fit1=fitNbinomGLMs(cdsFull,count ~ libType + condition) str(fit1) fit0=fitNbinomGLMs(cdsFull,count ~ libType ) pvalsGLM = nbinomGLMTest( fit1, fit0 ) padjGLM = p.adjust( pvalsGLM, method="BH" ) head(fit1) cdsFullB=newCountDataSet(CountTable,Design$condition) cdsFullB=estimateSizeFactors(cdsFullB) cdsFullB=estimateDispersions(cdsFullB) rs = rowSums ( counts ( cdsFull )) theta = 0.4 use = (rs > quantile(rs, probs=theta)) table(use) cdsFilt = cdsFull[ use, ] fitFilt1 = fitNbinomGLMs( cdsFilt, count ~ libType + condition ) fitFilt0 = fitNbinomGLMs( cdsFilt, count ~ libType ) pvalsFilt = nbinomGLMTest( fitFilt1, fitFilt0 ) padjFilt = p.adjust(pvalsFilt, method="BH" ) padjFiltForComparison = rep(+Inf, length(padjGLM)) padjFiltForComparison[use] = padjFilt tab3 = table( `no filtering` = padjGLM < .05,`with filtering` = padjFiltForComparison < .05 ) addmargins(tab3) res=cbind(fit1,pvalsGLM=pvalsGLM,padjGLM=padjGLM) resSig=res[which(res$padjGLM<0.05),] table(res$padjGLM<0.05) write.csv(res,file='res_Multifactor_DESeq.csv') hist(res$pvalsGLM,breaks=100) resFilt=cbind(fitFilt1,pvalsFilt=pvalsFilt,padjFilt=padjFilt) resSigFilt=resFilt[which(resFilt$padjFilt<0.05),] table(resFilt$padjFilt<0.05) write.csv(resFilt,file='resFilt_Multifactor_DESeq.csv') plot(rank(rs)/length(rs), -log10(pvalsGLM), pch=16, cex=0.45) h1 = hist(pvalsGLM[!use], breaks=50, plot=FALSE) h2 = hist(pvalsGLM[use], breaks=50, plot=FALSE) colori = c(`do not pass`="khaki", `pass`="powderblue") barplot(height = rbind(h1$counts, h2$counts), beside = FALSE, col = colori,space = 0, main = "", ylab="frequency") text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)), adj = c(0.5,1.7), xpd=NA) legend("topright", fill=rev(colori), legend=rev(names(colori))) cdsFullBlind = estimateDispersions( cdsFull, method = "blind" ) vsdFull = varianceStabilizingTransformation( cdsFullBlind ) select = order(rowMeans(counts(cdsFull)), decreasing=TRUE)[1:100] hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100) heatmap.2(exprs(vsdFull)[select,], col = hmcol, trace="none", margin=c(10, 6)) heatmap.2(counts(cdsFull)[select,], col = hmcol, trace="none", margin=c(10,6)) dists = dist( t( exprs(vsdFull) ) ) mat = as.matrix( dists ) rownames(mat) = colnames(mat) = with(pData(cdsFullBlind), paste(condition, libType, sep=" : ")) heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) print(plotPCA(vsdFull, intgroup=c("condition", "libType"))) In DESeq2: CountTable=read.table("DGE.txt",header=TRUE,row.names=1) head(CountTable) Design=data.frame(row.names=colnames(CountTable),condition=c("ADULT"," ADULT","ADULT","ADULT","JUV","JUV","JUV","JUV","COMP","COMP","COMP","C OMP","PRECOMP","PRECOMP","PRECOMP","PRECOMP"),libType=c("stranded","un stranded","unstranded","unstranded","stranded","unstranded","unstrande d","unstranded","stranded","unstranded","unstranded","unstranded","str anded","unstranded","unstranded","unstranded")) dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~condition + libType) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) design(dds)<-formula(~libType + condition ) dds<-DESeq(dds) rld <- rlogTransformation(dds, blind=TRUE) vsd <- varianceStabilizingTransformation(dds, blind=TRUE) vstMat <- assay(vsd) rlogMat <- assay(rld) px <- counts(dds)[,1] / sizeFactors(dds)[1] ord <- order(px) ord <- ord[px[ord] < 150] ord <- ord[seq(1, length(ord), length=50)] last <- ord[length(ord)] vstcol <- c("blue", "black") matplot(px[ord], cbind(assay(vsd)[, 1], log2(px))[ord, ], type="l", lty=1, col=vstcol, xlab="n", ylab="f(n)") legend("bottomright", legend = c(expression("variance stabilizing transformation"), expression(log[2](n/s[1]))),fill=vstcol) library("vsn") par(mfrow=c(1,3)) notAllZero <- (rowSums(counts(dds))>0) meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1), ylim = c(0,2.5)) meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5)) meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5)) select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100] hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6)) heatmap.2(assay(rld)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10, 6)) heatmap.2(assay(vsd)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10, 6)) distsRL <- dist(t(assay(rld))) select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:30] hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6)) distsRL <- dist(t(assay(rld))) mat <- as.matrix(distsRL) rownames(mat) <- colnames(mat) <- with(colData(dds), paste(condition)) rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition, libType, sep=" : ")) heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) print(plotPCA(rld, intgroup=c("condition", "libType"))) See attached the plots I obtained (*_MULTIFACTORIAL.jpeg) Let me know what you think >To be honest, if I had a set of data that had three replicates per >condition, and one set of data that had "singlicate" samples in a >condition which came from an apparently wildly different protocol, I'd >probably just use the triplicate data and forget about the outlier >data if I'm just doing "vanilla DGE". >If the stranded data is giving you wildly different results, the >simplest explanation might be that something may have gotten screwed >up in the library prep -- but you have no way to tell since you have >no replicates of those data. >As I said, you can include it in your linear model by using a >multi-factorial design if you *really* want to use it, but why bother >for a vanilla DGE analysis? >On the other hand, if you're using the stranded data to do something >more than just vanilla DGE -- perhaps you're interested in quantifying >anti-sense transcription, then I'd try to do a lot more exploratory >analysis to see how I could explain large difference in quantitation >that I'm seeing. I'd still be really pissed that I had no replicates, >but it wouldn't be the first time a graduate student is stuck with an >underpowered dataset that they are tasked to torture, and it sadly >won't be the last ... still, in this situation, I'd go the >multi-factorial design approach to combine both datasets for any >analysis I'm asked to do. In the meantime, I'd also complain to my >advisor/collaborator whatever to twist their arm enough to run >replicates for the future. You got the situation quite right. I agree with you, I wouldn't bother using the stranded data if I was doing a vanilla DGE but I am investigating long non-coding RNAs and so I need to have anti-sense transcription quantification. This is reason that brought me to include the untranded data in the analysis, to increase the number of replicates and test for DE to obtain a set of lncRNAs I can handle for downstream analysis. Let me know if you have any further idea Thanks a lot for help Federico

ADD REPLY • link 10.2 years ago Federico Gaiti ▴ 130

0

Entering edit mode

Hi, In the future, you don't have to include all of the code you've ever written :-) For instance, you have code to write tables, do DGE, do a mean vs. sd plot, etc. but it's not really germane to this conversation. Also, your attachments didn't make it through to the list, but since I was cc'd on the email, I got them in my inbox. You are providing diagnostic plots (PCA and heatmap) from a mult-factorial design, but you should note that the fact that you encode the library type in the design will not change the output of these plots. That is to say that the "libType" hasn't been corrected for in the variance stabilized data. It is only accounted for in your results form the DGE analysis. Anyway: the PCA plot from the vst transformed data (your DESeq1 pca plot) seems to suggest that your conditions are lining up pretty reasonably along PC2 (vertically), while PC1 simply separates your samples by libType -- would you agree with that assessment? (Forget the pca from the rlog transform -- it seems the VST is doing better here). If this is the case, then encoding the libType as a main effect in your model (as you've done) should go a long ways in dealing with this issue for you. How can you tell? One way is to do a DGE analysis with just the unstranded data, then do it again with the unstranded data AND stranded data using a design matrix with libType + condition, and just test for condition. Are the results between the two comparable when you look at the same contrasts? HTH, -steve On Tue, Feb 25, 2014 at 12:42 AM, Federico Gaiti <f.gaiti at="" uq.edu.au=""> wrote: > Hi Steve, > > Please see my comments below > > Thanks > ________________________________________ > >>Thanks for pasting the code now -- without the images or results from >>your output, it's hard to see what is reasonably correlated or not, >>but let's continue ... > >>I see you're already doing PCA, and it would be interesting to see how >>your results compare to what you get by following along with the >>DESeq2 vignette and shooting your data through one of the variance >>stabilizing transforms (incidentally, the DESeq vignette also has a >>similar section, and it would have been worth while to browse that >>vignette and follow the advice outlined there as well) > > This is exactly what I've done. I started doing a simple correlation anylsis and then shot my data first through DESeq and then through DESeq2. > You can find atttached some PCA plots and heatmaps when I used the option -s no for the stranded data > > In DESeq2: > > CountTable=read.table("DGE.txt",header=TRUE,row.names=1) > samples<-data.frame(row.names=c("ADULT","ADULT1","ADULT2","ADULT3"," JUV","JUV1","JUV2","JUV3","COMP","COMP1","COMP2","COMP3","PRECOMP","PR ECOMP1","PRECOMP2","PRECOMP3"),condition=as.factor(c(rep("ADULT",4),re p("JUV",4),rep("COMP",4),rep("PRECOMP",4)))) > dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=samples,des ign=~condition) > colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PREC OMP","COMP","JUV","ADULT")) > dds<-DESeq(dds) > rld <- rlogTransformation(dds, blind=TRUE) > rlogMat <- assay(rld) > vstMat <- assay(vsd) > vsd <- varianceStabilizingTransformation(dds, blind=TRUE) > px <- counts(dds)[,1] / sizeFactors(dds)[1] > ord <- order(px) > ord <- ord[px[ord] < 150] > ord <- ord[seq(1, length(ord), length=50)] > last <- ord[length(ord)] > vstcol <- c("blue", "black") > matplot(px[ord],cbind(assay(vsd)[, 1], log2(px))[ord, ],type="l", lty=1, col=vstcol, xlab="n", ylab="f(n)") > legend("bottomright",legend = c(expression("variance stabilizing transformation"),expression(log[2](n/s[1]))),fill=vstcol) > library("vsn") > par(mfrow=c(1,3)) > notAllZero <- (rowSums(counts(dds))>0) > meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1),ylim = c(0,2.5)) > meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5)) > meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5)) > library("RColorBrewer") > library("gplots") > select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100] > hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) > heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10,6)) > heatmap.2(assay(rld)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10, 6)) > heatmap.2(assay(vsd)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10, 6)) > distsRL <- dist(t(assay(rld))) > mat <- as.matrix(distsRL) > rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition)) > heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) > print(plotPCA(rld, intgroup=c("condition"))) > > In DESeq: > > data<-read.table("Counts",header=TRUE,row.names=1) > Design=data.frame(row.names=colnames(data),condition=c('1','1','1',' 1','2','2','2','2','3','3','3','3','4','4','4','4')) > cds=newCountDataSet(data,Design) > cds=estimateSizeFactors(cds) > cdsBlind=estimateDispersions(cds,method='blind') > vsd=varianceStabilizingTransformation(cdsBlind) > library("RColorBrewer") > library("gplots") > select = order(rowMeans(counts(cdsBlind)), decreasing=TRUE)[1:30] > hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100) > heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6)) > heatmap.2(counts(cdsBlind)[select,], col = hmcol, trace="none", margin=c(10,6)) > dists = dist( t( exprs(vsd) ) ) > mat = as.matrix( dists ) > rownames(mat) = colnames(mat) = with(pData(cdsBlind), paste(condition)) > heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) > print(plotPCA(vsd, intgroup=c("condition"))) > >>Are you saying that if you use the stranded data but do the counting >>as if it were unstranded, it looks "just fine," but counting the >>stranded data as stranded data makes it look quite bad? > >>Interesting ....I've actually never used stranded RNAseq data before, >>but that's a bit surprising ... > > Exactly. it looks way better than counting the stranded data as stranded data. > >>What if you only consider a subset of genes that are in regions that >>do not have annotated genes on the opposite strands -- the data for >>these genes from the stranded and unstranded should be quite similar, >>no? How do these look? > > I'll have a look, good idea! > >>But why? If you're just doing "vanilla DGE" and you have data that you >>feel like you can use (the unstranded data) in triplicate, and a >>source of singlicate data (the stranded one), why do you want to >>torture some DGE analysis by cramming the stranded data through it >>assuming <something> from the unstranded. > >>If you really want to use the stranded data, I'd encode the "library >>type" as a covariate in the linear model while doing differential >>expression. Both the DESeq and DESeq2 vignettes have a section on >>multi-factorial designs where they incorporate data from single and >>paired-end runs to perform a DGE analysis -- the way you would combine >>unstranded and stranded counts for a gene would be rather similar. > > I followed the multifactorial design section in both DESeq and DESeq2 vignette and then checked the quality by sample clustering and visualization. > > In DESeq: > > CountTable=read.table("DGE.txt",header=TRUE,row.names=1) > head(CountTable) > Design=data.frame(row.names=colnames(CountTable),condition=c("ADULT" ,"ADULT","ADULT","ADULT","JUV","JUV","JUV","JUV","COMP","COMP","COMP", "COMP","PRECOMP","PRECOMP","PRECOMP","PRECOMP"),libType=c("stranded"," unstranded","unstranded","unstranded","stranded","unstranded","unstran ded","unstranded","stranded","unstranded","unstranded","unstranded","s tranded","unstranded","unstranded","unstranded")) > cdsFull=newCountDataSet(CountTable,Design) > head(cdsFull) > cdsFull=estimateSizeFactors(cdsFull) > sizeFactors(cdsFull) > cdsFull=estimateDispersions(cdsFull) > plotDispEsts(cdsFull) > fit1=fitNbinomGLMs(cdsFull,count ~ libType + condition) > str(fit1) > fit0=fitNbinomGLMs(cdsFull,count ~ libType ) > pvalsGLM = nbinomGLMTest( fit1, fit0 ) > padjGLM = p.adjust( pvalsGLM, method="BH" ) > head(fit1) > cdsFullB=newCountDataSet(CountTable,Design$condition) > cdsFullB=estimateSizeFactors(cdsFullB) > cdsFullB=estimateDispersions(cdsFullB) > rs = rowSums ( counts ( cdsFull )) > theta = 0.4 > use = (rs > quantile(rs, probs=theta)) > table(use) > cdsFilt = cdsFull[ use, ] > fitFilt1 = fitNbinomGLMs( cdsFilt, count ~ libType + condition ) > fitFilt0 = fitNbinomGLMs( cdsFilt, count ~ libType ) > pvalsFilt = nbinomGLMTest( fitFilt1, fitFilt0 ) > padjFilt = p.adjust(pvalsFilt, method="BH" ) > padjFiltForComparison = rep(+Inf, length(padjGLM)) > padjFiltForComparison[use] = padjFilt > tab3 = table( `no filtering` = padjGLM < .05,`with filtering` = padjFiltForComparison < .05 ) > addmargins(tab3) > res=cbind(fit1,pvalsGLM=pvalsGLM,padjGLM=padjGLM) > resSig=res[which(res$padjGLM<0.05),] > table(res$padjGLM<0.05) > write.csv(res,file='res_Multifactor_DESeq.csv') > hist(res$pvalsGLM,breaks=100) > resFilt=cbind(fitFilt1,pvalsFilt=pvalsFilt,padjFilt=padjFilt) > resSigFilt=resFilt[which(resFilt$padjFilt<0.05),] > table(resFilt$padjFilt<0.05) > write.csv(resFilt,file='resFilt_Multifactor_DESeq.csv') > plot(rank(rs)/length(rs), -log10(pvalsGLM), pch=16, cex=0.45) > h1 = hist(pvalsGLM[!use], breaks=50, plot=FALSE) > h2 = hist(pvalsGLM[use], breaks=50, plot=FALSE) > colori = c(`do not pass`="khaki", `pass`="powderblue") > barplot(height = rbind(h1$counts, h2$counts), beside = FALSE, col = colori,space = 0, main = "", ylab="frequency") > text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)), adj = c(0.5,1.7), xpd=NA) > legend("topright", fill=rev(colori), legend=rev(names(colori))) > cdsFullBlind = estimateDispersions( cdsFull, method = "blind" ) > vsdFull = varianceStabilizingTransformation( cdsFullBlind ) > select = order(rowMeans(counts(cdsFull)), decreasing=TRUE)[1:100] > hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100) > heatmap.2(exprs(vsdFull)[select,], col = hmcol, trace="none", margin=c(10, 6)) > heatmap.2(counts(cdsFull)[select,], col = hmcol, trace="none", margin=c(10,6)) > dists = dist( t( exprs(vsdFull) ) ) > mat = as.matrix( dists ) > rownames(mat) = colnames(mat) = with(pData(cdsFullBlind), paste(condition, libType, sep=" : ")) > heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) > print(plotPCA(vsdFull, intgroup=c("condition", "libType"))) > > In DESeq2: > > CountTable=read.table("DGE.txt",header=TRUE,row.names=1) > head(CountTable) > Design=data.frame(row.names=colnames(CountTable),condition=c("ADULT" ,"ADULT","ADULT","ADULT","JUV","JUV","JUV","JUV","COMP","COMP","COMP", "COMP","PRECOMP","PRECOMP","PRECOMP","PRECOMP"),libType=c("stranded"," unstranded","unstranded","unstranded","stranded","unstranded","unstran ded","unstranded","stranded","unstranded","unstranded","unstranded","s tranded","unstranded","unstranded","unstranded")) > dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,desi gn=~condition + libType) > colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PREC OMP","COMP","JUV","ADULT")) > design(dds)<-formula(~libType + condition ) > dds<-DESeq(dds) > rld <- rlogTransformation(dds, blind=TRUE) > vsd <- varianceStabilizingTransformation(dds, blind=TRUE) > vstMat <- assay(vsd) > rlogMat <- assay(rld) > px <- counts(dds)[,1] / sizeFactors(dds)[1] > ord <- order(px) > ord <- ord[px[ord] < 150] > ord <- ord[seq(1, length(ord), length=50)] > last <- ord[length(ord)] > vstcol <- c("blue", "black") > matplot(px[ord], cbind(assay(vsd)[, 1], log2(px))[ord, ], type="l", lty=1, col=vstcol, xlab="n", ylab="f(n)") > legend("bottomright", legend = c(expression("variance stabilizing transformation"), expression(log[2](n/s[1]))),fill=vstcol) > library("vsn") > par(mfrow=c(1,3)) > notAllZero <- (rowSums(counts(dds))>0) > meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1), ylim = c(0,2.5)) > meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5)) > meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5)) > select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100] > hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) > heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6)) > heatmap.2(assay(rld)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10, 6)) > heatmap.2(assay(vsd)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10, 6)) > distsRL <- dist(t(assay(rld))) > select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:30] > hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) > heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6)) > distsRL <- dist(t(assay(rld))) > mat <- as.matrix(distsRL) > rownames(mat) <- colnames(mat) <- with(colData(dds), > paste(condition)) > rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition, libType, sep=" : ")) > heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) > print(plotPCA(rld, intgroup=c("condition", "libType"))) > > See attached the plots I obtained (*_MULTIFACTORIAL.jpeg) > > Let me know what you think > >>To be honest, if I had a set of data that had three replicates per >>condition, and one set of data that had "singlicate" samples in a >>condition which came from an apparently wildly different protocol, I'd >>probably just use the triplicate data and forget about the outlier >>data if I'm just doing "vanilla DGE". > >>If the stranded data is giving you wildly different results, the >>simplest explanation might be that something may have gotten screwed >>up in the library prep -- but you have no way to tell since you have >>no replicates of those data. > >>As I said, you can include it in your linear model by using a >>multi-factorial design if you *really* want to use it, but why bother >>for a vanilla DGE analysis? > >>On the other hand, if you're using the stranded data to do something >>more than just vanilla DGE -- perhaps you're interested in quantifying >>anti-sense transcription, then I'd try to do a lot more exploratory >>analysis to see how I could explain large difference in quantitation >>that I'm seeing. I'd still be really pissed that I had no replicates, >>but it wouldn't be the first time a graduate student is stuck with an >>underpowered dataset that they are tasked to torture, and it sadly >>won't be the last ... still, in this situation, I'd go the >>multi-factorial design approach to combine both datasets for any >>analysis I'm asked to do. In the meantime, I'd also complain to my >>advisor/collaborator whatever to twist their arm enough to run >>replicates for the future. > > You got the situation quite right. > I agree with you, I wouldn't bother using the stranded data if I was doing a vanilla DGE but I am investigating long non-coding RNAs and so I need to have anti-sense transcription quantification. This is reason that brought me to include the untranded data in the analysis, to increase the number of replicates and test for DE to obtain a set of lncRNAs I can handle for downstream analysis. > > Let me know if you have any further idea > > Thanks a lot for help > > Federico > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Steve Lianoglou Computational Biologist Genentech

ADD REPLY • link 10.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve, thanks for the reply and sorry for all the code. I'm still a beginner in this field so I'm still learning how to correctly formulate my questions/emails. I agree with you about the PCA plot analysis. Could you just explain better to me what you mean with " If this is the case, then encoding the libType as a main effect in your model (as you've done) should go a long ways in dealing with this issue for you." ?? So let's see if I got what you are saying. Are you suggesting I should try to do a DGE with the undtranded data with "condition" as the only level and then compare it to the DGE outcome using a multifactorial design? This would be the way I start the multifactorial analysis: dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~condition + libType) > colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PREC OMP","COMP","JUV","ADULT")) > design(dds)<-formula(~libType + condition) Am I getting it right? If so, I'll go ahead and keep you posted about the outcome Thanks again for help __________________________________ From: mailinglist.honeypot@gmail.com [mailinglist.honeypot@gmail.com] on behalf of Steve Lianoglou [lianoglou.steve@gene.com] Sent: Wednesday, 26 February 2014 6:31 PM To: Federico Gaiti Cc: bioconductor at r-project.org Subject: Re: [BioC] Low number of replicates DESeq Hi, In the future, you don't have to include all of the code you've ever written :-) For instance, you have code to write tables, do DGE, do a mean vs. sd plot, etc. but it's not really germane to this conversation. Also, your attachments didn't make it through to the list, but since I was cc'd on the email, I got them in my inbox. You are providing diagnostic plots (PCA and heatmap) from a mult-factorial design, but you should note that the fact that you encode the library type in the design will not change the output of these plots. That is to say that the "libType" hasn't been corrected for in the variance stabilized data. It is only accounted for in your results form the DGE analysis. Anyway: the PCA plot from the vst transformed data (your DESeq1 pca plot) seems to suggest that your conditions are lining up pretty reasonably along PC2 (vertically), while PC1 simply separates your samples by libType -- would you agree with that assessment? (Forget the pca from the rlog transform -- it seems the VST is doing better here). If this is the case, then encoding the libType as a main effect in your model (as you've done) should go a long ways in dealing with this issue for you. How can you tell? One way is to do a DGE analysis with just the unstranded data, then do it again with the unstranded data AND stranded data using a design matrix with libType + condition, and just test for condition. Are the results between the two comparable when you look at the same contrasts? HTH, -steve On Tue, Feb 25, 2014 at 12:42 AM, Federico Gaiti <f.gaiti at="" uq.edu.au=""> wrote: > Hi Steve, > > Please see my comments below > > Thanks > ________________________________________ > >>Thanks for pasting the code now -- without the images or results from >>your output, it's hard to see what is reasonably correlated or not, >>but let's continue ... > >>I see you're already doing PCA, and it would be interesting to see how >>your results compare to what you get by following along with the >>DESeq2 vignette and shooting your data through one of the variance >>stabilizing transforms (incidentally, the DESeq vignette also has a >>similar section, and it would have been worth while to browse that >>vignette and follow the advice outlined there as well) > > This is exactly what I've done. I started doing a simple correlation anylsis and then shot my data first through DESeq and then through DESeq2. > You can find atttached some PCA plots and heatmaps when I used the option -s no for the stranded data > > In DESeq2: > > CountTable=read.table("DGE.txt",header=TRUE,row.names=1) > samples<-data.frame(row.names=c("ADULT","ADULT1","ADULT2","ADULT3"," JUV","JUV1","JUV2","JUV3","COMP","COMP1","COMP2","COMP3","PRECOMP","PR ECOMP1","PRECOMP2","PRECOMP3"),condition=as.factor(c(rep("ADULT",4),re p("JUV",4),rep("COMP",4),rep("PRECOMP",4)))) > dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=samples,des ign=~condition) > colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PREC OMP","COMP","JUV","ADULT")) > dds<-DESeq(dds) > rld <- rlogTransformation(dds, blind=TRUE) > rlogMat <- assay(rld) > vstMat <- assay(vsd) > vsd <- varianceStabilizingTransformation(dds, blind=TRUE) > px <- counts(dds)[,1] / sizeFactors(dds)[1] > ord <- order(px) > ord <- ord[px[ord] < 150] > ord <- ord[seq(1, length(ord), length=50)] > last <- ord[length(ord)] > vstcol <- c("blue", "black") > matplot(px[ord],cbind(assay(vsd)[, 1], log2(px))[ord, ],type="l", lty=1, col=vstcol, xlab="n", ylab="f(n)") > legend("bottomright",legend = c(expression("variance stabilizing transformation"),expression(log[2](n/s[1]))),fill=vstcol) > library("vsn") > par(mfrow=c(1,3)) > notAllZero <- (rowSums(counts(dds))>0) > meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1),ylim = c(0,2.5)) > meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5)) > meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5)) > library("RColorBrewer") > library("gplots") > select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100] > hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) > heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10,6)) > heatmap.2(assay(rld)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10, 6)) > heatmap.2(assay(vsd)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10, 6)) > distsRL <- dist(t(assay(rld))) > mat <- as.matrix(distsRL) > rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition)) > heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) > print(plotPCA(rld, intgroup=c("condition"))) > > In DESeq: > > data<-read.table("Counts",header=TRUE,row.names=1) > Design=data.frame(row.names=colnames(data),condition=c('1','1','1',' 1','2','2','2','2','3','3','3','3','4','4','4','4')) > cds=newCountDataSet(data,Design) > cds=estimateSizeFactors(cds) > cdsBlind=estimateDispersions(cds,method='blind') > vsd=varianceStabilizingTransformation(cdsBlind) > library("RColorBrewer") > library("gplots") > select = order(rowMeans(counts(cdsBlind)), decreasing=TRUE)[1:30] > hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100) > heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6)) > heatmap.2(counts(cdsBlind)[select,], col = hmcol, trace="none", margin=c(10,6)) > dists = dist( t( exprs(vsd) ) ) > mat = as.matrix( dists ) > rownames(mat) = colnames(mat) = with(pData(cdsBlind), paste(condition)) > heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) > print(plotPCA(vsd, intgroup=c("condition"))) > >>Are you saying that if you use the stranded data but do the counting >>as if it were unstranded, it looks "just fine," but counting the >>stranded data as stranded data makes it look quite bad? > >>Interesting ....I've actually never used stranded RNAseq data before, >>but that's a bit surprising ... > > Exactly. it looks way better than counting the stranded data as stranded data. > >>What if you only consider a subset of genes that are in regions that >>do not have annotated genes on the opposite strands -- the data for >>these genes from the stranded and unstranded should be quite similar, >>no? How do these look? > > I'll have a look, good idea! > >>But why? If you're just doing "vanilla DGE" and you have data that you >>feel like you can use (the unstranded data) in triplicate, and a >>source of singlicate data (the stranded one), why do you want to >>torture some DGE analysis by cramming the stranded data through it >>assuming <something> from the unstranded. > >>If you really want to use the stranded data, I'd encode the "library >>type" as a covariate in the linear model while doing differential >>expression. Both the DESeq and DESeq2 vignettes have a section on >>multi-factorial designs where they incorporate data from single and >>paired-end runs to perform a DGE analysis -- the way you would combine >>unstranded and stranded counts for a gene would be rather similar. > > I followed the multifactorial design section in both DESeq and DESeq2 vignette and then checked the quality by sample clustering and visualization. > > In DESeq: > > CountTable=read.table("DGE.txt",header=TRUE,row.names=1) > head(CountTable) > Design=data.frame(row.names=colnames(CountTable),condition=c("ADULT" ,"ADULT","ADULT","ADULT","JUV","JUV","JUV","JUV","COMP","COMP","COMP", "COMP","PRECOMP","PRECOMP","PRECOMP","PRECOMP"),libType=c("stranded"," unstranded","unstranded","unstranded","stranded","unstranded","unstran ded","unstranded","stranded","unstranded","unstranded","unstranded","s tranded","unstranded","unstranded","unstranded")) > cdsFull=newCountDataSet(CountTable,Design) > head(cdsFull) > cdsFull=estimateSizeFactors(cdsFull) > sizeFactors(cdsFull) > cdsFull=estimateDispersions(cdsFull) > plotDispEsts(cdsFull) > fit1=fitNbinomGLMs(cdsFull,count ~ libType + condition) > str(fit1) > fit0=fitNbinomGLMs(cdsFull,count ~ libType ) > pvalsGLM = nbinomGLMTest( fit1, fit0 ) > padjGLM = p.adjust( pvalsGLM, method="BH" ) > head(fit1) > cdsFullB=newCountDataSet(CountTable,Design$condition) > cdsFullB=estimateSizeFactors(cdsFullB) > cdsFullB=estimateDispersions(cdsFullB) > rs = rowSums ( counts ( cdsFull )) > theta = 0.4 > use = (rs > quantile(rs, probs=theta)) > table(use) > cdsFilt = cdsFull[ use, ] > fitFilt1 = fitNbinomGLMs( cdsFilt, count ~ libType + condition ) > fitFilt0 = fitNbinomGLMs( cdsFilt, count ~ libType ) > pvalsFilt = nbinomGLMTest( fitFilt1, fitFilt0 ) > padjFilt = p.adjust(pvalsFilt, method="BH" ) > padjFiltForComparison = rep(+Inf, length(padjGLM)) > padjFiltForComparison[use] = padjFilt > tab3 = table( `no filtering` = padjGLM < .05,`with filtering` = padjFiltForComparison < .05 ) > addmargins(tab3) > res=cbind(fit1,pvalsGLM=pvalsGLM,padjGLM=padjGLM) > resSig=res[which(res$padjGLM<0.05),] > table(res$padjGLM<0.05) > write.csv(res,file='res_Multifactor_DESeq.csv') > hist(res$pvalsGLM,breaks=100) > resFilt=cbind(fitFilt1,pvalsFilt=pvalsFilt,padjFilt=padjFilt) > resSigFilt=resFilt[which(resFilt$padjFilt<0.05),] > table(resFilt$padjFilt<0.05) > write.csv(resFilt,file='resFilt_Multifactor_DESeq.csv') > plot(rank(rs)/length(rs), -log10(pvalsGLM), pch=16, cex=0.45) > h1 = hist(pvalsGLM[!use], breaks=50, plot=FALSE) > h2 = hist(pvalsGLM[use], breaks=50, plot=FALSE) > colori = c(`do not pass`="khaki", `pass`="powderblue") > barplot(height = rbind(h1$counts, h2$counts), beside = FALSE, col = colori,space = 0, main = "", ylab="frequency") > text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)), adj = c(0.5,1.7), xpd=NA) > legend("topright", fill=rev(colori), legend=rev(names(colori))) > cdsFullBlind = estimateDispersions( cdsFull, method = "blind" ) > vsdFull = varianceStabilizingTransformation( cdsFullBlind ) > select = order(rowMeans(counts(cdsFull)), decreasing=TRUE)[1:100] > hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100) > heatmap.2(exprs(vsdFull)[select,], col = hmcol, trace="none", margin=c(10, 6)) > heatmap.2(counts(cdsFull)[select,], col = hmcol, trace="none", margin=c(10,6)) > dists = dist( t( exprs(vsdFull) ) ) > mat = as.matrix( dists ) > rownames(mat) = colnames(mat) = with(pData(cdsFullBlind), paste(condition, libType, sep=" : ")) > heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) > print(plotPCA(vsdFull, intgroup=c("condition", "libType"))) > > In DESeq2: > > CountTable=read.table("DGE.txt",header=TRUE,row.names=1) > head(CountTable) > Design=data.frame(row.names=colnames(CountTable),condition=c("ADULT" ,"ADULT","ADULT","ADULT","JUV","JUV","JUV","JUV","COMP","COMP","COMP", "COMP","PRECOMP","PRECOMP","PRECOMP","PRECOMP"),libType=c("stranded"," unstranded","unstranded","unstranded","stranded","unstranded","unstran ded","unstranded","stranded","unstranded","unstranded","unstranded","s tranded","unstranded","unstranded","unstranded")) > dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,desi gn=~condition + libType) > colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PREC OMP","COMP","JUV","ADULT")) > design(dds)<-formula(~libType + condition ) > dds<-DESeq(dds) > rld <- rlogTransformation(dds, blind=TRUE) > vsd <- varianceStabilizingTransformation(dds, blind=TRUE) > vstMat <- assay(vsd) > rlogMat <- assay(rld) > px <- counts(dds)[,1] / sizeFactors(dds)[1] > ord <- order(px) > ord <- ord[px[ord] < 150] > ord <- ord[seq(1, length(ord), length=50)] > last <- ord[length(ord)] > vstcol <- c("blue", "black") > matplot(px[ord], cbind(assay(vsd)[, 1], log2(px))[ord, ], type="l", lty=1, col=vstcol, xlab="n", ylab="f(n)") > legend("bottomright", legend = c(expression("variance stabilizing transformation"), expression(log[2](n/s[1]))),fill=vstcol) > library("vsn") > par(mfrow=c(1,3)) > notAllZero <- (rowSums(counts(dds))>0) > meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1), ylim = c(0,2.5)) > meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5)) > meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5)) > select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100] > hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) > heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6)) > heatmap.2(assay(rld)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10, 6)) > heatmap.2(assay(vsd)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10, 6)) > distsRL <- dist(t(assay(rld))) > select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:30] > hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) > heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6)) > distsRL <- dist(t(assay(rld))) > mat <- as.matrix(distsRL) > rownames(mat) <- colnames(mat) <- with(colData(dds), > paste(condition)) > rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition, libType, sep=" : ")) > heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) > print(plotPCA(rld, intgroup=c("condition", "libType"))) > > See attached the plots I obtained (*_MULTIFACTORIAL.jpeg) > > Let me know what you think > >>To be honest, if I had a set of data that had three replicates per >>condition, and one set of data that had "singlicate" samples in a >>condition which came from an apparently wildly different protocol, I'd >>probably just use the triplicate data and forget about the outlier >>data if I'm just doing "vanilla DGE". > >>If the stranded data is giving you wildly different results, the >>simplest explanation might be that something may have gotten screwed >>up in the library prep -- but you have no way to tell since you have >>no replicates of those data. > >>As I said, you can include it in your linear model by using a >>multi-factorial design if you *really* want to use it, but why bother >>for a vanilla DGE analysis? > >>On the other hand, if you're using the stranded data to do something >>more than just vanilla DGE -- perhaps you're interested in quantifying >>anti-sense transcription, then I'd try to do a lot more exploratory >>analysis to see how I could explain large difference in quantitation >>that I'm seeing. I'd still be really pissed that I had no replicates, >>but it wouldn't be the first time a graduate student is stuck with an >>underpowered dataset that they are tasked to torture, and it sadly >>won't be the last ... still, in this situation, I'd go the >>multi-factorial design approach to combine both datasets for any >>analysis I'm asked to do. In the meantime, I'd also complain to my >>advisor/collaborator whatever to twist their arm enough to run >>replicates for the future. > > You got the situation quite right. > I agree with you, I wouldn't bother using the stranded data if I was doing a vanilla DGE but I am investigating long non-coding RNAs and so I need to have anti-sense transcription quantification. This is reason that brought me to include the untranded data in the analysis, to increase the number of replicates and test for DE to obtain a set of lncRNAs I can handle for downstream analysis. > > Let me know if you have any further idea > > Thanks a lot for help > > Federico > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Steve Lianoglou Computational Biologist Genentech

ADD REPLY • link 10.2 years ago Federico Gaiti ▴ 130

0

Entering edit mode

Hi, On Wed, Feb 26, 2014 at 12:50 AM, Federico Gaiti <f.gaiti at="" uq.edu.au=""> wrote: > Hi Steve, > > thanks for the reply and sorry for all the code. > I'm still a beginner in this field so I'm still learning how to correctly formulate my questions/emails. Yeah, no problem, just pointing these things out -- keep in mind that it takes even experienced people time to wade through lots of code, so it's best to keep things short and sweet (with sufficient detail, of course ;-) > I agree with you about the PCA plot analysis. > Could you just explain better to me what you mean with " If this is the case, then encoding the libType as a main effect in your model (as you've done) should go a long ways in dealing with this issue for you." ?? > > So let's see if I got what you are saying. > Are you suggesting I should try to do a DGE with the undtranded data with "condition" as the only level and then compare it to the DGE outcome using a multifactorial design? > > This would be the way I start the multifactorial analysis: > > dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,desi gn=~condition + libType) >> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRE COMP","COMP","JUV","ADULT")) >> design(dds)<-formula(~libType + condition) > > Am I getting it right? > If so, I'll go ahead and keep you posted about the outcome Yes, you are getting it right -- I'd put the `condition` data on the Design data.frame before you create the dds, but I'm not if it will matter. Just follow closely the example in the deseq2 vignette. Read the entire vignette actually, so you understand how to get the particular results you are after out of your objects (ie. what the things are that you should pass into a call to `results` for instance). You will be working with two models, say `dds1` which was built with *only* the unstranded data and your design is ~ condition. dds2 will be the model with the unstranded and stranded along with the `~ libType + condition` design. Once you have those, look at the output from `resultsNames(dds1)` and `resultsNames(dds2)` and see that you compare the same results between dds1 and dds2. This should become more clear to you as you read the deseq2 vignette (read it again if you think you already read it once) and when you look with your data. Note that the DESeq2 folks recently posted an early version of the paper detaling the deseq2 method here: http://biorxiv.org/content/early/2014/02/19/002832 Which would be helpful to read. -steve -- Steve Lianoglou Computational Biologist Genentech

ADD REPLY • link 10.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve, I carefully read the DESeq2 vignette (February 19, 2014) and then did the DGE using two different models as you suggested and then performed different contrasts. Multifactorial > head(CountTable) ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1 COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3 asmbl_1 30 0 24 48 84 5 1 1 47 15 8 6 47 28 27 47 asmbl_10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 asmbl_100 0 0 0 0 0 4 0 0 0 1 2 3 2 5 5 7 asmbl_1000 0 8 0 7 5 5 19 7 14 4 4 7 9 4 12 1 asmbl_10000 11 75 0 73 103 12 112 51 65 43 17 57 56 18 23 63 asmbl_10001 0 1 0 0 3 11 6 4 4 4 11 1 5 0 3 0 dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~libType + condition) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) design(dds)<-formula(~libType + condition ) dds<-DESeq(dds) res<-results(dds) resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) > resultsNames(dds) [1] "Intercept" "libTypestranded" [3] "libTypeunstranded" "conditionPRECOMP" [5] "conditionCOMP" "conditionJUV" [7] "conditionADULT" ONE FACTOR > head(CountTable) ADULT1 ADULT2 ADULT3 JUV1 JUV2 JUV3 COMP1 COMP2 COMP3 PRECOMP1 PRECOMP2 PRECOMP3 asmbl_1 0 24 48 5 1 1 15 8 6 28 27 47 asmbl_10 0 0 0 0 0 0 0 0 0 0 0 0 asmbl_100 0 0 0 4 0 0 1 2 3 5 5 7 asmbl_1000 8 0 7 5 19 7 4 4 7 4 12 1 asmbl_10000 75 0 73 12 112 51 43 17 57 18 23 63 asmbl_10001 1 0 0 11 6 4 4 11 1 0 3 0 dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~condition) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) design(dds)<-formula(~condition ) dds<-DESeq(dds) res<-results(dds) resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) > resultsNames(dds) [1] "Intercept" "conditionPRECOMP" [3] "conditionCOMP" "conditionJUV" [5] "conditionADULT" Here is the number of DE genes at a threshold of 0.05 (padj<0.05) PreComp-Comp Comp-Juv Juv-Adult Shared 1400 5541 5733 Multifactorial specific 98 1584 1304 One-factor specific 1436 1658 2480 As you can see considering *only* unstranded data in the analysis detected more DE genes but they seem comparable (at least to me). Any thougths on this? Should I rely on the multifactorial design? Thanks for help Fede ________________________________________ From: mailinglist.honeypot@gmail.com [mailinglist.honeypot@gmail.com] on behalf of Steve Lianoglou [lianoglou.steve@gene.com] Sent: Wednesday, 26 February 2014 7:03 PM To: Federico Gaiti Cc: bioconductor@r-project.org Subject: Re: [BioC] Low number of replicates DESeq Hi, On Wed, Feb 26, 2014 at 12:50 AM, Federico Gaiti <f.gaiti@uq.edu.au> wrote: > Hi Steve, > > thanks for the reply and sorry for all the code. > I'm still a beginner in this field so I'm still learning how to correctly formulate my questions/emails. Yeah, no problem, just pointing these things out -- keep in mind that it takes even experienced people time to wade through lots of code, so it's best to keep things short and sweet (with sufficient detail, of course ;-) > I agree with you about the PCA plot analysis. > Could you just explain better to me what you mean with " If this is the case, then encoding the libType as a main effect in your model (as you've done) should go a long ways in dealing with this issue for you." ?? > > So let's see if I got what you are saying. > Are you suggesting I should try to do a DGE with the undtranded data with "condition" as the only level and then compare it to the DGE outcome using a multifactorial design? > > This would be the way I start the multifactorial analysis: > > dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,desi gn=~condition + libType) >> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRE COMP","COMP","JUV","ADULT")) >> design(dds)<-formula(~libType + condition) > > Am I getting it right? > If so, I'll go ahead and keep you posted about the outcome Yes, you are getting it right -- I'd put the `condition` data on the Design data.frame before you create the dds, but I'm not if it will matter. Just follow closely the example in the deseq2 vignette. Read the entire vignette actually, so you understand how to get the particular results you are after out of your objects (ie. what the things are that you should pass into a call to `results` for instance). You will be working with two models, say `dds1` which was built with *only* the unstranded data and your design is ~ condition. dds2 will be the model with the unstranded and stranded along with the `~ libType + condition` design. Once you have those, look at the output from `resultsNames(dds1)` and `resultsNames(dds2)` and see that you compare the same results between dds1 and dds2. This should become more clear to you as you read the deseq2 vignette (read it again if you think you already read it once) and when you look with your data. Note that the DESeq2 folks recently posted an early version of the paper detaling the deseq2 method here: http://biorxiv.org/content/early/2014/02/19/002832 Which would be helpful to read. -steve -- Steve Lianoglou Computational Biologist Genentech [[alternative HTML version deleted]]

ADD REPLY • link 10.1 years ago Federico Gaiti ▴ 130

0

Entering edit mode

hi Federico, The question of design falls on what you are looking for. The multifactorial design gets at differences which are consistent for both stranded and unstranded experiments (though the unstranded has 3 times more samples, so contributes more to a gene's likelihood of being detected DE here). But to go back to an earlier point. You mentioned earlier: "I am investigating long non-coding RNAs and so I need to have anti-sense transcription quantification." Your current multifactorial analysis is looking for consistent differences across developmental stages, between sense transcription (your stranded experiment counted with -s yes rather than -s reverse) and when you combine sense and anti-sense transcription (your unstranded experiments counted with -s no). Anti-sense transcription plays little role here if we assume that more reads are coming from sense than anti-sense. Note that if you are looking in particular for differences in anti- sense transcription across developmental stages, you need to use the -s reverse option to htseq-count, and peform biological replicates. I don't see any way around requiring more replicates, as there are both technical and biological sources of variation which will be different in the stranded and unstranded experiments. Adding in the unstranded data seems not so helpful, as you are mixing a small signal of interest (anti-sense transcription) with most likely a lot more reads coming from sense transcription. You mentioned, "I tried to use the option -s reverse for the stranded data and still got really low correlation." Wouldn't this makes sense, because you are comparing anti-sense transcription to the unstranded protocol which is likely capturing mostly sense transcription? Mike On Thu, Feb 27, 2014 at 3:36 AM, Federico Gaiti <f.gaiti@uq.edu.au> wrote: > Hi Steve, > > I carefully read the DESeq2 vignette (February 19, 2014) and then did the > DGE using two different models as you suggested and then performed > different contrasts. > > Multifactorial > > > > head(CountTable) > ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1 COMP2 > COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3 > asmbl_1 30 0 24 48 84 5 1 1 47 15 8 > 6 47 28 27 47 > asmbl_10 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 > asmbl_100 0 0 0 0 0 4 0 0 0 1 2 > 3 2 5 5 7 > asmbl_1000 0 8 0 7 5 5 19 7 14 4 4 > 7 9 4 12 1 > asmbl_10000 11 75 0 73 103 12 112 51 65 43 17 > 57 56 18 23 63 > asmbl_10001 0 1 0 0 3 11 6 4 4 4 11 > 1 5 0 3 0 > > dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,desi gn=~libType > + condition) > > colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PREC OMP","COMP","JUV","ADULT")) > design(dds)<-formula(~libType + condition ) > dds<-DESeq(dds) > res<-results(dds) > resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) > resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) > resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) > > > resultsNames(dds) > [1] "Intercept" "libTypestranded" > [3] "libTypeunstranded" "conditionPRECOMP" > [5] "conditionCOMP" "conditionJUV" > [7] "conditionADULT" > > > ONE FACTOR > > > > head(CountTable) > ADULT1 ADULT2 ADULT3 JUV1 JUV2 JUV3 COMP1 COMP2 COMP3 PRECOMP1 > PRECOMP2 PRECOMP3 > asmbl_1 0 24 48 5 1 1 15 8 6 28 > 27 47 > asmbl_10 0 0 0 0 0 0 0 0 0 0 > 0 0 > asmbl_100 0 0 0 4 0 0 1 2 3 5 > 5 7 > asmbl_1000 8 0 7 5 19 7 4 4 7 4 > 12 1 > asmbl_10000 75 0 73 12 112 51 43 17 57 18 > 23 63 > asmbl_10001 1 0 0 11 6 4 4 11 1 0 > 3 0 > > > > > dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,desi gn=~condition) > > colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PREC OMP","COMP","JUV","ADULT")) > design(dds)<-formula(~condition ) > dds<-DESeq(dds) > res<-results(dds) > resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) > resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) > resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) > > > resultsNames(dds) > [1] "Intercept" "conditionPRECOMP" > [3] "conditionCOMP" "conditionJUV" > [5] "conditionADULT" > > Here is the number of DE genes at a threshold of 0.05 (padj<0.05) > > PreComp-Comp Comp-Juv Juv- Adult > Shared 1400 5541 5733 > Multifactorial specific 98 1584 1304 > One-factor specific 1436 1658 2480 > > As you can see considering *only* unstranded data in the analysis detected > more DE genes but they seem comparable (at least to me). > > Any thougths on this? Should I rely on the multifactorial design? > > Thanks for help > Fede > > ________________________________________ > From: mailinglist.honeypot@gmail.com [mailinglist.honeypot@gmail.com] on > behalf of Steve Lianoglou [lianoglou.steve@gene.com] > Sent: Wednesday, 26 February 2014 7:03 PM > To: Federico Gaiti > Cc: bioconductor@r-project.org > Subject: Re: [BioC] Low number of replicates DESeq > > Hi, > > On Wed, Feb 26, 2014 at 12:50 AM, Federico Gaiti <f.gaiti@uq.edu.au> > wrote: > > Hi Steve, > > > > thanks for the reply and sorry for all the code. > > I'm still a beginner in this field so I'm still learning how to > correctly formulate my questions/emails. > > Yeah, no problem, just pointing these things out -- keep in mind that > it takes even experienced people time to wade through lots of code, so > it's best to keep things short and sweet (with sufficient detail, of > course ;-) > > > I agree with you about the PCA plot analysis. > > Could you just explain better to me what you mean with " If this is the > case, then encoding the libType as a main effect in your model (as you've > done) should go a long ways in dealing with this issue for you." ?? > > > > So let's see if I got what you are saying. > > Are you suggesting I should try to do a DGE with the undtranded data > with "condition" as the only level and then compare it to the DGE outcome > using a multifactorial design? > > > > This would be the way I start the multifactorial analysis: > > > > > dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,desi gn=~condition > + libType) > >> > colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PREC OMP","COMP","JUV","ADULT")) > >> design(dds)<-formula(~libType + condition) > > > > Am I getting it right? > > If so, I'll go ahead and keep you posted about the outcome > > Yes, you are getting it right -- I'd put the `condition` data on the > Design data.frame before you create the dds, but I'm not if it will > matter. > > Just follow closely the example in the deseq2 vignette. Read the > entire vignette actually, so you understand how to get the particular > results you are after out of your objects (ie. what the things are > that you should pass into a call to `results` for instance). > > You will be working with two models, say `dds1` which was built with > *only* the unstranded data and your design is ~ condition. > > dds2 will be the model with the unstranded and stranded along with the > `~ libType + condition` design. > > Once you have those, look at the output from `resultsNames(dds1)` and > `resultsNames(dds2)` and see that you compare the same results between > dds1 and dds2. > > This should become more clear to you as you read the deseq2 vignette > (read it again if you think you already read it once) and when you > look with your data. > > Note that the DESeq2 folks recently posted an early version of the > paper detaling the deseq2 method here: > http://biorxiv.org/content/early/2014/02/19/002832 > > Which would be helpful to read. > > -steve > > -- > Steve Lianoglou > Computational Biologist > Genentech > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 10.1 years ago Michael Love 41k

0

Entering edit mode

Hi It seems that we have this thread twice, once here and once on SeqAnswers, with somewhat diverging argumentation. This is rather unfortunate... Hence, for completeness, here is the link to the other half of the discussion: http://seqanswers.com/forums/showthread.php?t=41155 Simon

ADD REPLY • link 10.1 years ago Simon Anders ★ 3.7k

0

Entering edit mode

Hi Mike, Thanks for reply. I see what you mean but I'm a bit confused about ht- seq count now. Please see also an open thread with Simon Anders where I'm discussing this in details: http://seqanswers.com/forums/showthread.php?p=133959& posted=1#post133959 Sorry for the crosspost I know, I just didn't know it's not a good idea. It's actually one of the first question I post online so I'm still learning how all this works. I am investgating lncRNAs, which can be intronic, intergenic, can overlap on the same strand of another gene or have anti-sense orientation. That's what I meant with "I need to have anti-sense transcription". I need the stranded data to account for this. As for htseq-count I thought that depending on the library preparation for stranded libraries I could select -s reverse or -s yes. And so to be sure I did a quick test on the stranded libraries using infer_experiment.py, it is indeed forward-reverse. This is PairEnd Data Fraction of reads explained by "1++,1--,2+-,2-+": 0.9189 Fraction of reads explained by "1+-,1-+,2++,2--": 0.0811 Fraction of reads explained by other combinations: 0.0000 1++,1,2+-,2-+ read1 mapped to + strand indicates parental gene on + strand read1 mapped to - strand indicates parental gene on - strand read2 mapped to + strand indicates parental gene on - strand read2 mapped to - strand indicates parental gene on + strand Based on this I ran TOPHAT with fr-secondstrand option and htseq-count with -s yes As Simon said in the other thread "if a read maps to a position which is covered by a regular gene on one strand and by an overlapping antisense transcript on the other strand, then this read will be counted as ambiguous if you have set "stranded" to "no", because there is no information to decide whether the read originated from the sense of from the antisense transcript. For "stranded=yes", however, the read will be counted for the feature that is on the same strand as the read" So why would my stranded experiment counted with -s yes capture only sense trasncription? Shouldn't my stranded experiment counted with -s yes capture both sense and anti-sense transcription based on where the reads map? Also, if selecting this option depends on the library prepration protocol and not on the DGE design, shouldn't -s reverse be "wrong" in my case? Thanks for clarifications and help Federico ________________________________ From: Michael Love [michaelisaiahlove@gmail.com] Sent: Friday, 28 February 2014 2:23 AM To: Federico Gaiti Cc: Steve Lianoglou; bioconductor@r-project.org Subject: Re: [BioC] Low number of replicates DESeq hi Federico, The question of design falls on what you are looking for. The multifactorial design gets at differences which are consistent for both stranded and unstranded experiments (though the unstranded has 3 times more samples, so contributes more to a gene's likelihood of being detected DE here). But to go back to an earlier point. You mentioned earlier: "I am investigating long non-coding RNAs and so I need to have anti-sense transcription quantification." Your current multifactorial analysis is looking for consistent differences across developmental stages, between sense transcription (your stranded experiment counted with -s yes rather than -s reverse) and when you combine sense and anti-sense transcription (your unstranded experiments counted with -s no). Anti- sense transcription plays little role here if we assume that more reads are coming from sense than anti-sense. Note that if you are looking in particular for differences in anti- sense transcription across developmental stages, you need to use the -s reverse option to htseq-count, and peform biological replicates. I don't see any way around requiring more replicates, as there are both technical and biological sources of variation which will be different in the stranded and unstranded experiments. Adding in the unstranded data seems not so helpful, as you are mixing a small signal of interest (anti-sense transcription) with most likely a lot more reads coming from sense transcription. You mentioned, "I tried to use the option -s reverse for the stranded data and still got really low correlation." Wouldn't this makes sense, because you are comparing anti-sense transcription to the unstranded protocol which is likely capturing mostly sense transcription? Mike On Thu, Feb 27, 2014 at 3:36 AM, Federico Gaiti <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote: Hi Steve, I carefully read the DESeq2 vignette (February 19, 2014) and then did the DGE using two different models as you suggested and then performed different contrasts. Multifactorial > head(CountTable) ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1 COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3 asmbl_1 30 0 24 48 84 5 1 1 47 15 8 6 47 28 27 47 asmbl_10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 asmbl_100 0 0 0 0 0 4 0 0 0 1 2 3 2 5 5 7 asmbl_1000 0 8 0 7 5 5 19 7 14 4 4 7 9 4 12 1 asmbl_10000 11 75 0 73 103 12 112 51 65 43 17 57 56 18 23 63 asmbl_10001 0 1 0 0 3 11 6 4 4 4 11 1 5 0 3 0 dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~libType + condition) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) design(dds)<-formula(~libType + condition ) dds<-DESeq(dds) res<-results(dds) resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) > resultsNames(dds) [1] "Intercept" "libTypestranded" [3] "libTypeunstranded" "conditionPRECOMP" [5] "conditionCOMP" "conditionJUV" [7] "conditionADULT" ONE FACTOR > head(CountTable) ADULT1 ADULT2 ADULT3 JUV1 JUV2 JUV3 COMP1 COMP2 COMP3 PRECOMP1 PRECOMP2 PRECOMP3 asmbl_1 0 24 48 5 1 1 15 8 6 28 27 47 asmbl_10 0 0 0 0 0 0 0 0 0 0 0 0 asmbl_100 0 0 0 4 0 0 1 2 3 5 5 7 asmbl_1000 8 0 7 5 19 7 4 4 7 4 12 1 asmbl_10000 75 0 73 12 112 51 43 17 57 18 23 63 asmbl_10001 1 0 0 11 6 4 4 11 1 0 3 0 dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~condition) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) design(dds)<-formula(~condition ) dds<-DESeq(dds) res<-results(dds) resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) > resultsNames(dds) [1] "Intercept" "conditionPRECOMP" [3] "conditionCOMP" "conditionJUV" [5] "conditionADULT" Here is the number of DE genes at a threshold of 0.05 (padj<0.05) PreComp-Comp Comp-Juv Juv-Adult Shared 1400 5541 5733 Multifactorial specific 98 1584 1304 One-factor specific 1436 1658 2480 As you can see considering *only* unstranded data in the analysis detected more DE genes but they seem comparable (at least to me). Any thougths on this? Should I rely on the multifactorial design? Thanks for help Fede ________________________________________ From: mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com> [mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com> ] on behalf of Steve Lianoglou [lianoglou.steve@gene.com<mailto:lianoglou.steve@gene.com>] Sent: Wednesday, 26 February 2014 7:03 PM To: Federico Gaiti Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org> Subject: Re: [BioC] Low number of replicates DESeq Hi, On Wed, Feb 26, 2014 at 12:50 AM, Federico Gaiti <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote: > Hi Steve, > > thanks for the reply and sorry for all the code. > I'm still a beginner in this field so I'm still learning how to correctly formulate my questions/emails. Yeah, no problem, just pointing these things out -- keep in mind that it takes even experienced people time to wade through lots of code, so it's best to keep things short and sweet (with sufficient detail, of course ;-) > I agree with you about the PCA plot analysis. > Could you just explain better to me what you mean with " If this is the case, then encoding the libType as a main effect in your model (as you've done) should go a long ways in dealing with this issue for you." ?? > > So let's see if I got what you are saying. > Are you suggesting I should try to do a DGE with the undtranded data with "condition" as the only level and then compare it to the DGE outcome using a multifactorial design? > > This would be the way I start the multifactorial analysis: > > dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,desi gn=~condition + libType) >> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRE COMP","COMP","JUV","ADULT")) >> design(dds)<-formula(~libType + condition) > > Am I getting it right? > If so, I'll go ahead and keep you posted about the outcome Yes, you are getting it right -- I'd put the `condition` data on the Design data.frame before you create the dds, but I'm not if it will matter. Just follow closely the example in the deseq2 vignette. Read the entire vignette actually, so you understand how to get the particular results you are after out of your objects (ie. what the things are that you should pass into a call to `results` for instance). You will be working with two models, say `dds1` which was built with *only* the unstranded data and your design is ~ condition. dds2 will be the model with the unstranded and stranded along with the `~ libType + condition` design. Once you have those, look at the output from `resultsNames(dds1)` and `resultsNames(dds2)` and see that you compare the same results between dds1 and dds2. This should become more clear to you as you read the deseq2 vignette (read it again if you think you already read it once) and when you look with your data. Note that the DESeq2 folks recently posted an early version of the paper detaling the deseq2 method here: http://biorxiv.org/content/early/2014/02/19/002832 Which would be helpful to read. -steve -- Steve Lianoglou Computational Biologist Genentech [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 10.1 years ago Federico Gaiti ▴ 130

0

Entering edit mode

Hi Federico, Yes, Simon is right. Please ignore my previous email. Sorry for adding to the confusion. Mike On Feb 27, 2014 5:51 PM, "Federico Gaiti" <f.gaiti@uq.edu.au> wrote: > Hi Mike, > > Thanks for reply. I see what you mean but I'm a bit confused about ht-seq > count now. > > Please see also an open thread with Simon Anders where I'm discussing this > in details: > http://seqanswers.com/forums/showthread.php?p=133959&posted=1#post13 3959 > Sorry for the crosspost I know, I just didn't know it's not a good idea. > It's actually one of the first question I post online so I'm still learning > how all this works. > > I am investgating lncRNAs, which can be intronic, intergenic, can overlap > on the same strand of another gene or have anti-sense orientation. That's > what I meant with "I need to have anti-sense transcription". I need the > stranded data to account for this. > > As for htseq-count I thought that depending on the library preparation for > stranded libraries I could select -s reverse or -s yes. And so to be sure > I did a quick test on the stranded libraries using infer_experiment.py, it > is indeed forward-reverse. > > > > > > > > > > > > > > *This is PairEnd Data Fraction of reads explained by "1++,1--,2+-,2-+": > 0.9189 Fraction of reads explained by "1+-,1-+,2++,2--": 0.0811 Fraction of > reads explained by other combinations: 0.0000 1++,1â,2+-,2-+ read1 mapped > to â+â strand indicates parental gene on â+â strand read1 mapped to â-â > strand indicates parental gene on â-â strand read2 mapped to â+â strand > indicates parental gene on â-â strand read2 mapped to â-â strand indicates > parental gene on â+â strand *Based on this I ran TOPHAT with > fr-secondstrand option and htseq-count with -s yes > > As Simon said in the other thread "if a read maps to a position which is > covered by a regular gene on one strand and by an overlapping antisense > transcript on the other strand, then this read will be counted as ambiguous > if you have set "stranded" to "no", because there is no information to > decide whether the read originated from the sense of from the antisense > transcript. > For "stranded=yes", however, the read will be counted for the feature that > is on the same strand as the read" > > So why would my stranded experiment counted with -s yes capture only > sense trasncription? Shouldn't my stranded experiment counted with -s yes capture > both sense and anti-sense transcription based on where the reads map? > Also, if selecting this option depends on the library prepration protocol > and not on the DGE design, shouldn't -s reverse be "wrong" in my case? > > Thanks for clarifications and help > Federico > > > > ------------------------------ > *From:* Michael Love [michaelisaiahlove@gmail.com] > *Sent:* Friday, 28 February 2014 2:23 AM > *To:* Federico Gaiti > *Cc:* Steve Lianoglou; bioconductor@r-project.org > *Subject:* Re: [BioC] Low number of replicates DESeq > > hi Federico, > > The question of design falls on what you are looking for. The > multifactorial design gets at differences which are consistent for both > stranded and unstranded experiments (though the unstranded has 3 times more > samples, so contributes more to a gene's likelihood of being detected DE > here). > > But to go back to an earlier point. You mentioned earlier: "I am > investigating long non-coding RNAs and so I need to have anti-sense > transcription quantification." Your current multifactorial analysis is > looking for consistent differences across developmental stages, between > sense transcription (your stranded experiment counted with -s yes rather > than -s reverse) and when you combine sense and anti-sense transcription > (your unstranded experiments counted with -s no). Anti-sense transcription > plays little role here if we assume that more reads are coming from sense > than anti-sense. > > Note that if you are looking in particular for differences in anti- sense > transcription across developmental stages, you need to use the -s reverse > option to htseq-count, and peform biological replicates. I don't see any > way around requiring more replicates, as there are both technical and > biological sources of variation which will be different in the stranded and > unstranded experiments. Adding in the unstranded data seems not so helpful, > as you are mixing a small signal of interest (anti-sense transcription) > with most likely a lot more reads coming from sense transcription. > > You mentioned, "I tried to use the option -s reverse for the stranded > data and still got really low correlation." Wouldn't this makes sense, > because you are comparing anti-sense transcription to the unstranded > protocol which is likely capturing mostly sense transcription? > > Mike > > > > > On Thu, Feb 27, 2014 at 3:36 AM, Federico Gaiti <f.gaiti@uq.edu.au> wrote: > >> Hi Steve, >> >> I carefully read the DESeq2 vignette (February 19, 2014) and then did the >> DGE using two different models as you suggested and then performed >> different contrasts. >> >> Multifactorial >> >> >> > head(CountTable) >> ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1 >> COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3 >> asmbl_1 30 0 24 48 84 5 1 1 47 15 >> 8 6 47 28 27 47 >> asmbl_10 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 >> asmbl_100 0 0 0 0 0 4 0 0 0 1 >> 2 3 2 5 5 7 >> asmbl_1000 0 8 0 7 5 5 19 7 14 4 >> 4 7 9 4 12 1 >> asmbl_10000 11 75 0 73 103 12 112 51 65 43 >> 17 57 56 18 23 63 >> asmbl_10001 0 1 0 0 3 11 6 4 4 4 >> 11 1 5 0 3 0 >> >> dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,des ign=~libType >> + condition) >> >> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRE COMP","COMP","JUV","ADULT")) >> design(dds)<-formula(~libType + condition ) >> dds<-DESeq(dds) >> res<-results(dds) >> resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) >> resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) >> resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) >> >> > resultsNames(dds) >> [1] "Intercept" "libTypestranded" >> [3] "libTypeunstranded" "conditionPRECOMP" >> [5] "conditionCOMP" "conditionJUV" >> [7] "conditionADULT" >> >> >> ONE FACTOR >> >> >> > head(CountTable) >> ADULT1 ADULT2 ADULT3 JUV1 JUV2 JUV3 COMP1 COMP2 COMP3 >> PRECOMP1 PRECOMP2 PRECOMP3 >> asmbl_1 0 24 48 5 1 1 15 8 6 >> 28 27 47 >> asmbl_10 0 0 0 0 0 0 0 0 0 >> 0 0 0 >> asmbl_100 0 0 0 4 0 0 1 2 3 >> 5 5 7 >> asmbl_1000 8 0 7 5 19 7 4 4 7 >> 4 12 1 >> asmbl_10000 75 0 73 12 112 51 43 17 57 >> 18 23 63 >> asmbl_10001 1 0 0 11 6 4 4 11 1 >> 0 3 0 >> >> >> >> >> dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,des ign=~condition) >> >> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRE COMP","COMP","JUV","ADULT")) >> design(dds)<-formula(~condition ) >> dds<-DESeq(dds) >> res<-results(dds) >> resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) >> resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) >> resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) >> >> > resultsNames(dds) >> [1] "Intercept" "conditionPRECOMP" >> [3] "conditionCOMP" "conditionJUV" >> [5] "conditionADULT" >> >> Here is the number of DE genes at a threshold of 0.05 (padj<0.05) >> >> PreComp-Comp Comp-Juv Juv- Adult >> Shared 1400 5541 5733 >> Multifactorial specific 98 1584 1304 >> One-factor specific 1436 1658 2480 >> >> As you can see considering *only* unstranded data in the analysis >> detected more DE genes but they seem comparable (at least to me). >> >> Any thougths on this? Should I rely on the multifactorial design? >> >> Thanks for help >> Fede >> >> ________________________________________ >> From: mailinglist.honeypot@gmail.com [mailinglist.honeypot@gmail.com] on >> behalf of Steve Lianoglou [lianoglou.steve@gene.com] >> Sent: Wednesday, 26 February 2014 7:03 PM >> To: Federico Gaiti >> Cc: bioconductor@r-project.org >> Subject: Re: [BioC] Low number of replicates DESeq >> >> Hi, >> >> On Wed, Feb 26, 2014 at 12:50 AM, Federico Gaiti <f.gaiti@uq.edu.au> >> wrote: >> > Hi Steve, >> > >> > thanks for the reply and sorry for all the code. >> > I'm still a beginner in this field so I'm still learning how to >> correctly formulate my questions/emails. >> >> Yeah, no problem, just pointing these things out -- keep in mind that >> it takes even experienced people time to wade through lots of code, so >> it's best to keep things short and sweet (with sufficient detail, of >> course ;-) >> >> > I agree with you about the PCA plot analysis. >> > Could you just explain better to me what you mean with " If this is the >> case, then encoding the libType as a main effect in your model (as you've >> done) should go a long ways in dealing with this issue for you." ?? >> > >> > So let's see if I got what you are saying. >> > Are you suggesting I should try to do a DGE with the undtranded data >> with "condition" as the only level and then compare it to the DGE outcome >> using a multifactorial design? >> > >> > This would be the way I start the multifactorial analysis: >> > >> > >> dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,des ign=~condition >> + libType) >> >> >> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRE COMP","COMP","JUV","ADULT")) >> >> design(dds)<-formula(~libType + condition) >> > >> > Am I getting it right? >> > If so, I'll go ahead and keep you posted about the outcome >> >> Yes, you are getting it right -- I'd put the `condition` data on the >> Design data.frame before you create the dds, but I'm not if it will >> matter. >> >> Just follow closely the example in the deseq2 vignette. Read the >> entire vignette actually, so you understand how to get the particular >> results you are after out of your objects (ie. what the things are >> that you should pass into a call to `results` for instance). >> >> You will be working with two models, say `dds1` which was built with >> *only* the unstranded data and your design is ~ condition. >> >> dds2 will be the model with the unstranded and stranded along with the >> `~ libType + condition` design. >> >> Once you have those, look at the output from `resultsNames(dds1)` and >> `resultsNames(dds2)` and see that you compare the same results between >> dds1 and dds2. >> >> This should become more clear to you as you read the deseq2 vignette >> (read it again if you think you already read it once) and when you >> look with your data. >> >> Note that the DESeq2 folks recently posted an early version of the >> paper detaling the deseq2 method here: >> http://biorxiv.org/content/early/2014/02/19/002832 >> >> Which would be helpful to read. >> >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Genentech >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]]

ADD REPLY • link 10.1 years ago Michael Love 41k

0

Entering edit mode

No worries at all. It's all a good exercise for me. I'm learning a lot just from this email exchange. Back to the DGE and considering the situation, what would you raccomend for the DGE on DESeq2? Thanks again Federico ________________________________ From: Michael Love [michaelisaiahlove@gmail.com] Sent: Friday, 28 February 2014 10:27 AM To: Federico Gaiti Cc: Steve Lianoglou; bioconductor@r-project.org Subject: RE: [BioC] Low number of replicates DESeq Hi Federico, Yes, Simon is right. Please ignore my previous email. Sorry for adding to the confusion. Mike On Feb 27, 2014 5:51 PM, "Federico Gaiti" <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote: Hi Mike, Thanks for reply. I see what you mean but I'm a bit confused about ht- seq count now. Please see also an open thread with Simon Anders where I'm discussing this in details: http://seqanswers.com/forums/showthread.php?p=133959& posted=1#post133959 Sorry for the crosspost I know, I just didn't know it's not a good idea. It's actually one of the first question I post online so I'm still learning how all this works. I am investgating lncRNAs, which can be intronic, intergenic, can overlap on the same strand of another gene or have anti-sense orientation. That's what I meant with "I need to have anti-sense transcription". I need the stranded data to account for this. As for htseq-count I thought that depending on the library preparation for stranded libraries I could select -s reverse or -s yes. And so to be sure I did a quick test on the stranded libraries using infer_experiment.py, it is indeed forward-reverse. This is PairEnd Data Fraction of reads explained by "1++,1--,2+-,2-+": 0.9189 Fraction of reads explained by "1+-,1-+,2++,2--": 0.0811 Fraction of reads explained by other combinations: 0.0000 1++,1,2+-,2-+ read1 mapped to + strand indicates parental gene on + strand read1 mapped to - strand indicates parental gene on - strand read2 mapped to + strand indicates parental gene on - strand read2 mapped to - strand indicates parental gene on + strand Based on this I ran TOPHAT with fr-secondstrand option and htseq-count with -s yes As Simon said in the other thread "if a read maps to a position which is covered by a regular gene on one strand and by an overlapping antisense transcript on the other strand, then this read will be counted as ambiguous if you have set "stranded" to "no", because there is no information to decide whether the read originated from the sense of from the antisense transcript. For "stranded=yes", however, the read will be counted for the feature that is on the same strand as the read" So why would my stranded experiment counted with -s yes capture only sense trasncription? Shouldn't my stranded experiment counted with -s yes capture both sense and anti-sense transcription based on where the reads map? Also, if selecting this option depends on the library prepration protocol and not on the DGE design, shouldn't -s reverse be "wrong" in my case? Thanks for clarifications and help Federico ________________________________ From: Michael Love [michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>] Sent: Friday, 28 February 2014 2:23 AM To: Federico Gaiti Cc: Steve Lianoglou; bioconductor@r-project.org<mailto:bioconductor@r-project.org> Subject: Re: [BioC] Low number of replicates DESeq hi Federico, The question of design falls on what you are looking for. The multifactorial design gets at differences which are consistent for both stranded and unstranded experiments (though the unstranded has 3 times more samples, so contributes more to a gene's likelihood of being detected DE here). But to go back to an earlier point. You mentioned earlier: "I am investigating long non-coding RNAs and so I need to have anti-sense transcription quantification." Your current multifactorial analysis is looking for consistent differences across developmental stages, between sense transcription (your stranded experiment counted with -s yes rather than -s reverse) and when you combine sense and anti-sense transcription (your unstranded experiments counted with -s no). Anti- sense transcription plays little role here if we assume that more reads are coming from sense than anti-sense. Note that if you are looking in particular for differences in anti- sense transcription across developmental stages, you need to use the -s reverse option to htseq-count, and peform biological replicates. I don't see any way around requiring more replicates, as there are both technical and biological sources of variation which will be different in the stranded and unstranded experiments. Adding in the unstranded data seems not so helpful, as you are mixing a small signal of interest (anti-sense transcription) with most likely a lot more reads coming from sense transcription. You mentioned, "I tried to use the option -s reverse for the stranded data and still got really low correlation." Wouldn't this makes sense, because you are comparing anti-sense transcription to the unstranded protocol which is likely capturing mostly sense transcription? Mike On Thu, Feb 27, 2014 at 3:36 AM, Federico Gaiti <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote: Hi Steve, I carefully read the DESeq2 vignette (February 19, 2014) and then did the DGE using two different models as you suggested and then performed different contrasts. Multifactorial > head(CountTable) ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1 COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3 asmbl_1 30 0 24 48 84 5 1 1 47 15 8 6 47 28 27 47 asmbl_10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 asmbl_100 0 0 0 0 0 4 0 0 0 1 2 3 2 5 5 7 asmbl_1000 0 8 0 7 5 5 19 7 14 4 4 7 9 4 12 1 asmbl_10000 11 75 0 73 103 12 112 51 65 43 17 57 56 18 23 63 asmbl_10001 0 1 0 0 3 11 6 4 4 4 11 1 5 0 3 0 dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~libType + condition) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) design(dds)<-formula(~libType + condition ) dds<-DESeq(dds) res<-results(dds) resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) > resultsNames(dds) [1] "Intercept" "libTypestranded" [3] "libTypeunstranded" "conditionPRECOMP" [5] "conditionCOMP" "conditionJUV" [7] "conditionADULT" ONE FACTOR > head(CountTable) ADULT1 ADULT2 ADULT3 JUV1 JUV2 JUV3 COMP1 COMP2 COMP3 PRECOMP1 PRECOMP2 PRECOMP3 asmbl_1 0 24 48 5 1 1 15 8 6 28 27 47 asmbl_10 0 0 0 0 0 0 0 0 0 0 0 0 asmbl_100 0 0 0 4 0 0 1 2 3 5 5 7 asmbl_1000 8 0 7 5 19 7 4 4 7 4 12 1 asmbl_10000 75 0 73 12 112 51 43 17 57 18 23 63 asmbl_10001 1 0 0 11 6 4 4 11 1 0 3 0 dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~condition) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) design(dds)<-formula(~condition ) dds<-DESeq(dds) res<-results(dds) resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) > resultsNames(dds) [1] "Intercept" "conditionPRECOMP" [3] "conditionCOMP" "conditionJUV" [5] "conditionADULT" Here is the number of DE genes at a threshold of 0.05 (padj<0.05) PreComp-Comp Comp-Juv Juv-Adult Shared 1400 5541 5733 Multifactorial specific 98 1584 1304 One-factor specific 1436 1658 2480 As you can see considering *only* unstranded data in the analysis detected more DE genes but they seem comparable (at least to me). Any thougths on this? Should I rely on the multifactorial design? Thanks for help Fede ________________________________________ From: mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com> [mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com> ] on behalf of Steve Lianoglou [lianoglou.steve@gene.com<mailto:lianoglou.steve@gene.com>] Sent: Wednesday, 26 February 2014 7:03 PM To: Federico Gaiti Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org> Subject: Re: [BioC] Low number of replicates DESeq Hi, On Wed, Feb 26, 2014 at 12:50 AM, Federico Gaiti <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote: > Hi Steve, > > thanks for the reply and sorry for all the code. > I'm still a beginner in this field so I'm still learning how to correctly formulate my questions/emails. Yeah, no problem, just pointing these things out -- keep in mind that it takes even experienced people time to wade through lots of code, so it's best to keep things short and sweet (with sufficient detail, of course ;-) > I agree with you about the PCA plot analysis. > Could you just explain better to me what you mean with " If this is the case, then encoding the libType as a main effect in your model (as you've done) should go a long ways in dealing with this issue for you." ?? > > So let's see if I got what you are saying. > Are you suggesting I should try to do a DGE with the undtranded data with "condition" as the only level and then compare it to the DGE outcome using a multifactorial design? > > This would be the way I start the multifactorial analysis: > > dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,desi gn=~condition + libType) >> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRE COMP","COMP","JUV","ADULT")) >> design(dds)<-formula(~libType + condition) > > Am I getting it right? > If so, I'll go ahead and keep you posted about the outcome Yes, you are getting it right -- I'd put the `condition` data on the Design data.frame before you create the dds, but I'm not if it will matter. Just follow closely the example in the deseq2 vignette. Read the entire vignette actually, so you understand how to get the particular results you are after out of your objects (ie. what the things are that you should pass into a call to `results` for instance). You will be working with two models, say `dds1` which was built with *only* the unstranded data and your design is ~ condition. dds2 will be the model with the unstranded and stranded along with the `~ libType + condition` design. Once you have those, look at the output from `resultsNames(dds1)` and `resultsNames(dds2)` and see that you compare the same results between dds1 and dds2. This should become more clear to you as you read the deseq2 vignette (read it again if you think you already read it once) and when you look with your data. Note that the DESeq2 folks recently posted an early version of the paper detaling the deseq2 method here: http://biorxiv.org/content/early/2014/02/19/002832 Which would be helpful to read. -steve -- Steve Lianoglou Computational Biologist Genentech [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 10.1 years ago Federico Gaiti ▴ 130

0

Entering edit mode

Then I would recommend the multifactorial design, as it's the best you can do without stranded replicates. it will be underpowered for transcripts which are mostly made up of those positions which Simon described: "position which is covered by a regular gene on one strand and by an overlapping antisense transcript on the other strand". Because for such transcripts, only the single replicate for the stranded experiments will contribute signal (if I am getting it correct this time). On Thu, Feb 27, 2014 at 7:36 PM, Federico Gaiti <f.gaiti@uq.edu.au> wrote: > No worries at all. > > It's all a good exercise for me. I'm learning a lot just from this email > exchange. > > Back to the DGE and considering the situation, what would you raccomend > for the DGE on DESeq2? > > Thanks again > Federico > > > ------------------------------ > *From:* Michael Love [michaelisaiahlove@gmail.com] > *Sent:* Friday, 28 February 2014 10:27 AM > > *To:* Federico Gaiti > *Cc:* Steve Lianoglou; bioconductor@r-project.org > *Subject:* RE: [BioC] Low number of replicates DESeq > > Hi Federico, > > Yes, Simon is right. Please ignore my previous email. Sorry for adding to > the confusion. > > Mike > On Feb 27, 2014 5:51 PM, "Federico Gaiti" <f.gaiti@uq.edu.au> wrote: > >> Hi Mike, >> >> Thanks for reply. I see what you mean but I'm a bit confused about ht-seq >> count now. >> >> Please see also an open thread with Simon Anders where I'm discussing >> this in details: >> http://seqanswers.com/forums/showthread.php?p=133959&posted=1#post1 33959 >> Sorry for the crosspost I know, I just didn't know it's not a good idea. >> It's actually one of the first question I post online so I'm still learning >> how all this works. >> >> I am investgating lncRNAs, which can be intronic, intergenic, can overlap >> on the same strand of another gene or have anti-sense orientation. That's >> what I meant with "I need to have anti-sense transcription". I need the >> stranded data to account for this. >> >> As for htseq-count I thought that depending on the library preparation >> for stranded libraries I could select -s reverse or -s yes. And so to be >> sure I did a quick test on the stranded libraries using >> infer_experiment.py, it is indeed forward-reverse. >> >> >> >> >> >> >> >> >> >> >> >> >> >> *This is PairEnd Data Fraction of reads explained by "1++,1--,2+-,2-+": >> 0.9189 Fraction of reads explained by "1+-,1-+,2++,2--": 0.0811 Fraction of >> reads explained by other combinations: 0.0000 1++,1â,2+-,2-+ read1 mapped >> to â+â strand indicates parental gene on â+â strand read1 mapped to â-â >> strand indicates parental gene on â-â strand read2 mapped to â+â strand >> indicates parental gene on â-â strand read2 mapped to â-â strand indicates >> parental gene on â+â strand *Based on this I ran TOPHAT with >> fr-secondstrand option and htseq-count with -s yes >> >> As Simon said in the other thread "if a read maps to a position which is >> covered by a regular gene on one strand and by an overlapping antisense >> transcript on the other strand, then this read will be counted as ambiguous >> if you have set "stranded" to "no", because there is no information to >> decide whether the read originated from the sense of from the antisense >> transcript. >> For "stranded=yes", however, the read will be counted for the feature >> that is on the same strand as the read" >> >> So why would my stranded experiment counted with -s yes capture only >> sense trasncription? Shouldn't my stranded experiment counted with -s >> yes capture both sense and anti-sense transcription based on where the >> reads map? >> Also, if selecting this option depends on the library prepration protocol >> and not on the DGE design, shouldn't -s reverse be "wrong" in my case? >> >> Thanks for clarifications and help >> Federico >> >> >> >> ------------------------------ >> *From:* Michael Love [michaelisaiahlove@gmail.com] >> *Sent:* Friday, 28 February 2014 2:23 AM >> *To:* Federico Gaiti >> *Cc:* Steve Lianoglou; bioconductor@r-project.org >> *Subject:* Re: [BioC] Low number of replicates DESeq >> >> hi Federico, >> >> The question of design falls on what you are looking for. The >> multifactorial design gets at differences which are consistent for both >> stranded and unstranded experiments (though the unstranded has 3 times more >> samples, so contributes more to a gene's likelihood of being detected DE >> here). >> >> But to go back to an earlier point. You mentioned earlier: "I am >> investigating long non-coding RNAs and so I need to have anti-sense >> transcription quantification." Your current multifactorial analysis is >> looking for consistent differences across developmental stages, between >> sense transcription (your stranded experiment counted with -s yes rather >> than -s reverse) and when you combine sense and anti-sense transcription >> (your unstranded experiments counted with -s no). Anti-sense transcription >> plays little role here if we assume that more reads are coming from sense >> than anti-sense. >> >> Note that if you are looking in particular for differences in >> anti-sense transcription across developmental stages, you need to use the >> -s reverse option to htseq-count, and peform biological replicates. I don't >> see any way around requiring more replicates, as there are both technical >> and biological sources of variation which will be different in the stranded >> and unstranded experiments. Adding in the unstranded data seems not so >> helpful, as you are mixing a small signal of interest (anti-sense >> transcription) with most likely a lot more reads coming from sense >> transcription. >> >> You mentioned, "I tried to use the option -s reverse for the stranded >> data and still got really low correlation." Wouldn't this makes sense, >> because you are comparing anti-sense transcription to the unstranded >> protocol which is likely capturing mostly sense transcription? >> >> Mike >> >> >> >> >> On Thu, Feb 27, 2014 at 3:36 AM, Federico Gaiti <f.gaiti@uq.edu.au>wrote: >> >>> Hi Steve, >>> >>> I carefully read the DESeq2 vignette (February 19, 2014) and then did >>> the DGE using two different models as you suggested and then performed >>> different contrasts. >>> >>> Multifactorial >>> >>> >>> > head(CountTable) >>> ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1 >>> COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3 >>> asmbl_1 30 0 24 48 84 5 1 1 47 15 >>> 8 6 47 28 27 47 >>> asmbl_10 0 0 0 0 0 0 0 0 0 0 >>> 0 0 0 0 0 0 >>> asmbl_100 0 0 0 0 0 4 0 0 0 1 >>> 2 3 2 5 5 7 >>> asmbl_1000 0 8 0 7 5 5 19 7 14 4 >>> 4 7 9 4 12 1 >>> asmbl_10000 11 75 0 73 103 12 112 51 65 43 >>> 17 57 56 18 23 63 >>> asmbl_10001 0 1 0 0 3 11 6 4 4 4 >>> 11 1 5 0 3 0 >>> >>> dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,de sign=~libType >>> + condition) >>> >>> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PR ECOMP","COMP","JUV","ADULT")) >>> design(dds)<-formula(~libType + condition ) >>> dds<-DESeq(dds) >>> res<-results(dds) >>> resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) >>> resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) >>> resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) >>> >>> > resultsNames(dds) >>> [1] "Intercept" "libTypestranded" >>> [3] "libTypeunstranded" "conditionPRECOMP" >>> [5] "conditionCOMP" "conditionJUV" >>> [7] "conditionADULT" >>> >>> >>> ONE FACTOR >>> >>> >>> > head(CountTable) >>> ADULT1 ADULT2 ADULT3 JUV1 JUV2 JUV3 COMP1 COMP2 COMP3 >>> PRECOMP1 PRECOMP2 PRECOMP3 >>> asmbl_1 0 24 48 5 1 1 15 8 6 >>> 28 27 47 >>> asmbl_10 0 0 0 0 0 0 0 0 0 >>> 0 0 0 >>> asmbl_100 0 0 0 4 0 0 1 2 3 >>> 5 5 7 >>> asmbl_1000 8 0 7 5 19 7 4 4 7 >>> 4 12 1 >>> asmbl_10000 75 0 73 12 112 51 43 17 57 >>> 18 23 63 >>> asmbl_10001 1 0 0 11 6 4 4 11 1 >>> 0 3 0 >>> >>> >>> >>> >>> dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,de sign=~condition) >>> >>> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PR ECOMP","COMP","JUV","ADULT")) >>> design(dds)<-formula(~condition ) >>> dds<-DESeq(dds) >>> res<-results(dds) >>> resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) >>> resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) >>> resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) >>> >>> > resultsNames(dds) >>> [1] "Intercept" "conditionPRECOMP" >>> [3] "conditionCOMP" "conditionJUV" >>> [5] "conditionADULT" >>> >>> Here is the number of DE genes at a threshold of 0.05 (padj<0.05) >>> >>> PreComp-Comp Comp-Juv Juv- Adult >>> Shared 1400 5541 5733 >>> Multifactorial specific 98 1584 1304 >>> One-factor specific 1436 1658 2480 >>> >>> As you can see considering *only* unstranded data in the analysis >>> detected more DE genes but they seem comparable (at least to me). >>> >>> Any thougths on this? Should I rely on the multifactorial design? >>> >>> Thanks for help >>> Fede >>> >>> ________________________________________ >>> From: mailinglist.honeypot@gmail.com [mailinglist.honeypot@gmail.com] >>> on behalf of Steve Lianoglou [lianoglou.steve@gene.com] >>> Sent: Wednesday, 26 February 2014 7:03 PM >>> To: Federico Gaiti >>> Cc: bioconductor@r-project.org >>> Subject: Re: [BioC] Low number of replicates DESeq >>> >>> Hi, >>> >>> On Wed, Feb 26, 2014 at 12:50 AM, Federico Gaiti <f.gaiti@uq.edu.au> >>> wrote: >>> > Hi Steve, >>> > >>> > thanks for the reply and sorry for all the code. >>> > I'm still a beginner in this field so I'm still learning how to >>> correctly formulate my questions/emails. >>> >>> Yeah, no problem, just pointing these things out -- keep in mind that >>> it takes even experienced people time to wade through lots of code, so >>> it's best to keep things short and sweet (with sufficient detail, of >>> course ;-) >>> >>> > I agree with you about the PCA plot analysis. >>> > Could you just explain better to me what you mean with " If this is >>> the case, then encoding the libType as a main effect in your model (as >>> you've done) should go a long ways in dealing with this issue for you." ?? >>> > >>> > So let's see if I got what you are saying. >>> > Are you suggesting I should try to do a DGE with the undtranded data >>> with "condition" as the only level and then compare it to the DGE outcome >>> using a multifactorial design? >>> > >>> > This would be the way I start the multifactorial analysis: >>> > >>> > >>> dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,de sign=~condition >>> + libType) >>> >> >>> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PR ECOMP","COMP","JUV","ADULT")) >>> >> design(dds)<-formula(~libType + condition) >>> > >>> > Am I getting it right? >>> > If so, I'll go ahead and keep you posted about the outcome >>> >>> Yes, you are getting it right -- I'd put the `condition` data on the >>> Design data.frame before you create the dds, but I'm not if it will >>> matter. >>> >>> Just follow closely the example in the deseq2 vignette. Read the >>> entire vignette actually, so you understand how to get the particular >>> results you are after out of your objects (ie. what the things are >>> that you should pass into a call to `results` for instance). >>> >>> You will be working with two models, say `dds1` which was built with >>> *only* the unstranded data and your design is ~ condition. >>> >>> dds2 will be the model with the unstranded and stranded along with the >>> `~ libType + condition` design. >>> >>> Once you have those, look at the output from `resultsNames(dds1)` and >>> `resultsNames(dds2)` and see that you compare the same results between >>> dds1 and dds2. >>> >>> This should become more clear to you as you read the deseq2 vignette >>> (read it again if you think you already read it once) and when you >>> look with your data. >>> >>> Note that the DESeq2 folks recently posted an early version of the >>> paper detaling the deseq2 method here: >>> http://biorxiv.org/content/early/2014/02/19/002832 >>> >>> Which would be helpful to read. >>> >>> -steve >>> >>> -- >>> Steve Lianoglou >>> Computational Biologist >>> Genentech >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> [[alternative HTML version deleted]]

ADD REPLY • link 10.1 years ago Michael Love 41k

0

Entering edit mode

Hi Mike, It's nice to hear this from you!I feel I'm not too far to get the DGE done So back to DGE as per my previouos email, I carefully read the DESeq2 vignette (February 19, 2014) and then did the DGE using two different models as previously suggested. Multifactorial (considering both stranded and unstranded data) > head(CountTable) ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1 COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3 asmbl_1 30 0 24 48 84 5 1 1 47 15 8 6 47 28 27 47 asmbl_10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 asmbl_100 0 0 0 0 0 4 0 0 0 1 2 3 2 5 5 7 asmbl_1000 0 8 0 7 5 5 19 7 14 4 4 7 9 4 12 1 asmbl_10000 11 75 0 73 103 12 112 51 65 43 17 57 56 18 23 63 asmbl_10001 0 1 0 0 3 11 6 4 4 4 11 1 5 0 3 0 dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~libType + condition) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) design(dds)<-formula(~libType + condition ) dds<-DESeq(dds) res<-results(dds) resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) > resultsNames(dds) [1] "Intercept" "libTypestranded" [3] "libTypeunstranded" "conditionPRECOMP" [5] "conditionCOMP" "conditionJUV" [7] "conditionADULT" ONE FACTOR (only the 3 replicates from unstranded data) > head(CountTable) ADULT1 ADULT2 ADULT3 JUV1 JUV2 JUV3 COMP1 COMP2 COMP3 PRECOMP1 PRECOMP2 PRECOMP3 asmbl_1 0 24 48 5 1 1 15 8 6 28 27 47 asmbl_10 0 0 0 0 0 0 0 0 0 0 0 0 asmbl_100 0 0 0 4 0 0 1 2 3 5 5 7 asmbl_1000 8 0 7 5 19 7 4 4 7 4 12 1 asmbl_10000 75 0 73 12 112 51 43 17 57 18 23 63 asmbl_10001 1 0 0 11 6 4 4 11 1 0 3 0 dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~condition) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) design(dds)<-formula(~condition ) dds<-DESeq(dds) res<-results(dds) resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) > resultsNames(dds) [1] "Intercept" "conditionPRECOMP" [3] "conditionCOMP" "conditionJUV" [5] "conditionADULT" Here is the number of DE genes at a threshold of 0.05 (padj<0.05) PreComp-Comp Comp-Juv Juv-Adult Shared 1400 5541 5733 Multifactorial specific 98 1584 1304 One-factor specific 1436 1658 2480 As you can see considering *only* unstranded data in the analysis detected more DE genes but they seem comparable (at least to me). As you said though the multifactorial design would be the best. Any thougths on this? Should I lock the multifactorial design in and work with these DE genes for my downstream analysis or do you have any further suggestions? Thanks for help Fede Federico Gaiti Ph.D. Candidate Goddard Building (8), Room 339 School of Biological Sciences University of Queensland St Lucia QLD 4072 Australia f.gaiti@uq.edu.au ________________________________ From: Michael Love [michaelisaiahlove@gmail.com] Sent: Friday, 28 February 2014 12:11 PM To: Federico Gaiti Cc: Steve Lianoglou; bioconductor@r-project.org Subject: Re: [BioC] Low number of replicates DESeq Then I would recommend the multifactorial design, as it's the best you can do without stranded replicates. it will be underpowered for transcripts which are mostly made up of those positions which Simon described: "position which is covered by a regular gene on one strand and by an overlapping antisense transcript on the other strand". Because for such transcripts, only the single replicate for the stranded experiments will contribute signal (if I am getting it correct this time). On Thu, Feb 27, 2014 at 7:36 PM, Federico Gaiti <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote: No worries at all. It's all a good exercise for me. I'm learning a lot just from this email exchange. Back to the DGE and considering the situation, what would you raccomend for the DGE on DESeq2? Thanks again Federico ________________________________ From: Michael Love [michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>] Sent: Friday, 28 February 2014 10:27 AM To: Federico Gaiti Cc: Steve Lianoglou; bioconductor@r-project.org<mailto:bioconductor@r-project.org> Subject: RE: [BioC] Low number of replicates DESeq Hi Federico, Yes, Simon is right. Please ignore my previous email. Sorry for adding to the confusion. Mike On Feb 27, 2014 5:51 PM, "Federico Gaiti" <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote: Hi Mike, Thanks for reply. I see what you mean but I'm a bit confused about ht- seq count now. Please see also an open thread with Simon Anders where I'm discussing this in details: http://seqanswers.com/forums/showthread.php?p=133959& posted=1#post133959 Sorry for the crosspost I know, I just didn't know it's not a good idea. It's actually one of the first question I post online so I'm still learning how all this works. I am investgating lncRNAs, which can be intronic, intergenic, can overlap on the same strand of another gene or have anti-sense orientation. That's what I meant with "I need to have anti-sense transcription". I need the stranded data to account for this. As for htseq-count I thought that depending on the library preparation for stranded libraries I could select -s reverse or -s yes. And so to be sure I did a quick test on the stranded libraries using infer_experiment.py, it is indeed forward-reverse. This is PairEnd Data Fraction of reads explained by "1++,1--,2+-,2-+": 0.9189 Fraction of reads explained by "1+-,1-+,2++,2--": 0.0811 Fraction of reads explained by other combinations: 0.0000 1++,1,2+-,2-+ read1 mapped to + strand indicates parental gene on + strand read1 mapped to - strand indicates parental gene on - strand read2 mapped to + strand indicates parental gene on - strand read2 mapped to - strand indicates parental gene on + strand Based on this I ran TOPHAT with fr-secondstrand option and htseq-count with -s yes As Simon said in the other thread "if a read maps to a position which is covered by a regular gene on one strand and by an overlapping antisense transcript on the other strand, then this read will be counted as ambiguous if you have set "stranded" to "no", because there is no information to decide whether the read originated from the sense of from the antisense transcript. For "stranded=yes", however, the read will be counted for the feature that is on the same strand as the read" So why would my stranded experiment counted with -s yes capture only sense trasncription? Shouldn't my stranded experiment counted with -s yes capture both sense and anti-sense transcription based on where the reads map? Also, if selecting this option depends on the library prepration protocol and not on the DGE design, shouldn't -s reverse be "wrong" in my case? Thanks for clarifications and help Federico ________________________________ From: Michael Love [michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>] Sent: Friday, 28 February 2014 2:23 AM To: Federico Gaiti Cc: Steve Lianoglou; bioconductor@r-project.org<mailto:bioconductor@r-project.org> Subject: Re: [BioC] Low number of replicates DESeq hi Federico, The question of design falls on what you are looking for. The multifactorial design gets at differences which are consistent for both stranded and unstranded experiments (though the unstranded has 3 times more samples, so contributes more to a gene's likelihood of being detected DE here). But to go back to an earlier point. You mentioned earlier: "I am investigating long non-coding RNAs and so I need to have anti-sense transcription quantification." Your current multifactorial analysis is looking for consistent differences across developmental stages, between sense transcription (your stranded experiment counted with -s yes rather than -s reverse) and when you combine sense and anti-sense transcription (your unstranded experiments counted with -s no). Anti- sense transcription plays little role here if we assume that more reads are coming from sense than anti-sense. Note that if you are looking in particular for differences in anti- sense transcription across developmental stages, you need to use the -s reverse option to htseq-count, and peform biological replicates. I don't see any way around requiring more replicates, as there are both technical and biological sources of variation which will be different in the stranded and unstranded experiments. Adding in the unstranded data seems not so helpful, as you are mixing a small signal of interest (anti-sense transcription) with most likely a lot more reads coming from sense transcription. You mentioned, "I tried to use the option -s reverse for the stranded data and still got really low correlation." Wouldn't this makes sense, because you are comparing anti-sense transcription to the unstranded protocol which is likely capturing mostly sense transcription? Mike On Thu, Feb 27, 2014 at 3:36 AM, Federico Gaiti <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote: Hi Steve, I carefully read the DESeq2 vignette (February 19, 2014) and then did the DGE using two different models as you suggested and then performed different contrasts. Multifactorial > head(CountTable) ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1 COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3 asmbl_1 30 0 24 48 84 5 1 1 47 15 8 6 47 28 27 47 asmbl_10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 asmbl_100 0 0 0 0 0 4 0 0 0 1 2 3 2 5 5 7 asmbl_1000 0 8 0 7 5 5 19 7 14 4 4 7 9 4 12 1 asmbl_10000 11 75 0 73 103 12 112 51 65 43 17 57 56 18 23 63 asmbl_10001 0 1 0 0 3 11 6 4 4 4 11 1 5 0 3 0 dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~libType + condition) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) design(dds)<-formula(~libType + condition ) dds<-DESeq(dds) res<-results(dds) resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) > resultsNames(dds) [1] "Intercept" "libTypestranded" [3] "libTypeunstranded" "conditionPRECOMP" [5] "conditionCOMP" "conditionJUV" [7] "conditionADULT" ONE FACTOR > head(CountTable) ADULT1 ADULT2 ADULT3 JUV1 JUV2 JUV3 COMP1 COMP2 COMP3 PRECOMP1 PRECOMP2 PRECOMP3 asmbl_1 0 24 48 5 1 1 15 8 6 28 27 47 asmbl_10 0 0 0 0 0 0 0 0 0 0 0 0 asmbl_100 0 0 0 4 0 0 1 2 3 5 5 7 asmbl_1000 8 0 7 5 19 7 4 4 7 4 12 1 asmbl_10000 75 0 73 12 112 51 43 17 57 18 23 63 asmbl_10001 1 0 0 11 6 4 4 11 1 0 3 0 dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~condition) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) design(dds)<-formula(~condition ) dds<-DESeq(dds) res<-results(dds) resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT")) resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV")) resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP")) > resultsNames(dds) [1] "Intercept" "conditionPRECOMP" [3] "conditionCOMP" "conditionJUV" [5] "conditionADULT" Here is the number of DE genes at a threshold of 0.05 (padj<0.05) PreComp-Comp Comp-Juv Juv-Adult Shared 1400 5541 5733 Multifactorial specific 98 1584 1304 One-factor specific 1436 1658 2480 As you can see considering *only* unstranded data in the analysis detected more DE genes but they seem comparable (at least to me). Any thougths on this? Should I rely on the multifactorial design? Thanks for help Fede ________________________________________ From: mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com> [mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com> ] on behalf of Steve Lianoglou [lianoglou.steve@gene.com<mailto:lianoglou.steve@gene.com>] Sent: Wednesday, 26 February 2014 7:03 PM To: Federico Gaiti Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org> Subject: Re: [BioC] Low number of replicates DESeq Hi, On Wed, Feb 26, 2014 at 12:50 AM, Federico Gaiti <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote: > Hi Steve, > > thanks for the reply and sorry for all the code. > I'm still a beginner in this field so I'm still learning how to correctly formulate my questions/emails. Yeah, no problem, just pointing these things out -- keep in mind that it takes even experienced people time to wade through lots of code, so it's best to keep things short and sweet (with sufficient detail, of course ;-) > I agree with you about the PCA plot analysis. > Could you just explain better to me what you mean with " If this is the case, then encoding the libType as a main effect in your model (as you've done) should go a long ways in dealing with this issue for you." ?? > > So let's see if I got what you are saying. > Are you suggesting I should try to do a DGE with the undtranded data with "condition" as the only level and then compare it to the DGE outcome using a multifactorial design? > > This would be the way I start the multifactorial analysis: > > dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,desi gn=~condition + libType) >> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRE COMP","COMP","JUV","ADULT")) >> design(dds)<-formula(~libType + condition) > > Am I getting it right? > If so, I'll go ahead and keep you posted about the outcome Yes, you are getting it right -- I'd put the `condition` data on the Design data.frame before you create the dds, but I'm not if it will matter. Just follow closely the example in the deseq2 vignette. Read the entire vignette actually, so you understand how to get the particular results you are after out of your objects (ie. what the things are that you should pass into a call to `results` for instance). You will be working with two models, say `dds1` which was built with *only* the unstranded data and your design is ~ condition. dds2 will be the model with the unstranded and stranded along with the `~ libType + condition` design. Once you have those, look at the output from `resultsNames(dds1)` and `resultsNames(dds2)` and see that you compare the same results between dds1 and dds2. This should become more clear to you as you read the deseq2 vignette (read it again if you think you already read it once) and when you look with your data. Note that the DESeq2 folks recently posted an early version of the paper detaling the deseq2 method here: http://biorxiv.org/content/early/2014/02/19/002832 Which would be helpful to read. -steve -- Steve Lianoglou Computational Biologist Genentech [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 10.1 years ago Federico Gaiti ▴ 130

0

Entering edit mode

hi Federico, I would second Steve's recommendation to use DESeq2 and to take a look at the vignette section on sample clustering and visualization. For instance, it appears you are looking at correlations in the raw count scale, while we recommend some kind of transformation into the log scale (we offer two such transformations in DESeq2 which are described in the vignette). Even log(x + 1) would be better than the raw count scale, where inter-sample distances will be dominated by a few large counts. As far as DGE analysis, you can add covariates to the design formula -- as discussed in the vignette -- which would account for differences due to the change in protocol. This would look like : design(dds) <- ~ protocol + condition This setup will look for consistent log fold changes due to condition, while accounting for the stranded/unstranded differences. However, I would not recommend using the dispersions estimated from the unstranded protocol to perform inference on the single replicate stranded dataset, because it might be the case that the stranded protocol results in more highly dispersed counts. Then large differences would appear to be unlikely due to chance, when they really are due to chance, i.e. generating false positives. You really need replicates to perform statistical inference on the stranded experiment. Mike On Mon, Feb 24, 2014 at 8:18 PM, Federico Gaiti <f.gaiti@uq.edu.au> wrote: > Thanks Steve -- I'll have a look at the DESeq2 vignette. > > Here is what I've done, let me know if it's still unclear > > samtools view TOPHAT_STRANDED_sorted.bam | python -m HTSeq.scripts.count > -s no - gtfFile > stranded_counts_noS.txt > > head(Counts) > > 1.1 1.2 1.3 1.4 2.1 2.2 2.3 2.4 3.1 > 3.2 3.3 3.4 4.1 4.2 4.3 4.4 > > XXXX 9 0 24 48 30 5 1 1 21 > 15 8 6 28 28 27 47 > > XXXX 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 > > XXXX 16 0 0 0 19 4 0 0 40 > 1 2 3 78 5 5 7 > > XXXX 0 8 0 7 5 5 19 7 14 > 4 4 7 9 4 12 1 > ...... > > and then in R: > > data<-read.table("Counts",header=TRUE,row.names=1) > head(data) > Design=data.frame( > row.names=colnames(data), > > condition=c('1','1','1','1','2','2','2','2','3','3','3','3','4','4' ,'4','4')) > cds=newCountDataSet(data,Design) > cds=estimateSizeFactors(cds) > sizeFactors(cds) > All_samples_normalized<-counts(cds,normalized=TRUE) > table(rowSums(All_samples_normalized)==0) > data<-All_samples_normalized[rowSums(All_samples_normalized)>0,] > data_subset_matrix<-as.matrix(data) > Spearman_normalized<-rcorr(data_subset_matrix,type="spearman") > A<-Spearman_normalized$r > write.table(A,file="Spearman_normalized_allsamples.txt") > Pearson_normalized<-rcorr(data_subset_matrix,type="pearson") > B<-Pearson_normalized$r > write.table(B,file="Pearson_normalized_allsamples.txt") > A_matrix<-as.matrix(A) > B_matrix<-as.matrix(B) > corrgram(B_matrix,order="PCA") > corrgram(A_matrix,order="PCA") > pca3=PCA(A_matrix,graph=TRUE) > pca3=PCA(B_matrix,graph=TRUE) > > This gave me a nice correlation but I'd still like to to use the stranded > counts for the DGE. > The issue is that if I only use stranded data I don't have replicates. If > I include the 3 unstranded replicates I need to use the option -s no for > the stranded data because otherwise stranded and unstranded do not > correlate. > > So my ideas was to use the unstranded data to estimate the level of > variation to get a threshold for DE detection but still use the stranded > data as expression values. > > Would it be possible? Or would it be better to stick to the -s no option > for the DGE? > > Thanks > Federico > > > > > ________________________________________ > From: mailinglist.honeypot@gmail.com [mailinglist.honeypot@gmail.com] on > behalf of Steve Lianoglou [lianoglou.steve@gene.com] > Sent: Tuesday, 25 February 2014 10:34 AM > To: Federico Gaiti > Cc: bioconductor@r-project.org > Subject: Re: [BioC] Low number of replicates DESeq > > Hi, > > Since you are just starting your analysis and are in the world of > DESeq, you should probably switch to DESeq2. > > You mention things about "low correlation" but it's not clear what > conditions you are comparing where. Instead of describing your > analysis at a high level, showing the code that you used would be more > helpful. > > That having been said, the first thing I would do is to perform the > steps outlined in the DESeq2 vignette under the section "Data quality > assessment by sample clustering and visualization" to see if your > replicate data cluster closely together in meaningful ways using the > heatmaps and PCA plots outlined there. > > HTH, > -steve > > On Mon, Feb 24, 2014 at 3:31 PM, Federico Gaiti <f.gaiti@uq.edu.au> wrote: > > Hi all, > > > > I am using DESEq for a DGE analysis. > > > > I have STRANDED RNA-Seq data for 4 developmental stages with no > replicates but I know that to have a more reliable DGE I should have > replicates. So I got (from another lab member) UNSTRANDED RNA-Seq data with > 3 replicates per stage. > > > > So my data situation at the moment is: > > > > STAGE 1 stranded > > STAGE 1.1 unstranded > > STAGE 1.2 unstranded > > STAGE 1.3 unstranded > > STAGE 2 stranded > > STAGE 2.1 unstranded > > STAGE 2.2 unstranded > > STAGE 2.3 unstranded > > STAGE 3 stranded > > STAGE 3.1 unstranded > > STAGE 3.2 unstranded > > STAGE 3.3 unstranded > > STAGE 4 stranded > > STAGE 4.1 unstranded > > STAGE 4.2 unstranded > > STAGE 4.3 unstranded > > > > Before doing a DGE, I thought to test the correlation between these > samples, just to show that similar samples "cluster" together. If so, I > thought to use the unstranded data for my DGE analysis to reach the final > number of 4 replicates per stage. > > > > I mapped the raw reads to the genome using TOPHAT (v2.0.9) > (fr-unstranded for unstranded data and fr-secondstrand for stranded data), > used htseq-count (HTSeq 0.5.4p5) to get the raw reads counts for both the > data. For the stranded data I used the option -s yes and for the unstranded > data I used -s no. I then used DESeq (v1.14.0) to include metadata and for > normalization, and I removed the genes that always have a 0 value. I then > calcualted the correlation which was really low. > > > > I tried to use the option -s reverse for the stranded data and still got > really low correlation. So I reran htseq-count on the stranded data > selecting the option -s no and in this way I got a very similar number of > total counts between the unstranded and stranded data, around 4-5M counts > each stage (while both cases before the stranded ones were double in > number). > > > > I included the metadata > > > > > >> Design > > condition > > ADULT ADULT > > ADULT1 ADULT > > ADULT2 ADULT > > ADULT3 ADULT > > JUV JUV > > JUV1 JUV > > JUV2 JUV > > JUV3 JUV > > COMP COMP > > COMP1 COMP > > COMP2 COMP > > COMP3 COMP > > PRECOMP PRECOMP > > PRECOMP1 PRECOMP > > PRECOMP2 PRECOMP > > PRECOMP3 PRECOMP > > > > and estimated the new size factors, normalized and calculated the new > correlation. Pearson performed pretty well, confirmed by both a PCA and > correlogram. So my initial idea was to do a DGE "treating" the stranded > data as unstranded, having 4 replicates per stage. Though, I'd still like > to figure out a way to use the stranded counts since I am not sure if I am > losing some information running htseq-count using -s no on the stranded > data. > > > > > > What I had in mind was using unstranded data to estimate the level of > variation to get a threshold for DE detection but still use the stranded > data as expression values. Not sure if I can do that though given one is > stranded and the other is not. > > > > > > I would like to hear from you if you have any thoughts about this. > > > > > > Let me know if you need any further details to better understand the > issue. > > > > > > Thanks in advance, > > > > Federico > > > > Federico Gaiti > > Ph.D. Candidate > > School of Biological Sciences > > University of Queensland > > St Lucia QLD 4072 > > Australia > > f.gaiti@uq.edu.au > > > > > > [[alternative HTML version deleted]] > > > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Steve Lianoglou > Computational Biologist > Genentech > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 10.2 years ago Michael Love 41k

0

Entering edit mode

Hi MiKe, thanks for useful suggestion. As you and Steve suggested, I followed the multifactorial design section in both DESeq and DESeq2 vignette and then checked the quality by sample clustering and visualization. In DESeq 1.14: CountTable=read.table("DGE.txt",header=TRUE,row.names=1) head(CountTable) Design=data.frame(row.names=colnames(CountTable),condition=c("ADULT"," ADULT","ADULT","ADULT","JUV","JUV","JUV","JUV","COMP","COMP","COMP","C OMP","PRECOMP","PRECOMP","PRECOMP","PRECOMP"),libType=c("stranded","un stranded","unstranded","unstranded","stranded","unstranded","unstrande d","unstranded","stranded","unstranded","unstranded","unstranded","str anded","unstranded","unstranded","unstranded")) cdsFull=newCountDataSet(CountTable,Design) head(cdsFull) cdsFull=estimateSizeFactors(cdsFull) sizeFactors(cdsFull) cdsFull=estimateDispersions(cdsFull) plotDispEsts(cdsFull) fit1=fitNbinomGLMs(cdsFull,count ~ libType + condition) str(fit1) fit0=fitNbinomGLMs(cdsFull,count ~ libType ) pvalsGLM = nbinomGLMTest( fit1, fit0 ) padjGLM = p.adjust( pvalsGLM, method="BH" ) head(fit1) cdsFullB=newCountDataSet(CountTable,Design$condition) cdsFullB=estimateSizeFactors(cdsFullB) cdsFullB=estimateDispersions(cdsFullB) rs = rowSums ( counts ( cdsFull )) theta = 0.4 use = (rs > quantile(rs, probs=theta)) table(use) cdsFilt = cdsFull[ use, ] fitFilt1 = fitNbinomGLMs( cdsFilt, count ~ libType + condition ) fitFilt0 = fitNbinomGLMs( cdsFilt, count ~ libType ) pvalsFilt = nbinomGLMTest( fitFilt1, fitFilt0 ) padjFilt = p.adjust(pvalsFilt, method="BH" ) padjFiltForComparison = rep(+Inf, length(padjGLM)) padjFiltForComparison[use] = padjFilt tab3 = table( `no filtering` = padjGLM < .05,`with filtering` = padjFiltForComparison < .05 ) addmargins(tab3) res=cbind(fit1,pvalsGLM=pvalsGLM,padjGLM=padjGLM) resSig=res[which(res$padjGLM<0.05),] table(res$padjGLM<0.05) write.csv(res,file='res_Multifactor_DESeq.csv') hist(res$pvalsGLM,breaks=100) resFilt=cbind(fitFilt1,pvalsFilt=pvalsFilt,padjFilt=padjFilt) resSigFilt=resFilt[which(resFilt$padjFilt<0.05),] table(resFilt$padjFilt<0.05) write.csv(resFilt,file='resFilt_Multifactor_DESeq.csv') plot(rank(rs)/length(rs), -log10(pvalsGLM), pch=16, cex=0.45) h1 = hist(pvalsGLM[!use], breaks=50, plot=FALSE) h2 = hist(pvalsGLM[use], breaks=50, plot=FALSE) colori = c(`do not pass`="khaki", `pass`="powderblue") barplot(height = rbind(h1$counts, h2$counts), beside = FALSE, col = colori,space = 0, main = "", ylab="frequency") text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)), adj = c(0.5,1.7), xpd=NA) legend("topright", fill=rev(colori), legend=rev(names(colori))) cdsFullBlind = estimateDispersions( cdsFull, method = "blind" ) vsdFull = varianceStabilizingTransformation( cdsFullBlind ) select = order(rowMeans(counts(cdsFull)), decreasing=TRUE)[1:100] hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100) heatmap.2(exprs(vsdFull)[select,], col = hmcol, trace="none", margin=c(10, 6)) heatmap.2(counts(cdsFull)[select,], col = hmcol, trace="none", margin=c(10,6)) dists = dist( t( exprs(vsdFull) ) ) mat = as.matrix( dists ) rownames(mat) = colnames(mat) = with(pData(cdsFullBlind), paste(condition, libType, sep=" : ")) heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) print(plotPCA(vsdFull, intgroup=c("condition", "libType"))) In DESeq2 1.3.41: CountTable=read.table("DGE.txt",header=TRUE,row.names=1) head(CountTable) Design=data.frame(row.names=colnames(CountTable),condition=c("ADULT"," ADULT","ADULT","ADULT","JUV","JUV","JUV","JUV","COMP","COMP","COMP","C OMP","PRECOMP","PRECOMP","PRECOMP","PRECOMP"),libType=c("stranded","un stranded","unstranded","unstranded","stranded","unstranded","unstrande d","unstranded","stranded","unstranded","unstranded","unstranded","str anded","unstranded","unstranded","unstranded")) dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design =~condition + libType) colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM P","COMP","JUV","ADULT")) design(dds)<-formula(~libType + condition ) dds<-DESeq(dds) rld <- rlogTransformation(dds, blind=TRUE) vsd <- varianceStabilizingTransformation(dds, blind=TRUE) vstMat <- assay(vsd) rlogMat <- assay(rld) px <- counts(dds)[,1] / sizeFactors(dds)[1] ord <- order(px) ord <- ord[px[ord] < 150] ord <- ord[seq(1, length(ord), length=50)] last <- ord[length(ord)] vstcol <- c("blue", "black") matplot(px[ord], cbind(assay(vsd)[, 1], log2(px))[ord, ], type="l", lty=1, col=vstcol, xlab="n", ylab="f(n)") legend("bottomright", legend = c(expression("variance stabilizing transformation"), expression(log[2](n/s[1]))),fill=vstcol) library("vsn") par(mfrow=c(1,3)) notAllZero <- (rowSums(counts(dds))>0) meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1), ylim = c(0,2.5)) meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5)) meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5)) select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100] hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6)) heatmap.2(assay(rld)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10, 6)) heatmap.2(assay(vsd)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10, 6)) distsRL <- dist(t(assay(rld))) select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:30] hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6)) distsRL <- dist(t(assay(rld))) mat <- as.matrix(distsRL) rownames(mat) <- colnames(mat) <- with(colData(dds), paste(condition)) rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition, libType, sep=" : ")) heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13)) print(plotPCA(rld, intgroup=c("condition", "libType"))) See attached the plots I obtained (*_MULTIFACTORIAL.jpeg) Let me know what you think I wouldn't bother using the stranded data if I was doing a vanilla DGE but I am investigating long non-coding RNAs and so I need to have anti-sense transcription quantification. This is reason that brought me to include the unstranded data in the analysis, to increase the number of replicates and test for DE to obtain a set of lncRNAs I can handle for downstream analysis. I really need to have replicates! Let me know if you have any further idea Thanks a lot for help Federico 6 February 2014 1:03 AM To: Federico Gaiti Cc: Steve Lianoglou; bioconductor at r-project.org Subject: Re: [BioC] Low number of replicates DESeq hi Federico, I would second Steve's recommendation to use DESeq2 and to take a look at the vignette section on sample clustering and visualization. For instance, it appears you are looking at correlations in the raw count scale, while we recommend some kind of transformation into the log scale (we offer two such transformations in DESeq2 which are described in the vignette). Even log(x + 1) would be better than the raw count scale, where inter-sample distances will be dominated by a few large counts. As far as DGE analysis, you can add covariates to the design formula -- as discussed in the vignette -- which would account for differences due to the change in protocol. This would look like : design(dds) <- ~ protocol + condition This setup will look for consistent log fold changes due to condition, while accounting for the stranded/unstranded differences. However, I would not recommend using the dispersions estimated from the unstranded protocol to perform inference on the single replicate stranded dataset, because it might be the case that the stranded protocol results in more highly dispersed counts. Then large differences would appear to be unlikely due to chance, when they really are due to chance, i.e. generating false positives. You really need replicates to perform statistical inference on the stranded experiment. Mike On Mon, Feb 24, 2014 at 8:18 PM, Federico Gaiti <f.gaiti at="" uq.edu.au<mailto:f.gaiti="" at="" uq.edu.au="">> wrote: Thanks Steve -- I'll have a look at the DESeq2 vignette. Here is what I've done, let me know if it's still unclear samtools view TOPHAT_STRANDED_sorted.bam | python -m HTSeq.scripts.count -s no - gtfFile > stranded_counts_noS.txt head(Counts) 1.1 1.2 1.3 1.4 2.1 2.2 2.3 2.4 3.1 3.2 3.3 3.4 4.1 4.2 4.3 4.4 XXXX 9 0 24 48 30 5 1 1 21 15 8 6 28 28 27 47 XXXX 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 XXXX 16 0 0 0 19 4 0 0 40 1 2 3 78 5 5 7 XXXX 0 8 0 7 5 5 19 7 14 4 4 7 9 4 12 1 ...... and then in R: data<-read.table("Counts",header=TRUE,row.names=1) head(data) Design=data.frame( row.names=colnames(data), condition=c('1','1','1','1','2','2','2','2','3','3','3','3','4','4' ,'4','4')) cds=newCountDataSet(data,Design) cds=estimateSizeFactors(cds) sizeFactors(cds) All_samples_normalized<-counts(cds,normalized=TRUE) table(rowSums(All_samples_normalized)==0) data<-All_samples_normalized[rowSums(All_samples_normalized)>0,] data_subset_matrix<-as.matrix(data) Spearman_normalized<-rcorr(data_subset_matrix,type="spearman") A<-Spearman_normalized$r write.table(A,file="Spearman_normalized_allsamples.txt") Pearson_normalized<-rcorr(data_subset_matrix,type="pearson") B<-Pearson_normalized$r write.table(B,file="Pearson_normalized_allsamples.txt") A_matrix<-as.matrix(A) B_matrix<-as.matrix(B) corrgram(B_matrix,order="PCA") corrgram(A_matrix,order="PCA") pca3=PCA(A_matrix,graph=TRUE) pca3=PCA(B_matrix,graph=TRUE) This gave me a nice correlation but I'd still like to to use the stranded counts for the DGE. The issue is that if I only use stranded data I don't have replicates. If I include the 3 unstranded replicates I need to use the option -s no for the stranded data because otherwise stranded and unstranded do not correlate. So my ideas was to use the unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values. Would it be possible? Or would it be better to stick to the -s no option for the DGE? Thanks Federico ________________________________________ From: mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com> [mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com> ] on behalf of Steve Lianoglou [lianoglou.steve@gene.com<mailto:lianoglou.steve@gene.com>] Sent: Tuesday, 25 February 2014 10:34 AM To: Federico Gaiti Cc: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> Subject: Re: [BioC] Low number of replicates DESeq Hi, Since you are just starting your analysis and are in the world of DESeq, you should probably switch to DESeq2. You mention things about "low correlation" but it's not clear what conditions you are comparing where. Instead of describing your analysis at a high level, showing the code that you used would be more helpful. That having been said, the first thing I would do is to perform the steps outlined in the DESeq2 vignette under the section "Data quality assessment by sample clustering and visualization" to see if your replicate data cluster closely together in meaningful ways using the heatmaps and PCA plots outlined there. HTH, -steve On Mon, Feb 24, 2014 at 3:31 PM, Federico Gaiti <f.gaiti at="" uq.edu.au<mailto:f.gaiti="" at="" uq.edu.au="">> wrote: > Hi all, > > I am using DESEq for a DGE analysis. > > I have STRANDED RNA-Seq data for 4 developmental stages with no replicates but I know that to have a more reliable DGE I should have replicates. So I got (from another lab member) UNSTRANDED RNA-Seq data with 3 replicates per stage. > > So my data situation at the moment is: > > STAGE 1 stranded > STAGE 1.1 unstranded > STAGE 1.2 unstranded > STAGE 1.3 unstranded > STAGE 2 stranded > STAGE 2.1 unstranded > STAGE 2.2 unstranded > STAGE 2.3 unstranded > STAGE 3 stranded > STAGE 3.1 unstranded > STAGE 3.2 unstranded > STAGE 3.3 unstranded > STAGE 4 stranded > STAGE 4.1 unstranded > STAGE 4.2 unstranded > STAGE 4.3 unstranded > > Before doing a DGE, I thought to test the correlation between these samples, just to show that similar samples "cluster" together. If so, I thought to use the unstranded data for my DGE analysis to reach the final number of 4 replicates per stage. > > I mapped the raw reads to the genome using TOPHAT (v2.0.9) (fr- unstranded for unstranded data and fr-secondstrand for stranded data), used htseq-count (HTSeq 0.5.4p5) to get the raw reads counts for both the data. For the stranded data I used the option -s yes and for the unstranded data I used -s no. I then used DESeq (v1.14.0) to include metadata and for normalization, and I removed the genes that always have a 0 value. I then calcualted the correlation which was really low. > > I tried to use the option -s reverse for the stranded data and still got really low correlation. So I reran htseq-count on the stranded data selecting the option -s no and in this way I got a very similar number of total counts between the unstranded and stranded data, around 4-5M counts each stage (while both cases before the stranded ones were double in number). > > I included the metadata > > >> Design > condition > ADULT ADULT > ADULT1 ADULT > ADULT2 ADULT > ADULT3 ADULT > JUV JUV > JUV1 JUV > JUV2 JUV > JUV3 JUV > COMP COMP > COMP1 COMP > COMP2 COMP > COMP3 COMP > PRECOMP PRECOMP > PRECOMP1 PRECOMP > PRECOMP2 PRECOMP > PRECOMP3 PRECOMP > > and estimated the new size factors, normalized and calculated the new correlation. Pearson performed pretty well, confirmed by both a PCA and correlogram. So my initial idea was to do a DGE "treating" the stranded data as unstranded, having 4 replicates per stage. Though, I'd still like to figure out a way to use the stranded counts since I am not sure if I am losing some information running htseq-count using -s no on the stranded data. > > > What I had in mind was using unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values. Not sure if I can do that though given one is stranded and the other is not. > > > I would like to hear from you if you have any thoughts about this. > > > Let me know if you need any further details to better understand the issue. > > > Thanks in advance, > > Federico > > Federico Gaiti > Ph.D. Candidate > School of Biological Sciences > University of Queensland > St Lucia QLD 4072 > Australia > f.gaiti at uq.edu.au<mailto:f.gaiti at="" uq.edu.au=""> > > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Steve Lianoglou Computational Biologist Genentech _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 10.2 years ago Federico Gaiti ▴ 130

Login before adding your answer.