Checking RNAseq DESeq pipeline

0

Entering edit mode

nac ▴ 280

@nac-4545

Last seen 9.6 years ago

Hi all, I have done an differential expression analysis using DESeq on a set of data and got a lot of differential expressed genes at the end. I would like people to check the code and the assumptions I made. A-Samples I am analysing 6 samples in total from 3 different cell types (E, S and H), S1and S2 ( biological replicates) H1, H2 and H3 ( biological replicate) E (no replicate) We took from one animal cells called S ( sample S1) we passaged several time these cells and isolated samples we called S2 ( sample S2) from the cell population S ,we took subclones which we cultivated with factors to obtain 3 independant cell lines of the same cell type which were derived independently from the S population, called H1 H2 and H3 (biological replicates). AT last, another cell line was derived from clone H1 which we call E1 (only sample). B-standard RNA seq was performed in these 6 samples AFter sequencing, we aligned the samples using topHat and count reads with HTSEQ python script. Duplicates haven't been removed in this process. I plan to do one analysis with removed duplicates. C-then once in DESeq I have created my count table and followed the vignette-I have test the diff Exp of condition F vs H > conds [1] E F F H H H Levels: E F H > head(countsTable) E F F.1 H H.1 H.2 1 789 195 235 1019 938 1048 2 1 0 0 11 17 31 3 543 325 460 818 883 644 4 103 89 103 114 87 96 5 281 41 102 328 245 387 6 1 2 1 9 8 7 > cds <- newCountDataSet( countsTable, conds ) > sizeFactors(cds) E F F.1 H H.1 H.2 1.0417242 0.6740896 0.8994357 1.2268668 1.1499580 1.1831515 > cds<-estimateDispersions(cds) > str( fitInfo(cds) ) List of 5 $ perGeneDispEsts: num [1:54665] 0.000954 0.592346 0.023319 0.000245 0.055592 ... $ dispFunc :function (q) ..- attr(*, "coefficients")= Named num [1:2] 0.0581 1.8325 .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois" ..- attr(*, "fitType")= chr "parametric" $ fittedDispEsts : num [1:54665] 0.061 0.2741 0.0612 0.0766 0.0669 ... $ df : int 3 $ sharingMode : chr "maximum" >head(fData(cds)) disp_pooled 1 0.06096110 2 0.59234620 3 0.06124489 4 0.07657296 5 0.06688364 > str( fitInfo(cds) ) List of 5 $ perGeneDispEsts: num [1:54665] 0.000954 0.592346 0.023319 0.000245 0.055592 ... $ dispFunc :function (q) ..- attr(*, "coefficients")= Named num [1:2] 0.0581 1.8325 .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois" ..- attr(*, "fitType")= chr "parametric" $ fittedDispEsts : num [1:54665] 0.061 0.2741 0.0612 0.0766 0.0669 ... $ df : int 3 $ sharingMode : chr "maximum" > res <- nbinomTest( cds, "H", "F" ) > head(res) id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval 1 1 616.515373 844.007629 275.276988 0.3261546 -1.6163720 6.817493e-06 2 2 9.990057 16.650096 0.000000 0.0000000 -Inf 2.662914e-03 3 3 594.493135 659.634053 496.781759 0.7531172 -0.4090537 2.522510e-01 4 4 99.251992 83.237929 123.273085 1.4809725 0.5665449 1.666241e-01 5 5 196.343734 269.163821 87.113605 0.3236453 -1.6275146 6.472665e-05 6 6 4.857542 6.736313 2.039386 0.3027452 -1.7238239 2.449359e-01 padj 1 4.775233e-05 2 1.088826e-02 3 5.047015e-01 4 3.642458e-01 5 3.807362e-04 6 4.934803e-01 plotDE <- function( res ) plot( res$baseMean, res$log2FoldChange, log="x", pch=20, cex=.3, col = ifelse( res$padj < .1, "red", "black" ) ) I have attached As pdf the plotDE( res ) , hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="") and plotDispEsts(cds) for a basic idea of the data. I have ~10k genes which show a v low pvalue for DExpression which seem a lot. I have several questions, 1-Concerning the experimental design I have described in A , can the samples be considered as real bio replicate? 2-Would it be "Better" to get rid of duplicated reads prior to HTSeq 3-How can I be more stringent on the expression? many thanks Nat -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. -------------- next part -------------- A non-text attachment was scrubbed... Name: test.pdf Type: application/pdf Size: 307013 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20120306="" ffd12aef="" attachment.pdf="">

Sequencing PROcess DESeq Sequencing PROcess DESeq • 1.3k views

ADD COMMENT • link updated 12.1 years ago by Wolfgang Huber ★ 13k • written 12.1 years ago by nac ▴ 280

0

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 15 days ago

EMBL European Molecular Biology Laborat…

Dear Nat the work flow you describe looks OK. I would recommend doing more visualisation and QA/QC on the data to make sure that what you are looking at is not confounded with something else (like reagent batches or protocol variation). Also, can you please make sure that you are using the most recent version of DESeq, in particular for what I propose below: http://bioconductor.org/packages/2.10/bioc/html/DESeq.html Can you try something like vsd <- getVarianceStabilizedData(cds) pairs(vsd, pch=".") library("arrayQualityMetrics") arrayQualityMetrics(vsd) Re PCR duplicates - did you use extensive PCR-amplication? How do the read alignments look along the gene models? Others may have more insight here, but if standard RNA-Seq was done, and read alignment profiles are not excessively dominated by duplicates, I think it is OK to use the data without duplicate removel. OTOH, it shouldn't make too much of a difference. Hope this helps, Wolfgang nathalie scripsit 03/06/2012 04:54 PM: > Hi all, > I have done an differential expression analysis using DESeq on a set of > data and got a lot of differential expressed genes at the end. I would > like people to check the code and the assumptions I made. > > A-Samples > I am analysing 6 samples in total from 3 different cell types (E, S and H), > S1and S2 ( biological replicates) H1, H2 and H3 ( biological replicate) > E (no replicate) > We took from one animal cells called S ( sample S1) we passaged several > time these cells and isolated samples we called S2 ( sample S2) > from the cell population S ,we took subclones which we cultivated with > factors to obtain 3 independant cell lines of the same cell type which > were derived independently from the S population, called H1 H2 and H3 > (biological replicates). > AT last, another cell line was derived from clone H1 which we call E1 > (only sample). > > B-standard RNA seq was performed in these 6 samples > AFter sequencing, we aligned the samples using topHat and count reads > with HTSEQ python script. > Duplicates haven't been removed in this process. I plan to do one > analysis with removed duplicates. > > C-then once in DESeq I have created my count table and followed the > vignette-I have test the diff Exp of condition F vs H > > conds > [1] E F F H H H > Levels: E F H > > head(countsTable) > E F F.1 H H.1 H.2 > 1 789 195 235 1019 938 1048 > 2 1 0 0 11 17 31 > 3 543 325 460 818 883 644 > 4 103 89 103 114 87 96 > 5 281 41 102 328 245 387 > 6 1 2 1 9 8 7 > > cds <- newCountDataSet( countsTable, conds ) > > > sizeFactors(cds) > E F F.1 H H.1 H.2 > 1.0417242 0.6740896 0.8994357 1.2268668 1.1499580 1.1831515 > > > cds<-estimateDispersions(cds) > > str( fitInfo(cds) ) > List of 5 > $ perGeneDispEsts: num [1:54665] 0.000954 0.592346 0.023319 0.000245 > 0.055592 ... > $ dispFunc :function (q) > ..- attr(*, "coefficients")= Named num [1:2] 0.0581 1.8325 > .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois" > ..- attr(*, "fitType")= chr "parametric" > $ fittedDispEsts : num [1:54665] 0.061 0.2741 0.0612 0.0766 0.0669 ... > $ df : int 3 > $ sharingMode : chr "maximum" > > >head(fData(cds)) > disp_pooled > 1 0.06096110 > 2 0.59234620 > 3 0.06124489 > 4 0.07657296 > 5 0.06688364 > > > str( fitInfo(cds) ) > List of 5 > $ perGeneDispEsts: num [1:54665] 0.000954 0.592346 0.023319 0.000245 > 0.055592 ... > $ dispFunc :function (q) > ..- attr(*, "coefficients")= Named num [1:2] 0.0581 1.8325 > .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois" > ..- attr(*, "fitType")= chr "parametric" > $ fittedDispEsts : num [1:54665] 0.061 0.2741 0.0612 0.0766 0.0669 ... > $ df : int 3 > $ sharingMode : chr "maximum" > > > res <- nbinomTest( cds, "H", "F" ) > > head(res) > id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval > 1 1 616.515373 844.007629 275.276988 0.3261546 -1.6163720 6.817493e-06 > 2 2 9.990057 16.650096 0.000000 0.0000000 -Inf 2.662914e-03 > 3 3 594.493135 659.634053 496.781759 0.7531172 -0.4090537 2.522510e-01 > 4 4 99.251992 83.237929 123.273085 1.4809725 0.5665449 1.666241e-01 > 5 5 196.343734 269.163821 87.113605 0.3236453 -1.6275146 6.472665e-05 > 6 6 4.857542 6.736313 2.039386 0.3027452 -1.7238239 2.449359e-01 > padj > 1 4.775233e-05 > 2 1.088826e-02 > 3 5.047015e-01 > 4 3.642458e-01 > 5 3.807362e-04 > 6 4.934803e-01 > > > plotDE <- function( res ) > plot( > res$baseMean, > res$log2FoldChange, > log="x", pch=20, cex=.3, > col = ifelse( res$padj < .1, "red", "black" ) ) > > > I have attached As pdf the plotDE( res ) , hist(res$pval, breaks=100, > col="skyblue", border="slateblue", main="") and plotDispEsts(cds) for a > basic idea of the data. > > > I have ~10k genes which show a v low pvalue for DExpression which seem a > lot. > > I have several questions, > 1-Concerning the experimental design I have described in A , can the > samples be considered as real bio replicate? > 2-Would it be "Better" to get rid of duplicated reads prior to HTSeq > 3-How can I be more stringent on the expression? > > > > many thanks > Nat > > > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber

ADD COMMENT • link 12.1 years ago Wolfgang Huber ★ 13k

Login before adding your answer.