Too many DEG with edgeR output?
2
0
Entering edit mode
biotech ▴ 160
@biotech-6194
Last seen 7.7 years ago
European Union
Hi there, I'm dealing with bacterial RNA-seq analysis. My experiment is very simple. Two samples to compare and no replicates. Reads were generated with Ion Torrent PGM using 316 chip. One for each sample and performed in different days. Since I had too many differentially expressed genes, ¿Should I be more conservative assigning edgeR dispersion value? Also, there are considerable more up-regulated genes in *exvivo* than in plate sample. See logFC_vs_logCPM figure: https://docs.google.com/file/d/0B8-ZAuZe8jldY0Y5SWJSdXh1WGc/edit?usp=s haring Thanks for you help, Bernardo - tmap code: tmap map2 -f HPNK_clean.fsa -r exvivo.fastq -i fastq -s exvivo.sam --verbose tmap map2 -f HPNK_clean.fsa -r plate.fastq -i fastq -s plate.sam --verbose - Flasgstat: exvivo: >3240242 + 0 in total (QC-passed reads + QC-failed reads) >2132481 + 0 mapped (65.81%:nan%) plate: >3774075 + 0 in total (QC-passed reads + QC-failed reads) >3510438 + 0 mapped (93.01%:nan%) - count: python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID exvivo.sam HPNK.gff > exvivo.counts python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID plate.sam HPNK.gff > plate.counts - count stats: ex-vivo stats >no_feature 777946 >ambiguous 1 >too_low_aQual 0 >not_aligned 1107761 >alignment_not_unique 0 plate stats >no_feature 776707 >ambiguous 47 >too_low_aQual 0 >not_aligned 263637 >alignment_not_unique 0 - edgeR code: library(edgeR) files <- dir(pattern="*\\.counts$") RG <- readDGE(files, header=FALSE) RG keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one count per million (cpm) in at least three samples RG <- RG[keep,] dim(RG) RG <- calcNormFactors(RG) RG$samples plotMDS(RG) bcv <- 0.2 #Assigned dispersion value of 0.2 m <- as.matrix(RG) d <- DGEList(counts=m, group=(1:2)) #modify 'group' depending on sample number. Also can be adapted to replicated samples, see'?DGEList'. d et <- exactTest(d, pair=(1:2),dispersion=bcv^2) #exactTest(RG, pair=(1:2),dispersion=bcv^2) et top <- topTags(et) top cpm(RG)[rownames(top), ] #Check the individual cpm values for the top genes: summary(de <- decideTestsDGE(et)) #The total number of DE genes at 5% FDR is given by'decideTestsDGE'. [,1] -1 200 0 1176 1 769 Of the 'number' tags identified as DE, 769 are up-regulated ex-vivo and 200 are down-regulated. detags <- rownames(RG)[as.logical(de)] #detags contains the DE genes at 5% FDR plotSmear(et, de.tags=detags) #plot all genes and highlight DE genes at 5% FDR abline(h=c(-1, 1), col="blue") #The blue lines indicate 2-fold changes. title("plate vs ex-vivo") dev.copy2pdf(file = "Figure_1.pdf") #Save as .pdf## -- *Bernardo Bello Ortí* PhD student CReSA-IRTA Campus de Bellaterra-Universitat Autònoma de Barcelona Edifici CReSA 08193 Bellaterra (Barcelona, Spain) Tel.: 647 42 52 63 *www.cresa.es * * * [[alternative HTML version deleted]]
edgeR edgeR • 1.7k views
ADD COMMENT
0
Entering edit mode
Mark Robinson ▴ 880
@mark-robinson-4908
Last seen 6.1 years ago
Hi Bernardo, A couple quick notes. 1. Filtering > keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one > count per million (cpm) in at least three samples Despite what the comment says, this keeps only features where CPM=1 or higher in BOTH of your samples. But, maybe you'd be interested in features that are sufficiently high in 1 condition and more or less absent in the other ? Using your filter, these are removed. 2. Statistics > bcv <- 0.2 #Assigned dispersion value of 0.2 I'm assuming this is just a random guess as to what the real dispersion is. One suggestion in the User's guide is this: "Be satis ed with a descriptive analysis, that might include an MDS plot and an analysis of fold changes. Do not attempt a signi ficance analysis. This may be the best advice." (2.9 What to do if you have no replicates; http://www.bioconductor.org /packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) If you feel that you have "too many" DE genes (I'm not sure exactly what you mean and/or how you know this), you could increase the dispersion, but overall, I wouldn't read too much into the statistics (i.e. P-values), if you don't have replicates. Best, Mark On 21.10.2013, at 11:23, Bernardo Bello <popnard at="" gmail.com=""> wrote: > Hi there, > > I'm dealing with bacterial RNA-seq analysis. My experiment is very simple. > Two samples to compare and no replicates. Reads were generated with Ion > Torrent PGM using 316 chip. One for each sample and performed in different > days. > > Since I had too many differentially expressed genes, ?Should I be more > conservative assigning edgeR dispersion value? Also, there are considerable > more up-regulated genes in *exvivo* than in plate sample. > > See logFC_vs_logCPM figure: > > https://docs.google.com/file/d/0B8-ZAuZe8jldY0Y5SWJSdXh1WGc/edit?usp =sharing > > Thanks for you help, Bernardo > > > > - tmap code: > > tmap map2 -f HPNK_clean.fsa -r exvivo.fastq -i fastq -s exvivo.sam --verbose > > tmap map2 -f HPNK_clean.fsa -r plate.fastq -i fastq -s plate.sam --verbose > > - Flasgstat: > > exvivo: > >> 3240242 + 0 in total (QC-passed reads + QC-failed reads) > > >> 2132481 + 0 mapped (65.81%:nan%) > > plate: > >> 3774075 + 0 in total (QC-passed reads + QC-failed reads) > > >> 3510438 + 0 mapped (93.01%:nan%) > > > - count: > > python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID > exvivo.sam HPNK.gff > exvivo.counts > > python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID > plate.sam HPNK.gff > plate.counts > > > - count stats: > > ex-vivo stats > >> no_feature 777946 > >> ambiguous 1 > >> too_low_aQual 0 > >> not_aligned 1107761 > >> alignment_not_unique 0 > > plate stats > >> no_feature 776707 > >> ambiguous 47 > >> too_low_aQual 0 > >> not_aligned 263637 > >> alignment_not_unique 0 > > - edgeR code: > > library(edgeR) > > files <- dir(pattern="*\\.counts$") > > RG <- readDGE(files, header=FALSE) > > RG > > keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one > count per million (cpm) in at least three samples > > RG <- RG[keep,] > > dim(RG) > > > RG <- calcNormFactors(RG) > > RG$samples > > > plotMDS(RG) > > > bcv <- 0.2 #Assigned dispersion value of 0.2 > > m <- as.matrix(RG) > > d <- DGEList(counts=m, group=(1:2)) #modify 'group' depending on sample > number. Also can be adapted to replicated samples, see'?DGEList'. > > d > > et <- exactTest(d, pair=(1:2),dispersion=bcv^2) #exactTest(RG, > pair=(1:2),dispersion=bcv^2) > > et > > top <- topTags(et) > > top > > cpm(RG)[rownames(top), ] #Check the individual cpm values for the top genes: > > summary(de <- decideTestsDGE(et)) #The total number of DE genes at 5% FDR > is given by'decideTestsDGE'. > > > [,1] > > -1 200 > > 0 1176 > > 1 769 > > Of the 'number' tags identified as DE, 769 are up-regulated ex-vivo and 200 > are down-regulated. > > detags <- rownames(RG)[as.logical(de)] #detags contains the DE genes at 5% > FDR > > plotSmear(et, de.tags=detags) #plot all genes and highlight DE genes at 5% > FDR > > abline(h=c(-1, 1), col="blue") #The blue lines indicate 2-fold changes. > > title("plate vs ex-vivo") > > dev.copy2pdf(file = "Figure_1.pdf") #Save as .pdf## > > -- > > *Bernardo Bello Ort?* > > PhD student > > CReSA-IRTA > > Campus de Bellaterra-Universitat Aut?noma de Barcelona > > Edifici CReSA > > 08193 Bellaterra (Barcelona, Spain) > > Tel.: 647 42 52 63 *www.cresa.es * > > * > * > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ---------- Prof. Dr. Mark Robinson Bioinformatics, Institute of Molecular Life Sciences University of Zurich http://tiny.cc/mrobin
ADD COMMENT
0
Entering edit mode
Hi Mark, Thanks for this tips. Best, Bernardo 2013/10/22 Mark Robinson <mark.robinson@imls.uzh.ch> > Hi Bernardo, > > A couple quick notes. > > 1. Filtering > > > keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one > > count per million (cpm) in at least three samples > > Despite what the comment says, this keeps only features where CPM=1 or > higher in BOTH of your samples. But, maybe you'd be interested in features > that are sufficiently high in 1 condition and more or less absent in the > other ? Using your filter, these are removed. > > > 2. Statistics > > > bcv <- 0.2 #Assigned dispersion value of 0.2 > > I'm assuming this is just a random guess as to what the real dispersion is. > > One suggestion in the User's guide is this: "Be satis ed with a > descriptive analysis, that might include an MDS plot and an analysis of > fold changes. Do not attempt a signi ficance analysis. This may be the best > advice." > > (2.9 What to do if you have no replicates; > http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/in st/doc/edgeRUsersGuide.pdf > ) > > If you feel that you have "too many" DE genes (I'm not sure exactly what > you mean and/or how you know this), you could increase the dispersion, but > overall, I wouldn't read too much into the statistics (i.e. P-values), if > you don't have replicates. > > > Best, Mark > > > On 21.10.2013, at 11:23, Bernardo Bello <popnard@gmail.com> wrote: > > > Hi there, > > > > I'm dealing with bacterial RNA-seq analysis. My experiment is very > simple. > > Two samples to compare and no replicates. Reads were generated with Ion > > Torrent PGM using 316 chip. One for each sample and performed in > different > > days. > > > > Since I had too many differentially expressed genes, ¿Should I be more > > conservative assigning edgeR dispersion value? Also, there are > considerable > > more up-regulated genes in *exvivo* than in plate sample. > > > > See logFC_vs_logCPM figure: > > > > > https://docs.google.com/file/d/0B8-ZAuZe8jldY0Y5SWJSdXh1WGc/edit?usp =sharing > > > > Thanks for you help, Bernardo > > > > > > > > - tmap code: > > > > tmap map2 -f HPNK_clean.fsa -r exvivo.fastq -i fastq -s exvivo.sam > --verbose > > > > tmap map2 -f HPNK_clean.fsa -r plate.fastq -i fastq -s plate.sam > --verbose > > > > - Flasgstat: > > > > exvivo: > > > >> 3240242 + 0 in total (QC-passed reads + QC-failed reads) > > > > > >> 2132481 + 0 mapped (65.81%:nan%) > > > > plate: > > > >> 3774075 + 0 in total (QC-passed reads + QC-failed reads) > > > > > >> 3510438 + 0 mapped (93.01%:nan%) > > > > > > - count: > > > > python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID > > exvivo.sam HPNK.gff > exvivo.counts > > > > python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID > > plate.sam HPNK.gff > plate.counts > > > > > > - count stats: > > > > ex-vivo stats > > > >> no_feature 777946 > > > >> ambiguous 1 > > > >> too_low_aQual 0 > > > >> not_aligned 1107761 > > > >> alignment_not_unique 0 > > > > plate stats > > > >> no_feature 776707 > > > >> ambiguous 47 > > > >> too_low_aQual 0 > > > >> not_aligned 263637 > > > >> alignment_not_unique 0 > > > > - edgeR code: > > > > library(edgeR) > > > > files <- dir(pattern="*\\.counts$") > > > > RG <- readDGE(files, header=FALSE) > > > > RG > > > > keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one > > count per million (cpm) in at least three samples > > > > RG <- RG[keep,] > > > > dim(RG) > > > > > > RG <- calcNormFactors(RG) > > > > RG$samples > > > > > > plotMDS(RG) > > > > > > bcv <- 0.2 #Assigned dispersion value of 0.2 > > > > m <- as.matrix(RG) > > > > d <- DGEList(counts=m, group=(1:2)) #modify 'group' depending on sample > > number. Also can be adapted to replicated samples, see'?DGEList'. > > > > d > > > > et <- exactTest(d, pair=(1:2),dispersion=bcv^2) #exactTest(RG, > > pair=(1:2),dispersion=bcv^2) > > > > et > > > > top <- topTags(et) > > > > top > > > > cpm(RG)[rownames(top), ] #Check the individual cpm values for the top > genes: > > > > summary(de <- decideTestsDGE(et)) #The total number of DE genes at 5% FDR > > is given by'decideTestsDGE'. > > > > > > [,1] > > > > -1 200 > > > > 0 1176 > > > > 1 769 > > > > Of the 'number' tags identified as DE, 769 are up-regulated ex- vivo and > 200 > > are down-regulated. > > > > detags <- rownames(RG)[as.logical(de)] #detags contains the DE genes at > 5% > > FDR > > > > plotSmear(et, de.tags=detags) #plot all genes and highlight DE genes at > 5% > > FDR > > > > abline(h=c(-1, 1), col="blue") #The blue lines indicate 2-fold changes. > > > > title("plate vs ex-vivo") > > > > dev.copy2pdf(file = "Figure_1.pdf") #Save as .pdf## > > > > -- > > > > *Bernardo Bello Ortí* > > > > PhD student > > > > CReSA-IRTA > > > > Campus de Bellaterra-Universitat Autònoma de Barcelona > > > > Edifici CReSA > > > > 08193 Bellaterra (Barcelona, Spain) > > > > Tel.: 647 42 52 63 *www.cresa.es * > > > > * > > * > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > ---------- > Prof. Dr. Mark Robinson > Bioinformatics, Institute of Molecular Life Sciences > University of Zurich > http://tiny.cc/mrobin > > > > -- *Bernardo Bello Ortí* PhD student CReSA-IRTA Campus de Bellaterra-Universitat Autònoma de Barcelona Edifici CReSA 08193 Bellaterra (Barcelona, Spain) Tel.: 647 42 52 63 *www.cresa.es * * * [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 3 minutes ago
WEHI, Melbourne, Australia
Dear Bernardo, > Date: Mon, 21 Oct 2013 11:23:57 +0200 > From: Bernardo Bello <popnard at="" gmail.com=""> > To: bioconductor at r-project.org > Subject: [BioC] Too many DEG with edgeR output? > > Hi there, > > I'm dealing with bacterial RNA-seq analysis. My experiment is very simple. > Two samples to compare and no replicates. Simplest to do but hardest to analyse. > Reads were generated with Ion Torrent PGM using 316 chip. One for each > sample and performed in different days. > > Since I had too many differentially expressed genes, Should I be more > conservative assigning edgeR dispersion value? Well, in the absence of replicates you are simply picking a dispersion value out of the air as it were. So, if you get more DE gene than you want, why not use a larger dispersion value? Otherwise you might try a new feature of edgeR to try to estimate the dispersions. First, make a DGEList object with both libraries as one group: d2 <- d d2$samples$group <- c(1,1) Then estimate the dispersion robustly: d2 <- estimateDisp(d2, robust=TRUE, winsor.tail.p=c(0.05,0.4)) plotBCV(d2) Now copy the trended dispersion to your original object: d$trended.dispersion <- d2$trended.dispersion Now do the tests: et <- exactTest(d) Best wishes Gordon > Also, there are considerable more up-regulated genes in *exvivo* than in > plate sample. I can't help you with that, if it is indeed a problem. Best wishes Gordon > See logFC_vs_logCPM figure: > > https://docs.google.com/file/d/0B8-ZAuZe8jldY0Y5SWJSdXh1WGc/edit?usp =sharing > > Thanks for you help, Bernardo > > > > - tmap code: > > tmap map2 -f HPNK_clean.fsa -r exvivo.fastq -i fastq -s exvivo.sam --verbose > > tmap map2 -f HPNK_clean.fsa -r plate.fastq -i fastq -s plate.sam --verbose > > - Flasgstat: > > exvivo: > >> 3240242 + 0 in total (QC-passed reads + QC-failed reads) > > >> 2132481 + 0 mapped (65.81%:nan%) > > plate: > >> 3774075 + 0 in total (QC-passed reads + QC-failed reads) > > >> 3510438 + 0 mapped (93.01%:nan%) > > > - count: > > python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID > exvivo.sam HPNK.gff > exvivo.counts > > python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID > plate.sam HPNK.gff > plate.counts > > > - count stats: > > ex-vivo stats > >> no_feature 777946 > >> ambiguous 1 > >> too_low_aQual 0 > >> not_aligned 1107761 > >> alignment_not_unique 0 > > plate stats > >> no_feature 776707 > >> ambiguous 47 > >> too_low_aQual 0 > >> not_aligned 263637 > >> alignment_not_unique 0 > > - edgeR code: > > library(edgeR) > > files <- dir(pattern="*\\.counts$") > > RG <- readDGE(files, header=FALSE) > > RG > > keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one > count per million (cpm) in at least three samples > > RG <- RG[keep,] > > dim(RG) > > > RG <- calcNormFactors(RG) > > RG$samples > > > plotMDS(RG) > > > bcv <- 0.2 #Assigned dispersion value of 0.2 > > m <- as.matrix(RG) > > d <- DGEList(counts=m, group=(1:2)) #modify 'group' depending on sample > number. Also can be adapted to replicated samples, see'?DGEList'. > > d > > et <- exactTest(d, pair=(1:2),dispersion=bcv^2) #exactTest(RG, > pair=(1:2),dispersion=bcv^2) > > et > > top <- topTags(et) > > top > > cpm(RG)[rownames(top), ] #Check the individual cpm values for the top genes: > > summary(de <- decideTestsDGE(et)) #The total number of DE genes at 5% FDR > is given by'decideTestsDGE'. > > > [,1] > > -1 200 > > 0 1176 > > 1 769 > > Of the 'number' tags identified as DE, 769 are up-regulated ex-vivo and 200 > are down-regulated. > > detags <- rownames(RG)[as.logical(de)] #detags contains the DE genes at 5% > FDR > > plotSmear(et, de.tags=detags) #plot all genes and highlight DE genes at 5% > FDR > > abline(h=c(-1, 1), col="blue") #The blue lines indicate 2-fold changes. > > title("plate vs ex-vivo") > > dev.copy2pdf(file = "Figure_1.pdf") #Save as .pdf## > > -- > > *Bernardo Bello Ort?* > > PhD student > > CReSA-IRTA > > Campus de Bellaterra-Universitat Aut?noma de Barcelona > > Edifici CReSA > > 08193 Bellaterra (Barcelona, Spain) > > Tel.: 647 42 52 63 *www.cresa.es * > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:5}}
ADD COMMENT
0
Entering edit mode
Dear Gordon, I ran the code you suggested to estimate dispersion in my non replicated experiment. In summary, I ended with the same MA plot but only 2 genes were tagged as statistically differentially expressed. At this point, and following your advice in in edgeR vignette: 'Be satisfied with a descriptive analysis, that might include an MDS plot and an analysis of fold changes. Do not attempt a significance analysis.' I decide that I have no more alternatives to see the real amount of DEG in my data that going to qPCR, the way my supervisor wants we to check if fold changes are true or not. I will choose genes from all fold change range in order to see to which point my fold changes match the qPCR results. Hope I will remember to update this thread when having the results. Thank you so much for all your advices. Sincerely, Bernardo library(edgeR) #Read all .counts files from current working directory files <- dir(pattern="*\\.counts$") RG <- readDGE(files, header=FALSE) RG m <- as.matrix(RG) d <- DGEList(counts=m, group=(1:2)) #Well, in the absence of replicates you are simply picking a dispersion value out of the air as it were. So, if you get more DE gene than you want, why not use a larger dispersion value? #Otherwise you might try a new feature of edgeR to try to estimate the dispersions. First, make a DGEList object with both libraries as one group: d2 <- d d2$samples$group <- c(1,1) #Then estimate the dispersion robustly: d2 <- estimateDisp(d2, robust=TRUE, winsor.tail.p=c(0.05,0.4)) plotBCV(d2) #Now copy the trended dispersion to your original object: d$trended.dispersion <- d2$trended.dispersion #Now do the tests: et <- exactTest(d) top <- topTags(et) top cpm(RG)[rownames(top), ] #Check the individual cpm values for the top genes: summary(de1 <- decideTestsDGE(et)) # [,1] -1 0 0 2258 1 2 2013/10/24 Gordon K Smyth <smyth@wehi.edu.au> > Dear Bernardo, > > Date: Mon, 21 Oct 2013 11:23:57 +0200 >> From: Bernardo Bello <popnard@gmail.com> >> To: bioconductor@r-project.org >> Subject: [BioC] Too many DEG with edgeR output? >> >> Hi there, >> >> I'm dealing with bacterial RNA-seq analysis. My experiment is very simple. >> Two samples to compare and no replicates. >> > > Simplest to do but hardest to analyse. > > Reads were generated with Ion Torrent PGM using 316 chip. One for each >> sample and performed in different days. >> >> Since I had too many differentially expressed genes, Should I be more >> conservative assigning edgeR dispersion value? >> > > Well, in the absence of replicates you are simply picking a dispersion > value out of the air as it were. So, if you get more DE gene than you > want, why not use a larger dispersion value? > > Otherwise you might try a new feature of edgeR to try to estimate the > dispersions. First, make a DGEList object with both libraries as one group: > > d2 <- d > d2$samples$group <- c(1,1) > > Then estimate the dispersion robustly: > > d2 <- estimateDisp(d2, robust=TRUE, winsor.tail.p=c(0.05,0.4)) > plotBCV(d2) > > Now copy the trended dispersion to your original object: > > d$trended.dispersion <- d2$trended.dispersion > > Now do the tests: > > et <- exactTest(d) > > Best wishes > Gordon > > > Also, there are considerable more up-regulated genes in *exvivo* than in >> plate sample. >> > > I can't help you with that, if it is indeed a problem. > > Best wishes > Gordon > > See logFC_vs_logCPM figure: >> >> https://docs.google.com/file/**d/0B8-**ZAuZe8jldY0Y5SWJSdXh1WGc/edi t?** >> usp=sharing<https: docs.google.com="" file="" d="" 0b8-zauze8jldy0y5swjsdxh="" 1wgc="" edit?usp="sharing"> >> >> Thanks for you help, Bernardo >> >> >> >> - tmap code: >> >> tmap map2 -f HPNK_clean.fsa -r exvivo.fastq -i fastq -s exvivo.sam >> --verbose >> >> tmap map2 -f HPNK_clean.fsa -r plate.fastq -i fastq -s plate.sam --verbose >> >> - Flasgstat: >> >> exvivo: >> >> 3240242 + 0 in total (QC-passed reads + QC-failed reads) >>> >> >> >> 2132481 + 0 mapped (65.81%:nan%) >>> >> >> plate: >> >> 3774075 + 0 in total (QC-passed reads + QC-failed reads) >>> >> >> >> 3510438 + 0 mapped (93.01%:nan%) >>> >> >> >> - count: >> >> python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID >> exvivo.sam HPNK.gff > exvivo.counts >> >> python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID >> plate.sam HPNK.gff > plate.counts >> >> >> - count stats: >> >> ex-vivo stats >> >> no_feature 777946 >>> >> >> ambiguous 1 >>> >> >> too_low_aQual 0 >>> >> >> not_aligned 1107761 >>> >> >> alignment_not_unique 0 >>> >> >> plate stats >> >> no_feature 776707 >>> >> >> ambiguous 47 >>> >> >> too_low_aQual 0 >>> >> >> not_aligned 263637 >>> >> >> alignment_not_unique 0 >>> >> >> - edgeR code: >> >> library(edgeR) >> >> files <- dir(pattern="*\\.counts$") >> >> RG <- readDGE(files, header=FALSE) >> >> RG >> >> keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one >> count per million (cpm) in at least three samples >> >> RG <- RG[keep,] >> >> dim(RG) >> >> >> RG <- calcNormFactors(RG) >> >> RG$samples >> >> >> plotMDS(RG) >> >> >> bcv <- 0.2 #Assigned dispersion value of 0.2 >> >> m <- as.matrix(RG) >> >> d <- DGEList(counts=m, group=(1:2)) #modify 'group' depending on sample >> number. Also can be adapted to replicated samples, see'?DGEList'. >> >> d >> >> et <- exactTest(d, pair=(1:2),dispersion=bcv^2) #exactTest(RG, >> pair=(1:2),dispersion=bcv^2) >> >> et >> >> top <- topTags(et) >> >> top >> >> cpm(RG)[rownames(top), ] #Check the individual cpm values for the top >> genes: >> >> summary(de <- decideTestsDGE(et)) #The total number of DE genes at 5% FDR >> is given by'decideTestsDGE'. >> >> >> [,1] >> >> -1 200 >> >> 0 1176 >> >> 1 769 >> >> Of the 'number' tags identified as DE, 769 are up-regulated ex-vivo and >> 200 >> are down-regulated. >> >> detags <- rownames(RG)[as.logical(de)] #detags contains the DE genes at 5% >> FDR >> >> plotSmear(et, de.tags=detags) #plot all genes and highlight DE genes at 5% >> FDR >> >> abline(h=c(-1, 1), col="blue") #The blue lines indicate 2-fold changes. >> >> title("plate vs ex-vivo") >> >> dev.copy2pdf(file = "Figure_1.pdf") #Save as .pdf## >> >> -- >> >> *Bernardo Bello Ortí* >> >> PhD student >> >> CReSA-IRTA >> >> Campus de Bellaterra-Universitat Autònoma de Barcelona >> >> Edifici CReSA >> >> 08193 Bellaterra (Barcelona, Spain) >> >> Tel.: 647 42 52 63 *www.cresa.es * >> >> > ______________________________**______________________________**____ ______ > The information in this email is confidential and inte...{{dropped:31}}
ADD REPLY

Login before adding your answer.

Traffic: 647 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6