Question: Questions about GOseq after DESeq2
0
gravatar for amandine.fournier@chu-lyon.fr
5.2 years ago by
Dear Bioconductor users and developers, I have 2 questions regarding a gene ontology analysis of RNA-Seq data. I first used the DESeq2 package to perform the differential analysis of my data and it resulted in a list of about 500 DEG. Now I would like to perform a gene ontology analysis on this dataset with the GOseq package. According to the GOseq user manual, this package requires 2 pieces of information : the differentially expressed genes (= my list of 500 genes) and the measured genes. These are defined as "all genes for which RNA-seq data was gathered for your experiment". As you known, DESeq2 filters out low counting genes and outliers. So I wonder if I should consider : - all sequenced genes including low expressed genes and outliers, - or the sequenced genes whithout outliers (but including low expressed genes) - or only remaining genes after low count filtering (final assayed genes) ? I don't know which step filters outliers out. It seems that they are taking into account to estimate library sizes, but I am not sure if they are filtered before or after estimating dispersion values. Low expressed genes are filtered out at the end of the analysis, juste before calculating FDR, so they are taking into account for calculating library sizes, estimating dispersion values, fitting the GLM model and testing, but not in the final FDR. In brief, I wonder if outliers and low expressed genes can be considered as "assayed" in the DESeq2 analysis. Do you have an opinion about this ? Is there a commonly accepted / advised methodology for that ? The second question is : should I analyse over-expressed genes and under-expressed genes together or separately ? Many thanks in advance. Best regards, Amandine ----- Amandine Fournier Lyon Neuroscience Research Center & Lyon Civil Hospital (France)
goseq deseq2 • 2.7k views
ADD COMMENTlink modified 5.2 years ago by Michael Love23k • written 5.2 years ago by amandine.fournier@chu-lyon.fr80
Answer: Questions about GOseq after DESeq2
0
gravatar for Michael Love
5.2 years ago by
Michael Love23k
United States
Michael Love23k wrote:
hi Amandine, I'll defer to the GOseq authors for the recommendation but can answer your questions about DESeq2 On Thu, Feb 20, 2014 at 9:03 AM, <amandine.fournier@chu-lyon.fr> wrote: > > > I don't know which step filters outliers out. It seems that they are > taking into account to estimate library sizes, but I am not sure if they > are filtered before or after estimating dispersion values. > Low expressed genes are filtered out at the end of the analysis, juste > before calculating FDR, so they are taking into account for calculating > library sizes, estimating dispersion values, fitting the GLM model and > testing, but not in the final FDR. > > Filtering out outliers and genes with low mean count are both performed by the results() function, after estimation of size factors, dispersion, and the GLM. (estimates of dispersion and estimates of the GLM coefficients are necessary to test for outliers, as here we use Cook's distance, which is based on the influence of individual samples on the GLM coefficients). I would suggest that the background not include the outlier-containing or low mean count genes (so just use the genes with adjusted p-values that are not NA). This seems to be analogous to the filtering of genes with logFC of 0 in the GOseq vignette: Finally, we Format the DE genes into a vector suitable for use with goseq > > genes=as.integer(p.adjust(tested$table$PValue[tested$table$logFC!=0], > + method="BH")<.05) But again, I would defer to the package authors on this question. Mike [[alternative HTML version deleted]]
ADD COMMENTlink written 5.2 years ago by Michael Love23k
Answer: Questions about GOseq after DESeq2
0
gravatar for Bernd Klaus
5.2 years ago by
Bernd Klaus550
Germany
Bernd Klaus550 wrote:
Dear Amadine, I guess the most important thing in an enrichment analysis is to choose a proper "background" set of genes. ("measured genes" in the GOSeq parlance). In order to choose an appropriate set, you could adapt the code below. It first extracts the differentially expressed genes and then uses the function matchit to compile a background set that has a comparable distribution. This way, you can be sure that your background does not have a dramatically different distribution from your DE genes. Best wishes, Bernd ### extract differentially expressed genes genes <- rownames(subset(DESeqResults, pval.adj <0.1)) #Now, we generate a background set of genes matched for #expression strength, avoiding potential biases. library(MatchIt) backM <- c() ## prepare data frame for matching, sign indicates wheather ## the gene is differentially expressed or not df <- data.frame( sign=as.numeric( rownames(DESeqResults) %in% as.character(genes)), IBSres["baseMean"]) df$baseMean <- round( df$baseMean, 0) ## repeat matching multiple times since ## each differentially expressed gene is ## matched by exactly one non-expressed gene for( i in 1:3 ){ mm <- matchit( sign ~ baseMean, df, method="nearest", distance="mahalanobis") backM <- c(backM, mm$match.matrix[,1]) df <- df[which(!rownames(df) %in% backM),] } backM <- unique( na.exclude(backM) ) ## total number of matched genes length(backM ) ## no DE genes in Background: intersect(backM, genes) #---------------------------------------------------------- ###### Compare distributions of background and significant genes #---------------------------------------------------------- #We check whether our background has the same distribution #of expression strength. pdf("matching.pdf" , height = 5 , width = 5) library(geneplotter) multidensity( list( all= DESeqResults[,"baseMean"] , fore=DESeqResults[rownames(DESeqResults) %in% genes, "baseMean"], back=DESeqResults[rownames(DESeqResults) %in% backM, "baseMean"]), xlab="mean counts", xlim=c(0, 150)) dev.off() ################################ On Feb 20, 2014, at 3:03 PM, <amandine.fournier at="" chu-lyon.fr=""> wrote: > > Dear Bioconductor users and developers, > > I have 2 questions regarding a gene ontology analysis of RNA-Seq data. > > I first used the DESeq2 package to perform the differential analysis of my data and it resulted in a list of about 500 DEG. Now I would like to perform a gene ontology analysis on this dataset with the GOseq package. > > According to the GOseq user manual, this package requires 2 pieces of information : the differentially expressed genes (= my list of 500 genes) and the measured genes. These are defined as "all genes for which RNA-seq data was gathered for your experiment". > > As you known, DESeq2 filters out low counting genes and outliers. So I wonder if I should consider : > - all sequenced genes including low expressed genes and outliers, > - or the sequenced genes whithout outliers (but including low expressed genes) > - or only remaining genes after low count filtering (final assayed genes) ? > > I don't know which step filters outliers out. It seems that they are taking into account to estimate library sizes, but I am not sure if they are filtered before or after estimating dispersion values. > Low expressed genes are filtered out at the end of the analysis, juste before calculating FDR, so they are taking into account for calculating library sizes, estimating dispersion values, fitting the GLM model and testing, but not in the final FDR. > In brief, I wonder if outliers and low expressed genes can be considered as "assayed" in the DESeq2 analysis. > > Do you have an opinion about this ? Is there a commonly accepted / advised methodology for that ? > > The second question is : should I analyse over-expressed genes and under-expressed genes together or separately ? > > Many thanks in advance. > Best regards, > Amandine > > ----- > Amandine Fournier > Lyon Neuroscience Research Center > & Lyon Civil Hospital (France) > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENTlink written 5.2 years ago by Bernd Klaus550

Dear Bernd,

Sorry to revive an old thread, but your approach here looks very interesting! I was just wondering if you could briefly explain what "fore" and "back" represents?

Thanks, Jon

ADD REPLYlink written 3.4 years ago by Jon BrĂ¥te150
Answer: Questions about GOseq after DESeq2
0
gravatar for Bernd Klaus
5.2 years ago by
Bernd Klaus550
Germany
Bernd Klaus550 wrote:
Dear Amadine, concerning your second question: I would start with all DE genes, If it makes sense in your assay then you could also test up- and down regulated genes separately, using the same methodology. GO analysis is very exploratory anyway, so I guess you have a lot of "liberties' there. Best wishes, Bernd On Feb 20, 2014, at 3:03 PM, amandine.fournier at chu-lyon.fr wrote: > > Dear Bioconductor users and developers, > > I have 2 questions regarding a gene ontology analysis of RNA-Seq data. > > I first used the DESeq2 package to perform the differential analysis of my data and it resulted in a list of about 500 DEG. Now I would like to perform a gene ontology analysis on this dataset with the GOseq package. > > According to the GOseq user manual, this package requires 2 pieces of information : the differentially expressed genes (= my list of 500 genes) and the measured genes. These are defined as "all genes for which RNA-seq data was gathered for your experiment". > > As you known, DESeq2 filters out low counting genes and outliers. So I wonder if I should consider : > - all sequenced genes including low expressed genes and outliers, > - or the sequenced genes whithout outliers (but including low expressed genes) > - or only remaining genes after low count filtering (final assayed genes) ? > > I don't know which step filters outliers out. It seems that they are taking into account to estimate library sizes, but I am not sure if they are filtered before or after estimating dispersion values. > Low expressed genes are filtered out at the end of the analysis, juste before calculating FDR, so they are taking into account for calculating library sizes, estimating dispersion values, fitting the GLM model and testing, but not in the final FDR. > In brief, I wonder if outliers and low expressed genes can be considered as "assayed" in the DESeq2 analysis. > > Do you have an opinion about this ? Is there a commonly accepted / advised methodology for that ? > > The second question is : should I analyse over-expressed genes and under-expressed genes together or separately ? > > Many thanks in advance. > Best regards, > Amandine > > ----- > Amandine Fournier > Lyon Neuroscience Research Center > & Lyon Civil Hospital (France) > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENTlink written 5.2 years ago by Bernd Klaus550
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 165 users visited in the last hour