Question

genefilter filtered out 60% of the original genes?

1

Entering edit mode

cafelumiere12 ▴ 20

@cafelumiere12-7513

Last seen 6.2 years ago

United States

I'm using edgeR + genefilter. I have used it for a few different projects already but for this one I'm getting a chosen theta = 0.6. I wonder if this is something that I need to be concerned of? It just feels like a lot of genes!

Session Info here:

sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252  
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                         
[5] LC_TIME=English_United States.1252   

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base    

other attached packages:
 [1] plyr_1.8.3           pvclust_2.0-0        statmod_1.4.24     
 [4] org.Hs.eg.db_3.2.3   RSQLite_1.0.0        DBI_0.4            
 [7] AnnotationDbi_1.32.3 IRanges_2.4.8        S4Vectors_0.8.11   
[10] Biobase_2.30.0       BiocGenerics_0.16.1  edgeR_3.12.1       
[13] limma_3.26.9         genefilter_1.52.1    BiocInstaller_1.20.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4     lattice_0.20-33 XML_3.98-1.4    grid_3.2.5    
 [5] xtable_1.8-2    annotate_1.48.0 Matrix_1.2-6    splines_3.2.5 
 [9] tools_3.2.5     survival_2.39-2

Here is the code:

library(edgeR)
library(genefilter)

countsFile <- file.path("htseqcount_result.txt")
sampleInfoFile <-  file.path("groupinfo.csv")
outdir <- file.path("edgeR")

counts <- read.table(countsFile, sep = "\t", header = TRUE, row.names=1,check.names = FALSE)
sampleInfo$Group <- sampleInfo$condition

#################
##  Run edgeR  ##
#################
## Construct DGEList
d <- DGEList(counts=counts)

## Make design matrix
Group = factor(sampleInfo$Group)
design <- model.matrix(~0+Group)
colnames(design) <- levels(Group)
rownames(design) <- colnames(counts)

my.contrasts = makeContrasts(Treated_vs_Untreated=Treated-Untreated,levels=design)

d <- calcNormFactors(d)
d <- estimateGLMCommonDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)

## Fit to model
fit <- glmFit(d, design)

alpha = 0.1
FDRcutoff = 0.05
logFCCutoff = 1

####################
##  Run analysis  ##
####################
lrt <- glmLRT(fit, contrast=my.contrasts)

dgeTable = topTags(lrt, n=nrow(d))

## try genefilter:

filtResult = tryGeneFilter(forcedManualFiltering = FALSE, min_reads = 5, nGroup = 4,
                           dgeList = d,testResultDgeTable =dgeTable, alpha = alpha,
                           contrast = contrast, designMatrix = design,
                           processPlots2screen = TRUE,
                           processPlots2pdf = TRUE,
                           pdf_filepath = file.path(output.dir,paste(compPair,"_genefilterProcess.pdf",sep="")))

if (filtResult$method == "manualFilter") ## manual filtering
{
    d2 <- filtResult$dgeFiltered
    d2 <- calcNormFactors(d2)
    d2 <- estimateGLMCommonDisp(d2, design)
    d2 <- estimateGLMTagwiseDisp(d2, design)
    fit2 <- glmFit(d2, design)
    lrt2 <- glmLRT(fit2, contrast=contrast)
    dgeTable2 = topTags(lrt2, n=nrow(d))
    resultTable = dgeTable2
} else if((filtResult$method == "genefilter"))
{
    resultTable <- filtResult$adjustedTestResultTable
}

##############################################
# Function
##############################################
tryGeneFilter <- function( forcedManualFiltering = FALSE, min_reads = 5, nGroup = 3, dgeList,
                           testResultDgeTable, filter_method = "rowSums",
                           contrast = NA, designMatrix = NA,
                           theta = seq(from=0, to=0.8, by=0.01),alpha = 0.1,
                           processPlots2screen = TRUE, processPlots2pdf = FALSE, pdf_filepath = NA
                          )
{
    #################################################################################
    ##  Args for manual filtering:                                                 ##
    ##  Filtering genes with low expression / zero count. At least 1/nGroup        ##
    ##  of the samples have at least min_reads or more reads. Filter base on       ##
    ##  the lib.size                                                               ##
    ##      - min_reads                                                            ##
    ##      - nGroup : generally the number of different groups                    ##
    ##      - dgeList                                                              ##
    ##  Args for genefilter:                                                       ##
    ##      - dgeList                                                              ##
    ##      - testResultDgeTable                                                   ##
    ##      - filter_method                                                        ##
    ##      - contrast: if not NA, only calculate filterstats based on selected    ##
    ##                  samples that are used to run the test. e.g., contrast used ##
    ##                  for design matrix. If provided, design matrix should also  ##
    ##                  be provided.It's a vector of 0, -1, 1, with names matching ##
    ##                  column names of the design matrix.                         ##
    ##      - designMatrix: the design matrix used for running the glm test.       ##
    ##      - theta: the sequence of tested percentage for gene filter.            ##
    ##      - alpha: FDR cut-off used for maximizing rejected genes                ##                                                                ##
    ##  Return:                                                                    ##
    ##      - If using manual filter: return new filtered dge list for fitting to  ##
    ##        the model and the "keep" vector                                      ##
    ##      - If using genefilter: return the new result table and also the keep   ##
    ##        vector.                                                              ##
    #################################################################################

    ## ------------------------- ##
    ##     Manual filtering      ##
    ## ------------------------- ##
    n_samples = ncol(dgeList)
    cpm_cutoff = min_reads/(max(dgeList$samples$lib.size)/1000000)
    keep_manual = rowSums(cpm(dgeList)>cpm_cutoff) >=  round(n_samples/nGroup)

    if (forcedManualFiltering)
    {
        ## Use manual filtering:
        d <- dge[keep_manual,]
        print("Forced manual filtering. Filtered dge object returned. Fit model again.")
        return(list("dgeFiltered" = d, "keep" = keep_manual, "method" = "manualFilter"))

    } else {

        ## ------------------------- ##
        ##     Gene filter           ##
        ## ------------------------- ##
        require(genefilter)
        require(edgeR)

        filter_methods <- c("rowSums")
        if (!filter_method %in% filter_methods)
        {
            stop(paste("filter_methods should be one of the following:",paste(filter_methods, collapse=","),sep=""))
        } else if (filter_method == "rowSums")
        {
            if(length(contrast)==1)
            {
                ## All samples used for making filterstats. E.g. when exact test is used, or when there are only 2 groups
                if(is.na(contrast))
                {
                    filterstats = rowSums(dgeList$counts[rownames(testResultDgeTable),])
                } else
                {
                    warning("Input contrast is not NA but has only length=1")
                }
            } else
            {
                ## Only use a subset of samples corresponding to the one that the test was run on.
                ## First check:
                if (class(designMatrix)!="matrix")
                {
                    stop("designMatrix needs to be a matrix when contrast is provided.")
                } else
                {
                    if (sum(names(contrast) %in% colnames(design))<length(contrast))
                    {
                        stop("Names of contrast needs to be in column names of design matrix")
                    } else
                    {
                        includedGroups = names(contrast)[contrast!=0]
                        includedSamplesVector = rep(0,length = nrow(design))
                        names(includedSamplesVector) <- rownames(design)
                        for (g in includedGroups)
                        {
                            includedSamplesVector = includedSamplesVector + design[,g]
                        }
                        includedSamples = names(includedSamplesVector)[includedSamplesVector>0]
                        filterstats = rowSums(d$counts[rownames(dgeTable),includedSamples])
                    }
                }
            }
        }
        res = data.frame(
            filterstat = filterstats,
            pvalue = as.data.frame(testResultDgeTable)$PValue,
            row.names = rownames(testResultDgeTable))

        # pdf(paste("genefilterFilterstat_",contrast[2],"_vs_",contrast[1],".pdf", sep="") , w=11, h=8.5)
        # print(with(res,plot(rank(filterstat)/length(filterstat), -log10(pvalue), pch=16, cex=0.45)))
        # dev.off()
        trsf = function(n) log10(n+1)
        # plot(ecdf(trsf(res$filterstat)), xlab=body(trsf), main="")

        pBH = filtered_p(filter=res$filterstat, test=res$pvalue, theta=theta, method="BH")
        rejBH = filtered_R(alpha=alpha, filter=res$filterstat, test=res$pvalue, theta=theta, method="BH")

        #####################
        ##   Print plots   ##
        #####################
        if (processPlots2screen)
        {
            rejection_plot(pBH, at="sample",
                xlim=c(min(theta), max(theta)), ylim=c(0, 2000),
                xlab="FDR cutoff (Benjamini & Hochberg adjusted p-value)", main="")

            plot(theta, rejBH, type="l",xlab=expression(theta), ylab= paste("number of rejections (alpha=",alpha,")",sep=""))
        }
        if(processPlots2pdf)
        {
            if(is.na(pdf_filepath))
            {
                pdf("genefilter.pdf")
            } else
            {
                pdf(pdf_filepath)
            }
            rejection_plot(pBH, at="sample",
                xlim=c(min(theta), max(theta)), ylim=c(0, 2000),
                xlab="FDR cutoff (Benjamini & Hochberg adjusted p-value)", main="")

            plot(theta, rejBH, type="l",xlab=expression(theta), ylab= paste("number of rejections (alpha=",alpha,")",sep=""))
            dev.off()
        }

        #######################################################################
        ##   Decide which theta to choose based on maximum rejection number  ##
        #######################################################################
        # if(max(rejBH)==0 | rejBH["0%"]== max(rejBH))
        if(max(rejBH)==0)
        {
            ## Use manual filtering:
            d <- dge[keep_manual,]
            notes = "Manual filtering chosen. Filtered dge object returned. Fit model again."
            print(notes)
            return(list("dgeFiltered" = d, "keep" = keep_manual, "method" = "manualFilter", "notes" = notes))
        } else {
            theta_chosen = theta[max(which(rejBH==max(rejBH)))]
            use = res$filterstat > quantile(res$filterstat, probs=theta_chosen)
            names(use) <- rownames(res)

            adjustedTestResult = as.data.frame(testResultDgeTable[use, ])
            newFDR <- p.adjust(adjustedTestResult[,"PValue"], method = "BH")
            adjustedTestResult[,"FDR"] <- newFDR
            notes = "Genefilter used. adjusted test result (dgeTable) returned with new FDR."
            print(notes)
            return(list("adjustedTestResultTable" = adjustedTestResult, "keep" = use, "method" = "genefilter", "theta_chosen" = theta_chosen, "notes" = notes))
            # use <- use[rownames(dge)]
            # dgeFilt=dge[use,]
            # d <- dgeFilt
        }
    }
}

genefilter edgeR • 1.9k views

ADD COMMENT • link updated 8.0 years ago by Nikos Ignatiadis ▴ 180 • written 8.0 years ago by cafelumiere12 ▴ 20

0

Entering edit mode

A vague question like that can only be responded to with a vague answer. If you want actual help, you need to say what exactly you mean by 'using edgeR + genefilter', preferably by showing your code.

ADD REPLY • link 8.0 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks! Just edited the question and attached the code I used.

ADD REPLY • link 8.0 years ago cafelumiere12 ▴ 20

2

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 8 hours ago

United States

OK. If I were to restate your question, it would go something like this:

I am using edgeR and a home-brewed function I wrote that uses some functionality from the genefilter package. This has worked for me in the past, but now it is filtering out more genes than I would normally expect. Any ideas?

And the problem with this is that your function is, like, your function, and as such is your responsibility! This site is primarily intended to help people with questions about Bioconductor functions get help from people who either wrote those functions or have expertise using them. Helping people figure out why their own functions behave differently than they expect is way beyond the scope of this site.

ADD COMMENT • link 8.0 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks for the comment. Your rephrasing is exactly right; however, the function I wrote was really simply a wrapper based on the manual of genefilter. The only customized functionality I added was the option to do the manual filtering. And this was option was not used in the example .I am not looking to ask why my function behaved differently. I was wondering if someone else have seen a large proportion of genes filtered out by genefilter & whether there is any obvious aspect of the data anyone would suggest looking into.

ADD REPLY • link 8.0 years ago cafelumiere12 ▴ 20

1

Entering edit mode

Well, a general comment would be that genefilter wasn't originally intended for RNA-Seq, being written back in the old days, and I'm not sure it has been updated to be applicable. I haven't personally used genefilter since the aughts.

A second general comment would be that simple filtering is usually a better way to go. Something like ensuring at least N of your M samples (where N is equal to the size of your smallest group) have CPM values > some cutoff that corresponds roughly to say, 5 or 10 counts per gene is one way to go. Lately, Gordon Smyth has been advocating a simple cutoff based on the row average from the CPM values which is even simpler.

The rationale for filtering is to remove genes that probably aren't being expressed in any sample, or at such a low level that their signal is mostly noise. Just looking at the results of plotBCV and iteratively filtering to get rid of the majority of the noisy points on the left end of the plot is probably going to give reasonable results, even if it is completely ad hoc.

ADD REPLY • link 8.0 years ago James W. MacDonald 65k

score 2 · Accepted Answer · 2016-05-06

A key diagnostic to figure out whether the cutoff of 60% is reasonable would be to check how many rejections you get after applying this procedure (and possibly also compare to the baseline without any filtering). If the number of rejections is very low, then I would worry. If the number of rejections is high, then the FDR estimator should be quite stable and your procedure statistically valid in terms of controlling the FDR and determining an appropriate cutoff.

I don't see anything wrong with a cutoff of 60% a priori. For microarrays, cutoffs of 50% were quite common [1] and it also appears to be the case for RNASeq datasets. For example, for the airway dataset (https://www.bioconductor.org/packages/3.3/data/experiment/html/airway.html) I get a cutoff filtering out 54% of the genes. Note that a procedure such as Independent Filtering with data-driven cutoff will look at your hypotheses globally. In other words, it might remove some of your "significant" genes (by setting a high filter), as long as this removal adds additional discoveries for genes with high counts.

Motivated by these two short-comings of Independent Filtering (with a maximization of rejections), namely that it might lose FDR control when the number of rejections is very low and that a binary cutoff might remove a hypothesis with a very low p-value if its filter statistic is low, we developed the Independent Hypothesis Weighting (IHW) method. It assigns a weight to each hypothesis (higher weight means that it is more likely to be rejected) and thus can lead to more nuanced decision boundaries than Independent Filtering. This method is also available on Bioconductor at https://www.bioconductor.org/packages/3.3/bioc/html/IHW.html.

[1] Bourgon, Richard, Robert Gentleman, and Wolfgang Huber. "Independent filtering increases detection power for high-throughput experiments."Proceedings of the National Academy of Sciences 107.21 (2010): 9546-9551.