Question

GOseq Probability Weight Function bad fit

0

Entering edit mode

francesca • 0

@6d92f3cd

Last seen 7 weeks ago

Germany

I am trying to perform a GO analysis of differentially expressed genes with GOseq, but I get a very poor fit of pwf by the nullp function.

A bit of context:

I used bowtie to map the reads onto the transcriptome. Then salmon in alignment mode, tximport to import the results into R, and DESeq2 for the DGE analysis (FDR p< 0.001)
I use all the genes obtained from salmon as assayed genes (which are actually all the genes in my transcriptome since I used the same data to assemble it)
since my organism is not supported by GOseq, I created a vector with gene lengths from tximport count matrix with lengthData <- rowMeans(count.matrix$length) which gives the mean across sample of transcript length in nucleotides

pwf <- nullp(DEgenes = go.seq.v,bias.data = lengthData) gives a weird fit pwf

I tried to round the length values to the integer

lengthData.round <- as.integer(round(lengthData,-0))

pwf.r <- nullp(DEgenes = go.seq.v,bias.data = lengthData.round)

the fit is better, but the function shape is not what I expected at all.

pwf_round

Hence my questions:

Is it correct to use float or integer for the length values? Why does it change the plot so much? Should I change something here
Could my DE genes not suffer from length bias? Maybe because DESeq2 is already accounting for it.
What shall I change to get a better fit? Can I change something in the model? Or should I work on the data?
Given that for whatever reason the pwf is not fitting the data well, would it make more sense not to account for gene length at all? Can I do this in GOseq or should I use another software like ClusterProfiler?

Thanks in advance for any input on the matter

goseq • 215 views

ADD COMMENT • link updated 5 hours ago by Kevin Blighe ★ 4.0k • written 8 weeks ago by francesca • 0

score 0 · Answer 1 · 2025-11-18

My sincere apologies that my colleagues had ignored your question.

The probability weighting function (PWF) in GOseq is fitted using a monotonic spline to account for length bias in RNA-seq data, where longer genes are more likely to be detected as differentially expressed. Your approach of using average effective lengths from tximport is reasonable, as these approximate gene lengths, but ensure that you summarized to the gene level via tximport (using a tx2gene mapping); if not, aggregate transcript lengths per gene to avoid mismatches.

Regarding your first question, bias.data in nullp expects a numeric vector of gene lengths in base pairs, typically integers. Floating-point values are acceptable since the argument is numeric, but rounding to integers aligns with standard gene length representations and may stabilize the fit by reducing precision artifacts in the loess smoothing (span=0.5 internally). The plot changes substantially because nullp sorts and bins the data; floating-point variations can alter binning or smoothing, leading to erratic curves if lengths have high precision or outliers.

For your second question, DESeq2 normalizes counts by effective length during differential expression analysis, but length bias persists in the detection power for differentially expressed genes - longer genes remain more likely to pass significance thresholds. If your differentially expressed genes show no length bias (e.g., due to specific biology), the PWF may appear flat or poorly fitted, which is valid and indicates minimal correction is needed.

To improve the fit, inspect your length data for outliers or errors (e.g., very short or long genes) and consider filtering them. Alternatively, use median instead of mean lengths across samples:

lengthData <- rowMedians(count.matrix$length)

You cannot directly modify the model in nullp without altering the source code, so focus on data quality. Plot the raw data points to assess if the spline deviates excessively.

If the PWF fits poorly despite adjustments, omitting length correction may be appropriate if no bias is evident. In GOseq, achieve this by setting bias.data=NULL in nullp, which assigns equal weights:

pwf <- nullp(DEgenes = go.seq.v, bias.data = NULL)

Alternatively, switch to clusterProfiler, which uses hypergeometric tests by default (no inherent length correction) and supports custom annotations for non-model organisms via org.db-like objects.

Kevin