I am trying to perform a GO analysis of differentially expressed genes with GOseq, but I get a very poor fit of pwf by the nullp function.
A bit of context:
- I used bowtie to map the reads onto the transcriptome. Then salmon in alignment mode, tximport to import the results into R, and DESeq2 for the DGE analysis (FDR p< 0.001)
- I use all the genes obtained from salmon as assayed genes (which are actually all the genes in my transcriptome since I used the same data to assemble it)
- since my organism is not supported by GOseq, I created a vector with gene lengths from tximport count matrix with lengthData <- rowMeans(count.matrix$length)which gives the mean across sample of transcript length in nucleotides
pwf <- nullp(DEgenes = go.seq.v,bias.data = lengthData) gives a weird fit 
I tried to round the length values to the integer
lengthData.round <- as.integer(round(lengthData,-0))
pwf.r <- nullp(DEgenes = go.seq.v,bias.data = lengthData.round)
the fit is better, but the function shape is not what I expected at all.
Hence my questions:
- Is it correct to use float or integer for the length values? Why does it change the plot so much? Should I change something here 
- Could my DE genes not suffer from length bias? Maybe because DESeq2 is already accounting for it. 
- What shall I change to get a better fit? Can I change something in the model? Or should I work on the data? 
- Given that for whatever reason the pwf is not fitting the data well, would it make more sense not to account for gene length at all? Can I do this in GOseq or should I use another software like ClusterProfiler? 
Thanks in advance for any input on the matter
