In my lab we have captured and sequenced L1 and Alu retrotransposons from many tissue samples from different donors/conditions.
We're now running GOstats on the list of detected somatic insertions within RefSeq genes +/- 1 kb, in order to look for tissue-specific and condition-specific patterns of somatic retrotransposition events.
A known issue in the field is that, since neuronal-related genes are on average longer than other annotated protein-coding genes, neuronal-related GO terms will show up as the most enriched no matter which tissue is examined, unless proper background filtering is applied. This can easily be verified by generating a list of random intervals with bedtools to simulate the set of insertions from a real experiment, intersecting those intervals with RefSeq gene coordinates, and running a GO analysis on the intersection, as explained and illustrated in a nice review by Thomas C.A. et al. (http://www.ncbi.nlm.nih.gov/pubmed/23057747, Fig.1).
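To make the length effect concrete, here is a minimal Python sketch of the random-insertion check described above, run on a made-up toy genome rather than real RefSeq coordinates. Everything in it is an assumption for illustration: the gene names, the coordinates, the 1 Mb genome size, and the helper names `simulate_hits` and `empirical_pvalue` are hypothetical, and the toy gene windows are assumed not to overlap. The second function shows one possible shape a simulated background could take (an empirical p-value for a gene set); it is not meant as the established correction.

```python
import random
from bisect import bisect_right

# Toy genome (all values are made up): 10 "long" 50 kb genes and
# 10 "short" 5 kb genes, spaced so their +/- 1 kb windows never overlap.
GENOME_LENGTH = 1_000_000
LONG_GENES = [(f"long{i}", i * 60_000 + 2_000, i * 60_000 + 52_000) for i in range(10)]
SHORT_GENES = [(f"short{j}", 620_000 + j * 20_000, 625_000 + j * 20_000) for j in range(10)]
GENES = LONG_GENES + SHORT_GENES


def simulate_hits(genes, genome_length, n_insertions, flank=1000, seed=0):
    """Return the names of genes hit by uniformly random insertion points.

    genes: list of (name, start, end); windows are assumed non-overlapping.
    """
    rng = random.Random(seed)
    # Extend each gene by the flank, mirroring the +/- 1 kb window.
    windows = sorted(
        (max(0, start - flank), min(genome_length, end + flank), name)
        for name, start, end in genes
    )
    starts = [w[0] for w in windows]
    hit = set()
    for _ in range(n_insertions):
        pos = rng.randrange(genome_length)
        i = bisect_right(starts, pos) - 1  # last window starting at or before pos
        if i >= 0 and pos < windows[i][1]:
            hit.add(windows[i][2])
    return hit


def empirical_pvalue(observed_count, term_genes, genes, genome_length,
                     n_insertions, n_perm=1000, flank=1000, seed=0):
    """Fraction of random insertion sets hitting at least as many genes of a
    term as observed (with the usual +1 correction)."""
    term_genes = set(term_genes)
    at_least = 0
    for k in range(n_perm):
        hit = simulate_hits(genes, genome_length, n_insertions, flank, seed=seed + k)
        if len(hit & term_genes) >= observed_count:
            at_least += 1
    return (at_least + 1) / (n_perm + 1)


if __name__ == "__main__":
    hit = simulate_hits(GENES, GENOME_LENGTH, n_insertions=30)
    n_long = sum(name.startswith("long") for name in hit)
    n_short = sum(name.startswith("short") for name in hit)
    print("long genes hit:", n_long, "| short genes hit:", n_short)
```

Because insertion points are drawn uniformly, each gene's chance of being hit scales with its window length, so a term made up of long genes collects more hits by chance alone. In a real analysis the random positions would come from the actual genome and annotation (for example via the bedtools step described above), not from this toy layout.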
In your opinion, what is the best way to correct for this bias in this kind of analysis?
Thank you in advance.