I am working with bulk RNA-seq data from ~300 samples of human healthy and diseased tissue.
The data were mapped to the reference transcriptome using Salmon and were summarised to gene-level and transcript-level using tximport (LengthscaledTPM and DTUscaledTPM) functions respectively. The hidden variation was assessed using the gene-level expression matrix using svaseq.
The gene-level matrix was used for differential gene expression analysis using limma with voom transformation. The model for differential gene expression included the condition of interest (health/disease) and the surrogate variables (SVs) calculated by svaseq.
The transcript-level matrix was used for differential transcript usage analysis using the function limma::diffsplice to find out differences in transcript relative abundances of each gene between conditions. The model here also included the SVs calculated previously. (limma::diffsplice was the only one that scaled well with the number of my samples, other options like DEXSeq and DRIMseq were too slow).
To find out more about the different transcripts and how they are derived I checked differences in local splicing using Leafcutter method. For that I mapped the reads to the reference genome using STAR and extracted reads (Leafcutter) that are split in order to find introns sharing splicing sites. The matrix in this case is on the intron-level. When performing the differential splicing analysis (this is a Dirichlet multinomial model that Leafcutter uses) I also included the SVs in the model.
My question here is:
Can I include the surrogate variables calculated on the gene-level for all my analyses (gene, transcript, splicing) as indicative of the hidden variation in my data? Or should I run svaseq separately for all the different matrices and include different SVs in each of the three analyses?