Transcriptional amplification is the phenomenon where the majority of genes in a sample are increased in expression. An overview of it and its statistical implications was published in Cell. Recent research found that about 30% of primary cancers show evidence of genome doubling (tetraploidy). Now that I have a data set with matched RNA-seq and DNA WGS, I notice that in many samples the DNA sequencing analysis estimates a copy number of 4 for about 70% to 80% of the chromosomes.
I am considering how to estimate size factors for samples correctly before model fitting. edgeR has calcNormFactors, which tries to make a summary value the same between samples (with the default TMM method, a trimmed mean of log fold changes). Could a more complicated version of it be developed in future? For example, it might look like:
# Assume that SCC15 cancer is tetraploid, SCC22 is diploid.
genesList <- list(SCC15normal = allGenes, SCC15cancer = SCC15tetraploidGenes,
                  SCC22normal = allGenes, SCC22cancer = allGenes)
calcNormFactors(countMatrix, targetLFC = c(0, 1, 0, 0), whichGenes = genesList)
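To make the proposal concrete, here is a minimal base R sketch of what such a function might compute. It is not part of edgeR: `sizeFactorsWithTarget` is a hypothetical helper, the toy counts are noise-free and invented for illustration, and it uses a median log ratio rather than TMM. It also assumes that a tetraploid gene is expected to show a log2 fold change of exactly 1, which is the premise of the question rather than an established fact.

```r
# Hypothetical helper, NOT an edgeR or DESeq2 function.
# counts:     genes x samples matrix with row and column names
# whichGenes: list with one character vector of gene names per sample
# targetLFC:  expected log2 fold change of those genes in each sample
sizeFactorsWithTarget <- function(counts, whichGenes, targetLFC) {
  stopifnot(ncol(counts) == length(whichGenes),
            ncol(counts) == length(targetLFC),
            any(targetLFC == 0))
  logCounts <- log(counts)
  # Reference profile: mean log count over samples expected to be unshifted.
  logRef <- rowMeans(logCounts[, targetLFC == 0, drop = FALSE])
  sf <- vapply(seq_len(ncol(counts)), function(j) {
    genes <- whichGenes[[j]]
    logRatio <- logCounts[genes, j] - logRef[genes]
    logRatio <- logRatio[is.finite(logRatio)]  # drop genes with zero counts
    # Remove the expected biological shift before taking the scale factor.
    exp(median(logRatio) - targetLFC[j] * log(2))
  }, numeric(1))
  names(sf) <- colnames(counts)
  sf
}

# Toy data (invented): six genes, four samples with depths 1, 2, 1, 3.
geneNames <- paste0("gene", 1:6)
depth <- c(SCC15normal = 1, SCC15cancer = 2, SCC22normal = 1, SCC22cancer = 3)
countMatrix <- outer(c(10, 20, 50, 100, 200, 400), depth)
dimnames(countMatrix) <- list(geneNames, names(depth))
# Genes 1 to 4 are doubled in the sample assumed to be tetraploid.
tetraploidGenes <- geneNames[1:4]
countMatrix[tetraploidGenes, "SCC15cancer"] <-
  2 * countMatrix[tetraploidGenes, "SCC15cancer"]

genesList <- list(SCC15normal = geneNames, SCC15cancer = tetraploidGenes,
                  SCC22normal = geneNames, SCC22cancer = geneNames)
sf <- sizeFactorsWithTarget(countMatrix, genesList, targetLFC = c(0, 1, 0, 0))
relativeDepth <- sf / sf[["SCC15normal"]]
```

On this toy matrix `relativeDepth` recovers the simulated depths, whereas a median computed over all genes without the target shift would attribute the doubling in SCC15cancer to sequencing depth. Real data would need a robust choice of reference samples and gene sets.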
Similarly, DESeq2 has a function named estimateSizeFactors. Would providing a matrix of copy numbers, with genes as rows and samples as columns, as normMatrix allow the accurate estimation of size factors?
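As a rough illustration of the copy-number matrix idea, the base R sketch below divides the counts by a gene-by-sample matrix before computing median-of-ratios size factors, then combines the two into per-gene normalization factors. As far as I understand the DESeq2 documentation, this is roughly what supplying `normMatrix` to `estimateSizeFactors` does, but the code here does not call DESeq2 and should not be read as its implementation. The helper `medianRatioSizeFactors`, the toy data, and above all the assumption that expression is proportional to copy number are mine.

```r
# Median-of-ratios size factors in base R (illustrative helper, not DESeq2 code).
medianRatioSizeFactors <- function(counts) {
  logGeoMeans <- rowMeans(log(counts))
  apply(counts, 2, function(cnts) {
    keep <- is.finite(logGeoMeans) & cnts > 0
    exp(median(log(cnts[keep]) - logGeoMeans[keep]))
  })
}

# Toy data (invented): six genes, four samples with depths 1, 2, 1, 3.
geneNames <- paste0("gene", 1:6)
depth <- c(SCC15normal = 1, SCC15cancer = 2, SCC22normal = 1, SCC22cancer = 3)
countMatrix <- outer(c(10, 20, 50, 100, 200, 400), depth)
dimnames(countMatrix) <- list(geneNames, names(depth))

# Copy numbers from WGS: genes 1 to 4 have four copies in SCC15cancer.
copyNumber <- matrix(2, nrow = 6, ncol = 4, dimnames = dimnames(countMatrix))
copyNumber[1:4, "SCC15cancer"] <- 4
# ASSUMPTION: expression scales linearly with copy number.
countMatrix <- countMatrix * copyNumber / 2

# Express copy number relative to diploid and remove it before estimating depth.
normMatrix <- copyNumber / 2
sizeFactors <- medianRatioSizeFactors(countMatrix / normMatrix)

# Per-gene, per-sample factors combining copy number and sequencing depth.
normFactors <- sweep(normMatrix, 2, sizeFactors, "*")
normalized <- countMatrix / normFactors
```

In this noise-free example the size factors are proportional to the simulated depths and the normalized columns are identical. Whether the copy-number effect should be normalized away like this or kept as signal depends on the biological question, and dosage effects in real tumours are unlikely to be exactly linear.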
Could the vignettes of the edgeR and DESeq2 packages include a section listing all of the functions whose parameters the user needs to set to model transcriptional amplification correctly? Neither vignette discusses the crucial assumptions that the default workflows rely on, such as most genes not being differentially expressed, or the impact of violating those assumptions.