Question

RNA-seq Statstical Models and Transcriptional Amplification

0

Entering edit mode

Dario Strbenac ★ 1.5k

@dario-strbenac-5916

Last seen 3 hours ago

Australia

Transcriptional amplification is the phenomenon where the majority of genes in a sample are increased in expression. An overview of it and the statistical implications is found in Cell. Recent research found that about 30% of primary cancers have evidence of genome doubling (tetraploidy). Now that I have a data set with matched RNA-seq and DNA WGS, I notice many samples have an estimated copy number of 4 for about 70% to 80% of their chromosomes from the DNA sequencing analysis.

I am considering how to estimate size factors for samples correctly before model fitting. edgeR has calcNormFactors which tries to make some value like the median fold change between samples the same. Could a more complicated version of it be developed in future? For example, it might look like

# Assume that SCC15 cancer is tetraploid, SCC22 is diploid.
genesList <- list(SCC15normal = allGenes, SCC15cancer = SCC15tetraploidGenes, SCC22normal = allGenes, SCC22cancer = allGenes)
calcNormFactors(countMatrix, targetLFC = c(0, 1, 0, 0), whichGenes = genesList)

Similarly, DESeq2 has a function named estimateSizeFactors. Would providing a matrix of genes' copy numbers (rows) and samples (columns) as normMatrix allow the accurate estimation of size factors?

Could the vignettes of edgeR and DESeq2 packages have a section showing all of the functions which need parameters to be set by the user to correctly model transcriptional amplification? Neither vignette discusses crucial assumptions, such as most genes are not differentially expressed, for the default workflows to work well, and the impact of assumption violations.

edgeR DESeq2 Transcriptional Amplification • 1.2k views

ADD COMMENT • link updated 5.1 years ago by Aaron Lun ★ 28k • written 5.1 years ago by Dario Strbenac ★ 1.5k

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 13 hours ago

United States

For DESeq2, normFactors was created from a request where a user wanted to control for a known matrix of copy number when computing size factors. But this is rarely a use case, that users would want to control away copy number. Much more common is that users want to specify a set of “unchanging” or control genes.

It’s more subtle than “most genes are not DE”. The robust size factors work even if all genes are DE, it’s more that the effects should be roughly centered around 0 (roughly because the median or trimmed mean is not too sensitive to imbalance).

ADD COMMENT • link 5.1 years ago Michael Love 41k

score 2 · Accepted Answer · 2019-03-25

Just use an offset matrix. Exactly how you formulate it depends on the question being asked.

Approach 1:

diploid.genes <- # index of genes with same copy number
y.diploid <- y[diploid.genes,]  # do not recompute library sizes
y.diploid <- calcNormFactors(y.diploid)
dip.offsets <- getOffset(y.diploid)

copy.lfc <- # per-gene copy number variation from diploid
offset <- outer(copy.lfc, dip.offsets, FUN="+")
y$offset <- offset

The null hypothesis here is that doubling (or any multiplication or division of) the gene copy results in doubling of gene expression. Rejections may be interesting as they represent some kind of negative/positive feedback - an extreme example would be X-chromosome inactivation.

Approach 2:

If copy number changes on a per-chromosome basis, you could do:

offsets <- matrix(0, nrow=ngenes, ncol=nsamples)

# gene.chrs is a vector of chromosomal locations for all genes.
for (chr in unique(gene.chrs)) {
    current.chr <- gene.chrs==chr
    y.chr <- y[current.chr,]
    y.chr <- calcNormFactors(y.chr)
    offsets[current.chr,] <- rep(getOffset(y.chr), each=sum(current.chr))
}

This aims to remove the effect of per-chromosome copy number changes, under the assumption that the copy number change will systematically affect all genes on the same chromosome in the same manner. It is more empirical than the first approach because it will accommodate situations where a change in copy number does not result in the same quantitative change in expression (e.g., doubling of copy number results in a 50% increase in expression in all genes).

If you have finer structural variants that are still highly stereotypical across samples, you can still use the above approach as provided; just replace gene.chrs with the variant-relevant location (e.g., chromosome + arm). However, if you have variants that differ wildly across samples, it would be better to manually loop across all samples when constructing the offset matrix.

Approach 3:

Normalization makes assumptions about the systematic behaviour of many genes. However, there's no biological reason for regulatory networks to behave in a consistent way upon changes to copy number. If you double the copy number of a chromosome, some genes will probably increase in their expression, but not necessarily double; other genes will exhibit positive feedback; other genes will exhibit negative feedback.

Thus, if you want to properly model copy number, the most correct analysis approach would be to include a copy-number term in your design matrix for each gene. You can achieve this by looping over all chromosomes and performing a separate DE analysis for each one; there should still be plenty of genes for empirical Bayes shrinkage to be effective.

If you use this approach and copy number changes are NOT of interest, then normalization doesn't really matter as any changes to the offsets will be absorbed into the coefficient estimates for the copy number blocking factor (assuming you've converted the copy number into a factor rather than leaving it continuous).

Approach 1:

Approach 2:

Approach 3:

Approach 2 clarification:

Approach 3 comments:

Additional remarks: