Question: RNA-seq Statstical Models and Transcriptional Amplification
0
28 days ago by
Dario Strbenac1.4k
Australia
Dario Strbenac1.4k wrote:

Transcriptional amplification is the phenomenon where the majority of genes in a sample are increased in expression. An overview of it and the statistical implications is found in Cell. Recent research found that about 30% of primary cancers have evidence of genome doubling (tetraploidy). Now that I have a data set with matched RNA-seq and DNA WGS, I notice many samples have an estimated copy number of 4 for about 70% to 80% of their chromosomes from the DNA sequencing analysis.

I am considering how to estimate size factors for samples correctly before model fitting. edgeR has calcNormFactors which tries to make some value like the median fold change between samples the same. Could a more complicated version of it be developed in future? For example, it might look like

# Assume that SCC15 cancer is tetraploid, SCC22 is diploid.
genesList <- list(SCC15normal = allGenes, SCC15cancer = SCC15tetraploidGenes, SCC22normal = allGenes, SCC22cancer = allGenes)
calcNormFactors(countMatrix, targetLFC = c(0, 1, 0, 0), whichGenes = genesList)


Similarly, DESeq2 has a function named estimateSizeFactors. Would providing a matrix of genes' copy numbers (rows) and samples (columns) as normMatrix allow the accurate estimation of size factors?

Could the vignettes of edgeR and DESeq2 packages have a section showing all of the functions which need parameters to be set by the user to correctly model transcriptional amplification? Neither vignette discusses crucial assumptions, such as most genes are not differentially expressed, for the default workflows to work well, and the impact of assumption violations.

modified 28 days ago by Aaron Lun23k • written 28 days ago by Dario Strbenac1.4k
Answer: RNA-seq Statstical Models and Transcriptional Amplification
2
28 days ago by
Aaron Lun23k
Cambridge, United Kingdom
Aaron Lun23k wrote:

Just use an offset matrix. Exactly how you formulate it depends on the question being asked.

### Approach 1:

diploid.genes <- # index of genes with same copy number
y.diploid <- y[diploid.genes,]  # do not recompute library sizes
y.diploid <- calcNormFactors(y.diploid)
dip.offsets <- getOffset(y.diploid)

copy.lfc <- # per-gene copy number variation from diploid
offset <- outer(copy.lfc, dip.offsets, FUN="+")
y\$offset <- offset


The null hypothesis here is that doubling (or any multiplication or division of) the gene copy results in doubling of gene expression. Rejections may be interesting as they represent some kind of negative/positive feedback - an extreme example would be X-chromosome inactivation.

### Approach 2:

If copy number changes on a per-chromosome basis, you could do:

offsets <- matrix(0, nrow=ngenes, ncol=nsamples)

# gene.chrs is a vector of chromosomal locations for all genes.
for (chr in unique(gene.chrs)) {
current.chr <- gene.chrs==chr
y.chr <- y[current.chr,]
y.chr <- calcNormFactors(y.chr)
offsets[current.chr,] <- rep(getOffset(y.chr), each=sum(current.chr))
}


This aims to remove the effect of per-chromosome copy number changes, under the assumption that the copy number change will systematically affect all genes on the same chromosome in the same manner. It is more empirical than the first approach because it will accommodate situations where a change in copy number does not result in the same quantitative change in expression (e.g., doubling of copy number results in a 50% increase in expression in all genes).

If you have finer structural variants that are still highly stereotypical across samples, you can still use the above approach as provided; just replace gene.chrs with the variant-relevant location (e.g., chromosome + arm). However, if you have variants that differ wildly across samples, it would be better to manually loop across all samples when constructing the offset matrix.

### Approach 3:

Normalization makes assumptions about the systematic behaviour of many genes. However, there's no biological reason for regulatory networks to behave in a consistent way upon changes to copy number. If you double the copy number of a chromosome, some genes will probably increase in their expression, but not necessarily double; other genes will exhibit positive feedback; other genes will exhibit negative feedback.

Thus, if you want to properly model copy number, the most correct analysis approach would be to include a copy-number term in your design matrix for each gene. You can achieve this by looping over all chromosomes and performing a separate DE analysis for each one; there should still be plenty of genes for empirical Bayes shrinkage to be effective.

If you use this approach and copy number changes are NOT of interest, then normalization doesn't really matter as any changes to the offsets will be absorbed into the coefficient estimates for the copy number blocking factor (assuming you've converted the copy number into a factor rather than leaving it continuous).

With some samples being almost entirely diploid and others being almost entirely tetraploid, Approach 1 wouldn't have many genes to use. It requires a subset of genes that are diploid in all samples. I am doubtful about looping across the samples for Approach 2. Wouldn't that mean using calcNormFactors on one sample at a time, so the resulting factor is always 1? Approach 3 seems like a good choice, but the duplications often don't span entire chromosomes, but a sizeable fraction of them. The partitioning of the genome into regions to loop over would be difficult.

Actually, my original interest for calculating normalisation factors was to scale raw counts between samples so that the pairs which are diploid and tetraploid would have most genes at a two-fold difference for exploratory data purposes such as drawing a heatmap (log-scale, of course) to highlight how transcriptional amplification affects the majority of genes. But, it's a good point you raise whether such an overall shift should be expected or not.

### Approach 2 clarification:

# Making up data.
copy <- matrix(sample(c(0.5, 1, 2), 10000, replace=TRUE), ncol=10)
copy[,1] <- 1 # assume that first sample is normal diploid.
y <- matrix(rnbinom(10000, mu=copy*10, size=10), ncol=10)

# Computing offsets.
library(edgeR)
d <- DGEList(y)
ref <- 1
offset <- matrix(0, nrow(y), ncol(y))

for (i in seq_len(ncol(y))) { # outer loop across samples.
d1 <- d[,c(i, ref)]
cur.copy <- copy[,i]

for (x in unique(cur.copy)) { # inner loop across copy number levels.
cur.genes <- x==cur.copy
d2 <- d1[cur.genes,]
d2 <- calcNormFactors(d2)
o <- getOffset(d2)
offset[cur.genes,i] <- o[1] - o[2]
}
}


I'd imagine that you'd probably want to use scaleOffset to add offsets to a DGEList, to get them on the same scale as the log-transformed effective library sizes for proper interpretation, e.g., by aveLogCPM.

Depending on exactly how far down the rabbit hole you're willing to go; it's theoretically possible to accommodate gene-by-gene differences in the copy number profile. For the NB dispersion estimation, you just loop across all genes, fit a GLM to each gene separately to get the adjusted profile likelihood values across the dispersion grid, and then aggregate this value across genes with WLEB to get the trend. Repeat this with the QL dispersion estimation, passing a vector of gene-specific residual d.f.'s to fitFDist(Robustly). Repeat again when testing the contrast of interest with glmQLFTest, picking up p-values for whatever non-copy-number-related contrast you're interested in. You'd have to dive deep into edgeR's guts to do this but it can be done by someone who is sufficiently determined/knowledgeable/desperate.

But, it's a good point you raise whether such an overall shift should be expected or not.

Indeed, copy number changes can have unpredictable effects. Take Hi-C, for example. If you double the copies of two pieces of DNA, how does their contact frequency change? One might expect 2-fold, if they were already in a tight interaction; but one might alternatively expect 4-fold under a random collision model. And this are just low-level physical interactions; one shudders to think of predicting the higher-order changes from a large-scale perturbation of the transcriptional regulatory network.

Answer: RNA-seq Statstical Models and Transcriptional Amplification
1
28 days ago by
Michael Love23k
United States
Michael Love23k wrote:

For DESeq2, normFactors was created from a request where a user wanted to control for a known matrix of copy number when computing size factors. But this is rarely a use case, that users would want to control away copy number. Much more common is that users want to specify a set of “unchanging” or control genes.

It’s more subtle than “most genes are not DE”. The robust size factors work even if all genes are DE, it’s more that the effects should be roughly centered around 0 (roughly because the median or trimmed mean is not too sensitive to imbalance).