Question

Testing for DE genes between samples of isolated cells, accounting for expression in whole tissue

2

Entering edit mode

smurray ▴ 10

@smurray-13731

Last seen 5.1 years ago

Seattle

Hi all,

I'm hoping to get your advice on how to test for DE genes in samples of cells isolated from a tissue, accounting for the "background" expression in the whole tissue that each sample was isolated from. Essentially, I'd like help designing a model matrix.

The particular laboratory method we using is Ribo-tagging or TRAP-seq, in which ribosome-associated mRNA from specific cells is isolated from a whole tissue (which contains free mRNA and ribo-associated mRNA from other cells), but I think this discussion may also apply to any kind of cell isolation, including cell sorting or laser micro-dissection.

Ribo-tagging and TRAP-seq have been used to identify transcripts that come from particular cell types, to better characterize cell populations or discover new cell populations. Because the isolation of the cell-type-specific mRNA is not 100% efficient, the field often uses the ratio of cell-type-specific expression over whole tissue expression to more accurately identify genes that are truly specific to the cell type of interest. I've linked a couple papers if you're interested:

https://academic.oup.com/nar/article/38/13/4218/2409074/Analytical-approaches-to-RNA-profiling-data-for

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3902954/

However, we are taking things a bit further than identifying cell-specific genes, and would like to identify cell-specific genes that are DE between treatment and control.

In our experiments we have several animals in a treatment group and several in a control group, and each animal yields two samples/RNAseq libraries, the isolated cell-type-specific gene expression levels ("cell expr") and the whole tissue gene expression levels ("tissue expr"). On top of that, we have several time points for each group (different animals in each time point). We want to ask how treatment effects gene expression within each time point, which genes change over time within each group, and which genes are different over time between treatment and control, always accounting for background expression in the tissue to reduce noise.

1. Analysis option 1 is to take the ratio of cell expr/tissue expr and perform DE gene analysis on the ratio values as usual with limma. But, there's some squirrely things that happen to ratios with low-abundance genes, and I'm not certain that limma will enjoy working with ratios of counts when it's used to working with counts. (Maybe this is wrong- I'm of the RNAseq generation, but I know limma was developed for microarrays.)

2. Analysis option 2 is to keep the cell expr data and the tissue expr data separate, perform DE analysis on them individually, and compare the resulting lists of DE genes afterward, noting which genes appear on the cell-specific list and which appear on both the cell-specific and tissue-specific lists.

3. Analysis option 3 is to set up a model matrix that includes the paired cell and tissue samples from each animal and looks at the difference between treatment and control within time points and across time points. I can't wrap my head around what the model matrix would look like. In the case of a within-time point analysis, I could treat the tissue expression of each sample like a baseline expression, but how to test for genes that are differentially regulated between time 0 and time 1 and between treatment and control, accounting for background tissue expression eludes me.

Thanks so much for any thoughts/designs you have!

limma rnaseq rna-seq model-matrix multiple factor design • 1.5k views

ADD COMMENT • link updated 6.7 years ago by Aaron Lun ★ 28k • written 6.7 years ago by smurray ▴ 10

score 2 · Accepted Answer · 2017-08-11

Option 3. Let's start with a simple example that captures most of your experimental design:

treatment <- rep(c("con", "tr"), each=8)
mode <- rep(c("cell", "tissue"), 4)
time <- rep(rep(c("24", "48"), each=4), 2)
animal <- factor(rep(1:8, each=2))

The easiest thing to do is this:

grouping <- paste(treatment, time, mode, sep=".")
design <- model.matrix(~0 + animal + grouping)
design <- design[,!grepl("cell", colnames(design))] # get to full rank

The first 8 coefficients are the animal-specific blocking factors and are not of scientific interest. The last four coefficients represent the log-fold change of the tissue samples over the cell-type specific samples for each treatment/time group. Comparisons between these coefficients will represent comparisons between the log-fold changes, i.e., is the tissue/cell log-fold change the same between treatment and control conditions? This is basically a smarter way of doing option 1 without explicitly computing the ratios beforehand.

Having said all that, there are several mathematical quirks to be aware of. The first is that this approach will happily reject the null hypothesis when one log-fold change is large and the other log-fold change is very large. From a statistical perspective, this is correct as the log-fold changes are indeed different. However, from a biological perspective, this may be less interesting as the qualitative behaviour of the gene does not change between conditions. The second thing to note is that this approach will fail to reject the null hypothesis if you have counts of zero (in the numerator or denominator of the log-fold change calculation) for both conditions. This is because the log-fold change in either condition could be any very large value, such that it becomes impossible to reject the null.

One more warning: the ratio/log-fold change of cell-type-specific expression to whole tissue expression will depend on the composition of the tissue, which may make things difficult to interpret. If the cell composition changes between control and treatment conditions, this will result in changes to the ratio even if the cell type itself has no DE. Presumably, you also have pulldown biases in the cell type-specific samples that are not present in the whole tissue samples. This may cause the background enrichment in the former to differ from the expression profile of the latter, further complicating any comparison of ratios. It's a similar problem to the use of input controls in ChIP-seq, where my attitude is to trust the direction but not the magnitude of ChIP enrichment versus the control.