Question: Merging RNAseq Datasets of Different Depths
0
14 months ago by
liam.salleh0 wrote:

Hi,

I have two small single cell datasets containing RNASeq data from 12, and 20 cells respectively, which were manually isolated.

Our hope is to potentially project this data onto another, annotated scRNA-seq dataset which should contain these cells, to see where our data falls into the other dataset and hopefully do some differential expression against other celtypes.

However, the first thing I'd need to do is to combine the two aforementioned datasets, which were isolated and sequenced at different times, as well as at different depths, and I was wondering what the best way to combine these datasets was.

I have used Salmon (separately for each dataset) for the quantification, followed by tximport with the hopes of later using this data as a SingleCellExperiment or Seurat object, in order to make use of some projection functions.

I searched this forum and found these two posts which seem relevant to my question:

Single cells batch effects

Pool Single-cells: Analysis

Would reading the data in from both salmon runs using tximport (with countsFromAbundance set to "lengthScaledTPM" as described here) and then using the txi\$counts as the count matrix to create a SingleCellExperiment object, followed by limma batch correction be appropriate here?

Thanks,

Liam

modified 14 months ago by Aaron Lun24k • written 14 months ago by liam.salleh0
Answer: Merging RNAseq Datasets of Different Depths
1
14 months ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

I don't use salmon or tximport, but assuming you get some kind of count matrix from them, it should be simple enough to proceed:

1. Create a SingleCellExperiment object from the count matrix. I don't know if there's any consequences from running salmon separately for the two batches compared to running them all together; if there is, you should consider re-running salmon on all samples to avoid unnecessary batch effects arising from how you get from the reads to the counts.
2. Run normalize to obtain log-transformed normalized values. There are a few functions to compute size factors, but I'm not sure how useful they will be here. computeSumFactors really needs more cells (at least 100 to yield stable estimates), while computeSpikeFactors needs, well, spike-ins. If you run normalize without providing any size factors, it will compute size factors from the library sizes. This is probably your only option, as inaccurate as the library sizes are for estimating cell-specific bias. (Well, perhaps it's not so much of a problem if you only have cells of the same type, as there shouldn't be any composition biases.)
3. Run removeBatchEffect to remove the batch effect. This assumes that the composition of cells is the same between batches, which should be true of replicates. If this is not the case, you could consider using mnnCorrect, though you would have to turn down k considerably due to the low numbers of cells involved.

Combining your dataset with the published data is yet another kettle of fish. Assuming that your cell type exists in the published dataset, a simple nearest neighbors approach would probably be sufficient to identify the matching cells in the latter. As in, just identify the nearest neighbouring published cell(s) to each of your cells, and that is the putative corresponding population.

Finally,a direct DE analysis between your data and the published data would be suspect due to the batch effects involved. It would be a lot safer to use your data to identify the matching published cell type, and then do a within-batch DE analyses with that cell type.

1

Just want to add to Aaron's excellent answer that you may try the scmap Bioconductor package to match your cells to published data.

Hi Davide,

Thank you for the response. Yes I previously ran across scmap and do also intend to use it in this case. Would you happen to be aware if my small number of cells would be problematic for using scmap? I have been doing some reading and so far it seems like it should be fine.

Hi Aaron,

Thanks for the in depth response.

I'll try find out if performing separate runs of salmon would result in batch effects but in the meantime, I just wanted to clarify a couple of things:

When you say to use a nearest neighbours approach do you mean, mnnCorrect? or something more simple just to identify the most similar population, but without the batch correction (such as scmap perhaps?)

If you do mean mnnCorrect, why would a DE analysis between the datasets be suspect due to to batch effects if methods such as mnnCorrect and runCCA are for batch correction in order to carry out downstream analyses such as differential expression?

Identifying a similar subpopulation in the published data would be good, but ideally, we would really like to be able to do some other downstream analyses such as differential expression between the datasets as we are quite certain our cells our what we think, due to the method of isolation, and the marker genes for our manually isolated cells and perhaps the GO terms from said genes are what we are interested in. As our cell population is all one cell type, we obviously need the published dataset to compare to, to find said marker genes.

Also, if you didn't mean mnnCorrect, is there a reason you didn't suggest it for the batch correction between our manually isolated cells and the published dataset? and If it's not appropriate to merge the published dataset with our two, is there even another possible way to profile to find marker genes for our cells in this case? Simply looking at the most highly expressed genes just results in a lot of ribosomal genes etc.

Sorry for the long winded reply.

1

For your first question: I literally meant k-nearest neighbours. No need for MNNs if you're sure that all cells of the query population are also present in the reference population.

For your second question: the most obvious reason is that the corrected values are not on the same scale as the original log-expression values. mnnCorrect yields cosine-normalized values for which the differences cannot be interpreted as log-fold changes in magnitude, and runCCA yields... well, canonical components. Now, all of this is fine if you're interested in the relative differences between cells (e.g., clustering, t-SNE visualization, trajectory reconstruction), but for a DE analysis, you really want to know the log-fold change, and this is no longer possible with the corrected values. And that's not even considering the more subtle distortions of the distribution and mean-variance relationship resulting from the correction procedure, which would make parametric DE methods (and even non-parametric ones) not happy.

Hence our recommendation to use the corrected values for clustering and the like, and then use the identified clusters (e.g., cell types) for a DE analysis on the original data, with the batch effects as blocking factors. In your case, it doesn't make sense to do an interaction because you should only have one cell type in your query set. Thus, you can only model the batch as an additive factor, and if that's the case, the fitted values for the cells in the query batch would be fully determined by the batch term. Hence the pointlessness of doing the DE analysis between cells in different batches in this specific situation.

Hi Aaron,

Regarding runCCA , I was under the impression that because it corrects a lower dimensional subspace rather than the expression matrix itself, downstream differential expression analyses were an option. The Satija lab also has a tutorial using runCCA in which they perform differential expression analyses after merging two datasets here (Although I'm not quite sure if it's appropriate for our data anyway, as I think it might require datasets that contain the same cell types across both, whereas we obviously only have one shared cell type between datasets).

If we were to just find nearest neighbours, and cluster our cells with the reference dataset, would scmap and mnnCorrect be fine for these purposes?

Finally, after the clustering, if we want to look at interesting genes expressed by our cells (not the ones in the reference dataset), is there an unbiased method to do so if we can't do differential expression against other cells, besides simply looking at the most highly expressed genes? (many of which are ribosomal, making it hard). There are some genes we suspect we might see/not see but would rather not just cherry pick said genes and check for read counts significantly over 0.

I am new to RNA-seq and the only other thing I could think of to look at our genes without cherry picking specific things, would be to maybe convert all our genes to human orthologs, get a list of mitochondrial/ribosomal genes to remove from our data (I could not find gene lists for mitochondrial and ribosomal genes in our organism) and then look at highly expressed genes?

Thanks again,

Liam

1

1) I'm not familiar with the exact details of Seurat's workflow, but I don't think there's any contradiction with what I've already said. Differential expression analysis on the corrected low-dimensional values would be very difficult to interpret; you wouldn't get DE genes, you'd get DE components. Differential expression on the original expression values, using the newly found cluster assignments with a blocking term for the batch-of-origin, is much more sensible - this seems to be what they're doing with the meta-analysis.

2) You'll have to be more precise. Do you want to merge your dataset with the reference, and then re-cluster the merged results (mnnCorrect)? Or do you want to use an existing clustering of the reference dataset, and identify the cluster(s) to which your dataset belong (scmap and related k-NN mapping approaches)?

3) No, there is no way - at all - to distinguish between biological and batch effects in your desired comparison (other than to use the reference cells instead). This is a classic case of confounding variables, which we take great care to avoid during experimental design for bulk RNA-seq experiments. But it doesn't matter - if you accept that the mapping works correctly, then the cells in the reference are equivalent to the cells in your dataset sans the batch effect, so you should just be able to use the former in your DE analysis.

Hi Aaron,

At the moment I am more concerned with the feasibility of using a dataset such as ours, with the different methods.

i.e. I would be interested in looking at the output of both mnnCorrect and scmap to see where my cells group, but I am worried about whether the small homogeneous cell population violates any of the assumptions that either method might have as I am assuming they were developed with larger, heterogeneous scRNA-seq datasets in mind.

It seems to me that mnnCorrect should be fine, but it will just remove any differences between my cells and the nearest reference cells as batch effects in turn making them identical and DE analysis after that is not possible anyway as you say. From reading, it seems like scmap should also be OK.

For my last point, I was more wondering if there is another approach besides differential expression analyses I could use to get information from my cells, without comparison to another population (in the case that DE via runCCA is not applicable). For example, simply looking at the genes with the highest counts after somehow removing housekeeping genes, or even cherry picking specific genes, taking count numbers above some threshold to mean the gene is being expressed? - although I am not sure about the scientific accuracy of this.

I would prefer to explore the data from our small dataset rather than the reference as, though our cells should exist in the reference, they may just be very similar, not exactly the same, whereas we are highly certain our cells are what we say they are due to the way they were isolated.

Liam

1

Regarding the size of your population: at least for mnnCorrect, the number of neighbours to consider is determined by the size of the subpopulations within the dataset. The default of k=20 means that we expect at least 20 cells in any given subpopulation. This is not a problem in your case, as you have a combined 32 cells of the same type for mapping (and presumably the reference dataset has much more and is not of concern here).

Regarding "getting information from your cells": well, unfortunately, your current situation is the inevitable result of your (lack of) experimental design. Most definitions of cell types/states are relative, which fundamentally means that you need to compare your population with other cells to characterize them. With your experiment, a meaningful comparison to cells from other datasets is difficult due to the batch effects. Now, as I mentioned, batch correction does allow you to perform DE analyses, but you don't want to use that, so that leaves you out of options.

There are a few established markers for which the expression is binary across cell types (e.g., CD3 in T cells, CD19 in B cells), and you could use them to characterize your cells in an "absolute" sense, i.e., based on whether they have detectable expression of these genes or not. However, these should be so obvious that they would not add anything to what you already know about your cells.

Hi Aaron,

Thanks for all the replies,

So just to clarify, in the case that runCCA does not allow for differential expression between our cells and the reference dataset, our options include:

mnnCorrect for reclustering our cells with reference dataset

scmap for projection onto the reference dataset

and DE analysis between cell types within the reference dataset?

Liam

1

I'm loathe to comment on the capabilities of runCCA - if you want definitive answers about that function, you should contact the Seurat authors directly. But the options you have listed are essentially correct, aside from the fact that mnnCorrect does not do any clustering itself (it produces corrected values that can then be used in clustering, via hclust or buildSNNGraph or friends).

To sum up: if you want to perform a DE analysis between your cells and the cell populations in the reference data set, you can only do so by (i) identifying the reference population that matches to your cells, and (ii) assuming that the differences in expression between your cells and the matched reference population are solely caused by batch effects. I would say that it is impossible to work around this assumption, regardless of what package or function you are using. This is because any biological differences between the matched reference and your cells would be confounded with batch - at that point, it's game over.

OK thanks for the clarification.

Yes I have previously contacted the Seurat authors but they are not currently replying to emails so I will be trying some of the options you suggested while I wait for their reply, thanks for the advice.