How to load scRNA-seq data from GEO
@61b63a9f
United Kingdom

I'll start by apologising for my ignorance. I'm a computational biologist by trade but not a bioinformatician; I mostly do computational modelling. I've spent weeks going around in circles with Bioconductor tutorials and I can't seem to make any headway.

What I'm trying to do, in a nutshell: I'm trying to load the Gene Expression Omnibus series GSE242423 as a suitable R object (I'm assuming that would be a SingleCellExperiment object) so I can do some basic preprocessing and spit out a reduced data set for a computational model I'm building.

The problem is I just can't seem to import it. I think it has a non-standard format? Can anyone point me to a tutorial where this format is imported? The download is a zipped folder of files, each of which seems to be a sparse matrix (3-column TSV). There are lots of good tutorials for various Bioconductor packages out there, but I'm falling at the first hurdle of importing the data into something those packages use.

SingleCellExperiment GEO • 2.7k views
@james-w-macdonald-5106
United States

There are three things you need: the barcodes, the Matrix Market file, and the row data. I downloaded the first set of barcodes and the associated Matrix Market file, as well as the genes file.

> library(SingleCellExperiment)
> library(Matrix)
> mm <- readMM("GSM7763419_D0.matrix.mtx.gz")
> bc <- read.table("GSM7763419_D0.barcodes.tsv.gz")
> gns <- read.table("GSE242423_scRNA_genes.tsv.gz")
> head(gns)
               V1          V2   V3
1 ENSG00000243485 MIR1302-2HG Gene
2 ENSG00000237613     FAM138A Gene
3 ENSG00000186092       OR4F5 Gene
4 ENSG00000238009  AL627309.1 Gene
5 ENSG00000239945  AL627309.3 Gene
6 ENSG00000239906  AL627309.2 Gene
          V4
1 Expression
2 Expression
3 Expression
4 Expression
5 Expression
6 Expression
> names(gns)[1:2] <- c("ENSEMBL","SYMBOL")
> head(gns)
          ENSEMBL      SYMBOL   V3
1 ENSG00000243485 MIR1302-2HG Gene
2 ENSG00000237613     FAM138A Gene
3 ENSG00000186092       OR4F5 Gene
4 ENSG00000238009  AL627309.1 Gene
5 ENSG00000239945  AL627309.3 Gene
6 ENSG00000239906  AL627309.2 Gene
          V4
1 Expression
2 Expression
3 Expression
4 Expression
5 Expression
6 Expression
> se <- SingleCellExperiment(assays = list(counts = mm), colData = DataFrame(barcodes = as(bc, "DataFrame")), rowData = as(gns[,1:2], "DataFrame"))
> se
class: SingleCellExperiment 
dim: 36601 2517452 
metadata(0):
assays(1): counts
rownames: NULL
rowData names(2): ENSEMBL SYMBOL
colnames: NULL
colData names(1): barcodes.V1
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

I only used the first Matrix Market and barcode files, but you can always read in more, cbind the count matrices (cells are in columns) and rbind the barcode data frames to make a larger SingleCellExperiment object.
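
A minimal sketch of combining several samples, assuming each sample has its own matrix and barcode file and all samples share one gene list. The tiny files written here are synthetic stand-ins, and the sample names D0 and D1 and the read_sample helper are placeholders rather than anything taken from the GEO archive.

```r
library(Matrix)

# Write two tiny synthetic samples to a temp folder (stand-ins for the GEO files).
tmp <- tempfile("geo_demo_")
dir.create(tmp)
samples <- c("D0", "D1")   # placeholder sample names
for (s in samples) {
  m <- sparseMatrix(i = c(1, 3, 5), j = c(1, 2, 4), x = c(2, 7, 1), dims = c(5, 4))
  writeMM(m, file.path(tmp, paste0(s, ".matrix.mtx")))
  writeLines(paste0("BC", 1:4), file.path(tmp, paste0(s, ".barcodes.tsv")))
}

# Hypothetical helper: read one sample and label its columns.
read_sample <- function(s) {
  mm <- readMM(file.path(tmp, paste0(s, ".matrix.mtx")))
  bc <- readLines(file.path(tmp, paste0(s, ".barcodes.tsv")))
  # Prefix barcodes with the sample name so they stay unique after merging.
  colnames(mm) <- paste(s, bc, sep = "_")
  mm
}

mats <- lapply(samples, read_sample)
counts <- do.call(cbind, mats)   # genes stay in rows, cells are appended as columns
sample_id <- rep(samples, vapply(mats, ncol, integer(1)))
dim(counts)
```

The combined counts matrix, together with a colData column built from sample_id, can then be passed to SingleCellExperiment() in the same way as above.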

You need to convert the sparse matrix to the CsparseMatrix (compressed column) format for downstream analysis to make sense: mx <- as(mm, "CsparseMatrix"), and then build the SCE with this. Official reference beyond my personal opinion: https://github.com/MarioniLab/DropletUtils/blob/devel/R/read10xCounts.R#L283
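
For reference, a small sketch of that conversion on a made-up matrix. readMM() returns a triplet-format matrix, and as() takes the object first and the target class second.

```r
library(Matrix)

# Toy triplet-format matrix standing in for the result of readMM().
mm <- as(sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(5, 1, 2),
                      dims = c(3, 3)),
         "TsparseMatrix")
is(mm, "TsparseMatrix")         # triplet (coordinate) storage

mx <- as(mm, "CsparseMatrix")   # compressed sparse column storage
is(mx, "CsparseMatrix")
```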

Unfortunately this seems to cause my computer to hang in limbo forever. Should this conversion really take upwards of 20 minutes?

The paper I'm working from suggests removing all cells where the UMI count is less than 2000. I assume we don't have enough data for UMI counts. Could we approximate using read counts? I don't suppose there would be an easy way to do this before applying this conversion? The data set is pretty huge. I'm guessing that if I could figure out which column of the sparse matrix holds the cell barcodes, I could just filter based on duplicate numbers?

Thank you, this is really, really helpful. For the purpose of analysing this as a time series, would it be better to have each time point as a different object? Or loaded as multiple assays, or bound into one single assay, which is what I think you were suggesting?

You will presumably want to identify cell types and then do a pseudo-bulk analysis within one or more cell types, in which case it's simpler to have one object in which you cluster and identify the cells. You can then generate pseudo-bulk data and subset out the different cell types for differential expression analysis.

No, what I really want is to A) filter out "bad cells" and B) identify the most variable genes along the time series (down to maybe 200 or so), and then maybe do a bit of primitive tracking of how they change over time. This entire step is just preprocessing to prepare a much smaller data set (in terms of genes) for gene regulatory network inference (currently likely to use this piece of software: https://doi.org/10.1371/journal.pcbi.1010962).
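
One crude way to sketch step B, assuming a filtered genes-by-cells count matrix is already in memory: scale each cell to a common library size, log-transform, rank genes by variance and keep the top n. This is only a stand-in for dedicated tools such as scran::modelGeneVar(), which model the mean-variance trend; the toy data and the top_variable_genes helper are hypothetical.

```r
library(Matrix)

# Hypothetical helper: rank genes by the variance of log-normalised counts.
top_variable_genes <- function(counts, n = 200) {
  lib <- colSums(counts)
  # Scale every cell to the median library size; the product stays sparse.
  norm <- counts %*% Diagonal(x = median(lib) / lib)
  logn <- log1p(norm)
  # Row variances without densifying: E[x^2] - E[x]^2, rescaled to the sample variance.
  m <- rowMeans(logn)
  v <- (rowMeans(logn^2) - m^2) * ncol(logn) / (ncol(logn) - 1)
  head(order(v, decreasing = TRUE), n)
}

# Toy data: 200 genes x 40 cells of Poisson noise, with genes 1-3 switching on and off.
set.seed(42)
dense <- matrix(as.numeric(rpois(200 * 40, lambda = 5)), nrow = 200)
dense[1:3, seq(1, 40, by = 2)] <- 0
dense[1:3, seq(2, 40, by = 2)] <- 40
toy <- Matrix(dense, sparse = TRUE)

hvg <- top_variable_genes(toy, n = 3)
sort(hvg)
```

The returned row indices can be used to subset the count matrix (and the gene table) down to the selected genes before handing the data to the network inference step.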

@gordon-smyth
WEHI, Melbourne, Australia

Alternatively you can read the data into the edgeR package. See ?read10X for a simple example.

This expands the data to an ordinary (dense) matrix rather than a sparse one, no? So it won't scale with the size of the dataset at hand here at all.

Yes, it produces a dense matrix. My lab has published many papers using 10x scRNA-seq and dense matrix analyses. This includes datasets that are larger than that linked to by OP, for example the two papers cited below for which I was joint corresponding author. Sparse matrices have their own complications and the memory saving that they offer is limited if downstream analyses require dense matrices. Even when sparse matrices are used, I find it helpful to read the individual 10x files into dense matrices for initial QC and exploration.

Pal B, Chen Y, Milevskiy MJG, Vaillant F, Prokopuk L, Dawson C, Capaldo BD, Song X, Jackling F, Timpson P, Lindeman GJ, Smyth GK, Visvader JE (2021). Single cell transcriptome atlas of mouse mammary epithelial cells across development. Breast Cancer Research 23(1), 69.

Pal B, Chen Y, Vaillant F, Capaldo BD, Joyce R, Song X, Bryant VL, Penington JS, Di Stefano L, Ribera NT, Wilcox S, Mann GB, kConFab, Papenfuss AT, Lindeman GJ, Smyth GK, Visvader JE (2021). A single-cell RNA expression atlas of normal, preneoplastic and tumorigenic states in the human breast. EMBO Journal 40(11), e107333.

I am not arguing for or against sparse matrices, but simply want to make OP aware that it does not scale. I have a 128GB workstation and it cannot load this dataset with the read10X function from edgeR:

library(edgeR) # 4.4.2

y <- read10X(
  mtx = "GSM7763419_D0.matrix.mtx.gz",
  genes = "GSE242423_scRNA_genes.tsv.gz", 
  barcodes = "GSM7763419_D0.barcodes.tsv.gz"
)

# > Error: cannot allocate vector of size 343.3 Gb
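
The size in that error can be reproduced by hand, assuming the dense matrix is stored as 4-byte integers and using the dimensions reported earlier in the thread (36601 genes by 2517452 barcodes). The variable names are just for illustration.

```r
n_genes <- 36601
n_cells <- 2517452
bytes_per_value <- 4   # assumption: R integer storage

# Size of a dense genes-by-cells matrix in units of 1024^3 bytes,
# the unit R labels "Gb" in its allocation errors.
gib <- n_genes * n_cells * bytes_per_value / 1024^3
round(gib, 1)          # 343.3
```

A double-precision matrix of the same dimensions would need twice that.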

Yes, that's a fair point and I completely agree that read10X() is not usable for this dataset. The files turn out to be much larger than I was expecting. The problem is that the authors have uploaded the unfiltered files from CellRanger instead of the filtered files that CellRanger provides for most purposes. IMO this is unhelpful because it means that more than 99% of the file consists of cells with almost no reads, which cannot be included in any sensible downstream analysis. The first mtx file contains data from >2.5 million cells, whereas the filtered file would probably have contained only 10,000 or so.

I can confirm my workstation definitely does not have over 300GB of RAM for this project.

I suggest you use readMM() as advised by James, but then do a massive amount of cell filtering before doing any other conversions or creating more elaborate data objects. There is lots of advice about single cell filtering on the internet. For a detailed example from my lab, see:

Chen Y, Pal B, Lindeman GJ, Visvader JE, Smyth GK (2022). R code and downstream analysis objects for the scRNA-seq atlas of normal and tumorigenic human breast tissue. Scientific Data 9(1), 96. https://www.nature.com/articles/s41597-022-01236-2
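
As a rough sketch of that filtering step, assuming the matrix entries are UMI counts (as they normally are in CellRanger output) and using the 2000-UMI threshold mentioned earlier in the thread: per-barcode totals can be computed directly on the triplet matrix returned by readMM(), and only the surviving columns are then converted. The toy matrix, barcode labels and variable names are illustrative.

```r
library(Matrix)

# Toy stand-in for readMM() output: 4 genes x 6 barcodes in triplet format.
mm <- as(sparseMatrix(i = c(1, 2, 3, 4, 1, 2), j = c(1, 1, 2, 4, 5, 5),
                      x = c(1500, 900, 3, 1, 2500, 10), dims = c(4, 6)),
         "TsparseMatrix")
barcodes <- paste0("BC", seq_len(ncol(mm)))

min_umi <- 2000           # threshold quoted from the paper by the OP
totals <- colSums(mm)     # total counts per barcode, computed on the sparse matrix
keep <- which(totals >= min_umi)

# Subset first, then convert the much smaller matrix to column-compressed form.
filtered <- as(mm[, keep, drop = FALSE], "CsparseMatrix")
colnames(filtered) <- barcodes[keep]
dim(filtered)
```

Total counts are only one criterion; the paper cited above describes additional cell-level QC that could be applied to the reduced matrix.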
