I'll start by apologising for my ignorance. I'm a computational biologist by trade but not a bioinformatician; I mostly do computational modelling. I've spent weeks going around in circles with Bioconductor tutorials and I can't seem to make any headway.
What I'm trying to do, in a nutshell: I'm trying to load the Gene Expression Omnibus series GSE242423 as a suitable R object (I'm assuming that would be a SingleCellExperiment object) so I can do some basic pre-processing and spit out a reduced data set for a computational model I'm building.
The problem is I just can't seem to import it. I think it has a non-standard format? Can anyone point me to a tutorial where this format is imported? The download is a zipped folder of files, each of which seems to be a sparse matrix (3-column TSV). There are lots of good tutorials for various Bioconductor packages out there, but I'm falling at the first hurdle of importing the data into something those packages use.
You need to convert the sparse matrix to the Csparse format for downstream analysis to make sense.
mx <- as(mm, "CsparseMatrix")
and then build the SCE with this. Official reference beyond my personal opinion: https://github.com/MarioniLab/DropletUtils/blob/devel/R/read10xCounts.R#L283
Unfortunately, this seems to cause my computer to hang in limbo forever. Should this conversion really take upwards of 20 minutes?
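For what it's worth, here is a minimal sketch of building the column-compressed matrix straight from a 3-column triplet table, which may avoid a separate conversion step. The column order (gene index, cell index, count), the 1-based indexing, and the toy values are all assumptions for illustration; the actual layout of the GSE242423 files needs to be checked first.

```r
library(Matrix)

# Toy stand-in for one 3-column TSV file.
# ASSUMED layout: gene index, cell index, count (1-based, no header).
triplets <- data.frame(
  gene  = c(1L, 3L, 2L, 3L, 1L),
  cell  = c(1L, 1L, 2L, 3L, 4L),
  count = c(5, 2, 7, 1, 3)
)
# For a real file this might be read with something like:
# triplets <- read.delim("path/to/file.tsv", header = FALSE,
#                        col.names = c("gene", "cell", "count"))

# sparseMatrix() returns a column-compressed (CsparseMatrix) object by default.
# If the file uses 0-based indices, pass index1 = FALSE.
mx <- sparseMatrix(
  i = triplets$gene,
  j = triplets$cell,
  x = triplets$count,
  dims = c(max(triplets$gene), max(triplets$cell))
)

print(class(mx))
print(dim(mx))
```

The resulting matrix can then be handed to `SingleCellExperiment(assays = list(counts = mx))`, assuming the SingleCellExperiment package is installed.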
The paper I'm working from suggests removing all cells where the UMI count is less than 2000. I assume we don't have enough data for UMI counts. Could we approximate using read counts? I don't suppose there would be an easy way to do this before applying this conversion? The data set is pretty huge. I'm guessing that if I could figure out which column of the sparse matrix holds the cell barcodes, I could just filter based on duplicate numbers?
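One way to filter before building the matrix, sketched under assumptions: if the second column of the triplet table is the cell index, the per-cell total can be summed directly from the triplets and low-count cells dropped up front. Whether the values are UMI counts or read counts depends on how the files were generated, which is not established here. The column names and toy numbers below are made up for illustration.

```r
library(Matrix)

# Toy triplets. ASSUMED layout: gene index, cell index, count.
triplets <- data.frame(
  gene  = c(1L, 2L, 1L, 3L, 2L),
  cell  = c(1L, 1L, 2L, 3L, 3L),
  count = c(1500, 900, 300, 1000, 1200)
)

min_total <- 2000  # threshold quoted from the paper

# Sum counts per cell index straight from the triplets.
totals <- tapply(triplets$count, triplets$cell, sum)
keep_cells <- as.integer(names(totals)[totals >= min_total])

# Keep only entries from surviving cells, then re-index the cells
# so that the matrix columns are contiguous.
kept <- triplets[triplets$cell %in% keep_cells, ]
kept$cell <- match(kept$cell, keep_cells)

mx <- sparseMatrix(
  i = kept$gene, j = kept$cell, x = kept$count,
  dims = c(max(triplets$gene), length(keep_cells))
)
print(colSums(mx))
```

Keeping `keep_cells` around makes it possible to map the filtered columns back to the original barcodes later.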
Thank you, this is really, really helpful. For the purpose of analysing this as a time series, would it be better to have each time point as a different object? Loaded as multiple assays, or bound into one single assay, which is what I think you were suggesting?
You will presumably want to identify cell types and then do a pseudo-bulk analysis within one or more cell types. In that case it's simpler to have one object in which you cluster and identify the cells. You can then generate pseudo-bulk data and subset out the different cell types for differential expression analysis.
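As a rough illustration of the one-object approach, the sketch below column-binds two toy time points, keeps the time point as a per-cell label, and sums counts within each cluster and time point combination to get pseudo-bulk profiles. The cluster labels are hypothetical (in practice they would come from a clustering step), and this uses only the Matrix package; Bioconductor has dedicated functions for this, for example `aggregateAcrossCells` in scuttle.

```r
library(Matrix)

set.seed(1)
# Two toy time points sharing the same 4 genes (rows), 3 cells each.
t0 <- Matrix(matrix(rpois(12, 2), nrow = 4), sparse = TRUE)
t1 <- Matrix(matrix(rpois(12, 2), nrow = 4), sparse = TRUE)

# One combined matrix, with the time point stored as a per-cell label.
counts <- cbind(t0, t1)
timepoint <- rep(c("t0", "t1"), times = c(ncol(t0), ncol(t1)))

# HYPOTHETICAL cluster labels, one per cell.
cluster <- c("A", "B", "A", "A", "B", "B")

# Pseudo-bulk: sum counts over all cells in each cluster/time point group.
group <- factor(paste(cluster, timepoint, sep = "_"))
pseudobulk <- sapply(levels(group), function(g) {
  rowSums(counts[, group == g, drop = FALSE])
})
print(pseudobulk)
```

Each column of `pseudobulk` is one cluster at one time point, so subsetting by cell type afterwards is a matter of picking columns.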
No, what I really want is to A) filter out "bad cells" and B) identify the most variable genes along the time series (down to maybe 200 or so), and then maybe do a bit of primitive tracking of how they change over time. This entire step is just pre-processing to ready a much smaller data set (in terms of genes) for gene regulatory network inference. (Currently likely to use this piece of software: https://doi.org/10.1371/journal.pcbi.1010962)
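A bare-bones sketch of step B, assuming a simple library-size normalisation and ranking genes by the variance of their log-transformed values. This is a deliberately crude stand-in; Bioconductor functions such as `modelGeneVar` and `getTopHVGs` in scran model the mean-variance trend properly. The toy data, the time point labels, and `n_top` are invented for illustration (the real target would be around 200 genes).

```r
library(Matrix)

set.seed(42)
n_genes <- 50
n_cells <- 30
counts <- Matrix(matrix(rpois(n_genes * n_cells, 3), nrow = n_genes),
                 sparse = TRUE)
rownames(counts) <- paste0("gene", seq_len(n_genes))
timepoint <- rep(c("t0", "t1", "t2"), each = 10)  # toy labels

# Library-size normalisation followed by log1p, keeping the matrix sparse.
size_factors <- colSums(counts) / mean(colSums(counts))
logcounts <- log1p(counts %*% Diagonal(x = 1 / size_factors))

# Per-gene sample variance without densifying: E[x^2] - E[x]^2, rescaled.
row_mean <- rowMeans(logcounts)
row_var <- (rowMeans(logcounts^2) - row_mean^2) * n_cells / (n_cells - 1)

n_top <- 10  # would be ~200 on the real data
top_idx <- order(row_var, decreasing = TRUE)[seq_len(n_top)]
top_genes <- rownames(counts)[top_idx]

# Primitive tracking: mean log-expression of each top gene per time point.
top <- logcounts[top_idx, , drop = FALSE]
trajectory <- sapply(unique(timepoint), function(tp) {
  rowMeans(top[, timepoint == tp, drop = FALSE])
})
rownames(trajectory) <- top_genes
print(round(trajectory, 2))
```

The `trajectory` matrix (genes by time points) is the kind of reduced table that could then be exported for the network inference step.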