I am starting to process single-cell RNA sequencing data, and I noticed that all of the Bioconductor tutorials for single cell (https://f1000research.com/articles/5-2122/v2) start from well-groomed data that is already in a count matrix, with cells as columns and genes as rows. This is pretty far from the output of the instrument, and more should be done to facilitate getting count data from the main sequencing platforms used for single cell (Illumina and PacBio). As the output of bcl2fastq I was given three fastq files: R1 and R2 (paired reads) and an index fastq that has the barcodes that were used to multiplex the samples. After googling extensively, I found that there are not a lot of options; what I see is that people use Cell Ranger (software from 10X Genomics) to do the analysis and then export count data from there. None of this is very satisfactorily explained, despite there being excellent Bioconductor tutorials for single-cell data (which all start from well-groomed count data), such as https://www.bioconductor.org/help/course-materials/2017/BioC2017/Day2/Workshops/singleCell/doc/workshop.html.
Cell Ranger uses STAR, and it seems to do more than you would want if you intend to use the R/Bioconductor software or to process the data in a way similar to what you would do with bulk RNA-seq.
R1, R2: regular paired-end fastqs
I1: index fastq, first records:
@K00124:391:HWNTHBBXX:3:1101:4219:1309 1:N:0:TTCCCGAT
TTCCCGAT
+
A-A<FA--
@K00124:391:HWNTHBBXX:3:1101:7101:1309 1:N:0:TTCCCGAC
TTCCCGAC
+
AAA<FF--
@K00124:391:HWNTHBBXX:3:1101:7222:1309 1:N:0:GCAGTAGC
GCAGTAGC
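For anyone unfamiliar with the layout, the index records above follow the usual four-line FASTQ pattern (header, sequence, `+` separator, qualities), with the 8-base index appearing both as the sequence and at the end of the header. Here is a minimal Python sketch that tallies how often each index sequence occurs. The function names `read_fastq` and `count_barcodes` are made up for illustration, and the sketch assumes an uncompressed, well-formed file with exactly four lines per record. It is not how Cell Ranger or any Bioconductor package handles this step.

```python
from collections import Counter
from io import StringIO


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream.

    Assumes well-formed input; stops at the first empty header line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def count_barcodes(handle):
    """Count how many reads carry each index sequence (hypothetical helper)."""
    return Counter(seq for _, seq, _ in read_fastq(handle))


# The two complete records from the I1 excerpt above, used as in-memory input.
example = StringIO(
    "@K00124:391:HWNTHBBXX:3:1101:4219:1309 1:N:0:TTCCCGAT\nTTCCCGAT\n+\nA-A<FA--\n"
    "@K00124:391:HWNTHBBXX:3:1101:7101:1309 1:N:0:TTCCCGAC\nTTCCCGAC\n+\nAAA<FF--\n"
)
counts = count_barcodes(example)
for barcode, n in counts.items():
    print(barcode, n)
```

On a real file you would pass an opened file handle instead of the `StringIO` object (or one from `gzip.open(path, "rt")` for a `.fastq.gz`).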
What is your method for getting count data given R1, R2, and I1?
What is the best way to get this count data into R? HDF5Array? Which HDF5 files do you use from the output of cellranger count (or aggr)?
Any comments or advice are greatly appreciated and will most likely enrich the community as 10X Genomics increases in popularity. It is not as though people aren't already trying to get help; they are just not getting much (https://www.biostars.org/p/356000/).