Question: What are the methods to get count data per cell from single cell fastq given only R1, R2, and I1 fastq files?
Asked 8 months ago by Matthew Thornton (USA, Los Angeles, USC):

Hello,

I am starting to process single cell RNA sequencing data, and I noticed that all of the Bioconductor tutorials for single cell (https://f1000research.com/articles/5-2122/v2) start from well-groomed data that is already in a count matrix, with cells as columns and genes as rows. This is pretty far from the output of the instrument, and more should be done to facilitate getting the necessary count data from the main single cell sequencing platforms (Illumina and PacBio). As the output of bcl2fastq, I was given three fastq files: R1 and R2 (paired) data and an index fastq that has the barcodes used to multiplex the samples. After googling extensively, I found that there are not a lot of options; what I see is that people use Cell Ranger (software from 10X Genomics) to do the analysis and then export count data from there. None of this is very satisfactorily explained, despite there being excellent Bioconductor tutorials for single cell data (which all start from well-groomed count data), such as https://www.bioconductor.org/help/course-materials/2017/BioC2017/Day2/Workshops/singleCell/doc/workshop.html.

Cell Ranger uses STAR and it seems like it does more than you would want, if you intend to use the R/Bioconductor software, or process the data in a method similar to what you would do with bulk RNA-seq.

R1, R2: regular paired-end fastqs

I1 (index fastq):

@K00124:391:HWNTHBBXX:3:1101:4219:1309 1:N:0:TTCCCGAT
TTCCCGAT
+
A-A<FA--
@K00124:391:HWNTHBBXX:3:1101:7101:1309 1:N:0:TTCCCGAC
TTCCCGAC
+
AAA<FF--
@K00124:391:HWNTHBBXX:3:1101:7222:1309 1:N:0:GCAGTAGC
GCAGTAGC

What is your method for getting count data given R1, R2, and I1?

What is the best way to export this count data into R? HDF5Array? Which HDF5 files do you use from the output of cellranger count (or aggr)?

Any comments or advice are greatly appreciated and will most likely enrich the community as 10X Genomics increases in popularity. It is not that people aren't already trying to get help; they are just not getting much (https://www.biostars.org/p/356000/).

Thank you

Answer: What are the methods to get count data per cell from single cell fastq given onl
Answered 8 months ago by Aaron Lun (Cambridge, United Kingdom):

> Cell Ranger uses STAR and it seems like it does more than you would want

I would say that CellRanger does the amount of work that is needed to get a count matrix. One should not underestimate the complexity of the 10X sequencing construct, which involves at least four pieces of 10X-specific information split across each read pair:

  • Cell barcode
  • UMI
  • Gene sequence
  • Sample barcode

... not including any Illumina-related pieces. (The sample barcode is technically 10X's design, I believe, so I'm counting that above.) Any pre-processing pipeline has to do a lot of work to get to a count table, e.g., demultiplexing on the sample barcode, matching the cell barcode to the whitelist, aligning the gene sequence, and removing PCR duplicates using the UMIs. Add in the data munging and you end up with something big like CellRanger.
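To make the first of those steps concrete, here is a minimal sketch of pulling the cell barcode and UMI out of an R1 read. It assumes the 10X 3' v2 layout, in which R1 starts with a 16-base cell barcode followed by a 10-base UMI (v3 chemistry uses a 12-base UMI); the function name `split_r1` and the example sequence are made up for illustration and are not part of any package.

```python
from typing import NamedTuple

# Assumed layout (10X 3' v2 chemistry): R1 = 16-base cell barcode + 10-base UMI.
# Set UMI_LEN = 12 for v3 chemistry.
CELL_BARCODE_LEN = 16
UMI_LEN = 10


class ReadTag(NamedTuple):
    cell_barcode: str
    umi: str


def split_r1(r1_sequence: str) -> ReadTag:
    """Split an R1 sequence into its cell barcode and UMI (hypothetical helper)."""
    needed = CELL_BARCODE_LEN + UMI_LEN
    if len(r1_sequence) < needed:
        raise ValueError(f"R1 has {len(r1_sequence)} bases, need at least {needed}")
    return ReadTag(
        cell_barcode=r1_sequence[:CELL_BARCODE_LEN],
        umi=r1_sequence[CELL_BARCODE_LEN:needed],
    )


# Made-up R1 sequence, for illustration only.
tag = split_r1("AAACCTGAGAAACCAT" + "GTTCAGGCTA")
print(tag.cell_barcode, tag.umi)
```

The gene sequence itself sits in R2, which is why the two reads of a pair have to be kept together through alignment.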

> What is your method for getting count data given R1, R2, and I1?

I just use CellRanger. It sounds like you don't want to use it, but the safest bet for pre-processing such a complex data type is to use the software developed by the same company that designs the protocol! However, if you need an R/Bioconductor solution, scPipe is a good place to start.

> What is the best way to export this count data into R?

Importing CellRanger outputs is the bread and butter of DropletUtils. Note that you'll need the BioC-devel version of this package to import count tables from CellRanger version 3.


Thank you very much for your explanation. I would have to install CellRanger on a large multipurpose (academic) Linux cluster. I was hoping to avoid this, as I have STAR installed already. I will give scPipe a try. I will probably also use Cell Ranger locally.

Reply written 8 months ago by Matthew Thornton

Hello! I just asked a question about reading the molecule_info.h5 file into DropletUtils. I will install the development package; that may fix it. Thank you!

Reply written 7 months ago by Matthew Thornton

Answer: What are the methods to get count data per cell from single cell fastq given onl
Answered 8 months ago by swbarnes2:

If you want a pipeline that goes from fastqs to gene counts and is less of a black box than 10x Genomics Cell Ranger, you can use what the McCarroll lab cooked up for Drop-seq:

https://github.com/broadinstitute/Drop-seq/releases

The principle is pretty much the same: get alignments, gene assignments, cell barcodes, and UMIs all together, filter away UMI duplicates, then total everything up for each cell barcode.
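As a rough illustration of that principle (not the Drop-seq tools' or Cell Ranger's actual code), the sketch below takes reads that have already been aligned and assigned to a gene, collapses reads sharing the same cell barcode, gene, and UMI, and totals the distinct UMIs. The function name `count_umis` and the toy records are hypothetical, and UMIs are compared by exact match only; real pipelines typically also merge UMIs that differ by a sequencing error.

```python
from collections import defaultdict


def count_umis(records):
    """Count distinct UMIs per (cell barcode, gene).

    records: iterable of (cell_barcode, umi, gene) tuples, one per aligned,
    gene-assigned read. Returns {cell_barcode: {gene: n_distinct_umis}}.
    """
    seen = defaultdict(set)
    for cell_barcode, umi, gene in records:
        # Reads with the same cell, gene and UMI are treated as PCR duplicates.
        seen[(cell_barcode, gene)].add(umi)

    counts = defaultdict(dict)
    for (cell_barcode, gene), umis in seen.items():
        counts[cell_barcode][gene] = len(umis)
    return dict(counts)


# Toy input, for illustration only.
toy_records = [
    ("CELL_A", "AAAT", "GeneX"),
    ("CELL_A", "AAAT", "GeneX"),  # duplicate of the read above
    ("CELL_A", "CCGT", "GeneX"),
    ("CELL_A", "AAAT", "GeneY"),  # same UMI but different gene: counted
    ("CELL_B", "GGCA", "GeneX"),
]
counts = count_umis(toy_records)
print(counts)
```

The nested dictionary is just a sparse count matrix with genes as rows and cell barcodes as columns, which is the form the downstream R packages expect.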

The major difference between 10x Genomics and Drop-seq is that 10x Genomics cell barcodes all derive from a whitelist, whereas Drop-seq cell barcodes do not. Drop-seq cell barcodes are also prone to being short, which has to be corrected for.
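Having a whitelist is what makes simple barcode error correction possible. The sketch below is a simplified stand-in for that idea, not Cell Ranger's actual procedure: an observed barcode is kept if it is on the whitelist, rescued if exactly one whitelist entry is a single mismatch away, and discarded otherwise. The function names and the toy whitelist are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed: str, whitelist: set):
    """Return the whitelist barcode for `observed`, or None if it cannot be rescued."""
    if observed in whitelist:
        return observed
    # Linear scan for clarity; a real whitelist has far too many entries for this.
    candidates = [
        barcode
        for barcode in whitelist
        if len(barcode) == len(observed) and hamming(observed, barcode) == 1
    ]
    # Rescue only when the single-mismatch neighbour is unambiguous.
    return candidates[0] if len(candidates) == 1 else None


# Toy whitelist, for illustration only.
whitelist = {"AAAACCCC", "GGGGTTTT"}
print(correct_barcode("AAAACCCC", whitelist))  # exact match
print(correct_barcode("AAAACCCG", whitelist))  # one mismatch, rescued
print(correct_barcode("AAAAGGGG", whitelist))  # too far from every entry
```

Without a whitelist, as in Drop-seq, the set of real barcodes has to be inferred from the data itself, which is a harder problem.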

(Also note that newer versions of cellranger do not need a separate index file like that; they just want the R1 and R2 fastqs as produced by the Illumina software.)

Answer: What are the methods to get count data per cell from single cell fastq given onl
Answered 8 months ago by Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia):

You can export count data from Cell Ranger in a compact text format. As an example of Cell Ranger output, see the three supplementary files here:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2759556

These files can be read quickly and simply into R using edgeR::read10X. There are analogous functions, Read10X and read10xCounts, in the Seurat and DropletUtils packages respectively.
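For anyone curious what those three files contain, here is a minimal sketch of a reader written in plain Python; it only illustrates the format and is no replacement for the R functions above. It assumes the usual Cell Ranger triplet: a Matrix Market coordinate file of integer counts (typically matrix.mtx) with genes as rows and barcodes as columns, a tab-separated gene table with the gene symbol in the second column, and one barcode per line. The function names and the tiny in-memory files are made up for illustration.

```python
import io


def read_mtx(handle):
    """Parse a Matrix Market coordinate file of integer counts.

    Returns (n_rows, n_cols, {(row, col): value}) with 0-based indices.
    """
    header = handle.readline()
    if not header.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("not a Matrix Market coordinate file")
    line = handle.readline()
    while line.startswith("%"):  # skip comment lines
        line = handle.readline()
    n_rows, n_cols, n_entries = (int(x) for x in line.split())
    entries = {}
    for line in handle:
        if not line.strip():
            continue
        row, col, value = line.split()
        entries[(int(row) - 1, int(col) - 1)] = int(value)  # file is 1-based
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries


def read_10x_triplet(mtx_handle, genes_handle, barcodes_handle):
    """Combine the three files into {barcode: {gene_symbol: count}}."""
    genes = [ln.rstrip("\n").split("\t")[1] for ln in genes_handle if ln.strip()]
    barcodes = [ln.strip() for ln in barcodes_handle if ln.strip()]
    n_rows, n_cols, entries = read_mtx(mtx_handle)
    if (n_rows, n_cols) != (len(genes), len(barcodes)):
        raise ValueError("matrix dimensions do not match the gene/barcode lists")
    per_cell = {barcode: {} for barcode in barcodes}
    for (row, col), value in entries.items():
        per_cell[barcodes[col]][genes[row]] = value
    return per_cell


# Tiny made-up files held in memory, for illustration only.
mtx_text = (
    "%%MatrixMarket matrix coordinate integer general\n"
    "%\n"
    "3 2 3\n"
    "1 1 5\n"
    "3 1 2\n"
    "2 2 7\n"
)
genes_text = "ID1\tGeneA\nID2\tGeneB\nID3\tGeneC\n"
barcodes_text = "AAAC-1\nTTTG-1\n"

per_cell = read_10x_triplet(
    io.StringIO(mtx_text), io.StringIO(genes_text), io.StringIO(barcodes_text)
)
print(per_cell)
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(...)`, or `gzip.open(..., "rt")` if the files are compressed.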


I will do that. Thank you!!

Reply written 8 months ago by Matthew Thornton