News: Experimental data package 'seqc'
4.9 years ago, Wei Shi (Australia) wrote:

We have created a new experimental data package called 'seqc'. It includes gene-level read count data generated by the SEQC (SEquencing Quality Control) project, which is the third stage of the well-known MAQC project (a US FDA initiative). The SEQC/MAQC-III Consortium produced benchmark RNA-seq data for the assessment of RNA sequencing technologies and data analysis methods (recently published in Nature Biotechnology: http://www.ncbi.nlm.nih.gov/pubmed/25150838).

Sequence reads were aligned to the human reference genome hg19 using the Subread aligner and were then summarized to genes using the featureCounts program. This package includes the gene-level read count data for 2,758 libraries. It can be downloaded from the following link (188 MB):

http://bioconductor.org/packages/release/data/experiment/html/seqc.html
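
The package can also be installed and inspected from within R. The lines below are a minimal sketch, assuming a Bioconductor release where BiocManager is the installer; the variable name `has_seqc` is only illustrative. The install call is left commented out so that nothing is downloaded unintentionally.

```r
# To install (uncomment to run; downloads the data package):
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("seqc")

# Check whether the package is available before trying to load it.
has_seqc <- requireNamespace("seqc", quietly = TRUE)

if (has_seqc) {
    library(seqc)
    # Show the first few objects the package makes available.
    print(head(ls("package:seqc")))
} else {
    message("Package 'seqc' is not installed; see the commented lines above.")
}
```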

In addition to the read count data, this package also includes exon-exon junction data generated for human brain reference RNA and universal human reference RNA samples. Exon-exon junctions were detected using the Subjunc aligner.

Moreover, TaqMan RT-PCR validation data for ~1000 genes and ERCC spike-in sequence data are included in this package as well.

We hope this package is a useful resource for the community.

Wei

4.9 years ago, Steve Lianoglou (Denali) wrote:

Thanks a lot for processing and annotating the data in the way that you have. This will be a super useful resource ... especially since I already have a need for it ;-)

I've created some helper functions that build a (semi-decently) annotated ExpressionSet from the data, given some user-specified criteria, and put them in the gist here. Perhaps something like this would be useful to include in the package?

You would use it like so:

## Fetch all of the RefSeq data from all centers and sequencing platforms:
R> e <- seqc.eSet('gene', 'refseq')
   platform sample replicate lane  flowcell center
1|      ILM      A         1  L01 FlowCellA    AGR
2|      ILM      A         1  L01 FlowCellB    AGR
3|      ILM      A         1  L02 FlowCellA    AGR
4|      ILM      A         1  L02 FlowCellB    AGR
5|      ILM      A         1  L03 FlowCellA    AGR
6|      ILM      A         1  L03 FlowCellB    AGR

R> with(pData(e), table(platform, center))
        center
platform AGR BGI CNL COH LIV MAY MGP NVS NWU NYU PSU SQW
     ILM 256 384 360 128   0 384   0 320   0   0   0   0
     LIF   0   0   0   0  50   0   0   0 285   0 288 288
     ROC   0   0   0   0   0   0   4   0   0   4   0   4

## Fetch just the Illumina RefSeq data from all centers:
R> ilm <- seqc.eSet('gene', 'refseq', 'ILM')
R> with(pData(ilm), table(platform, center))
        center
platform AGR BGI CNL COH MAY NVS
     ILM 256 384 360 128 384 320

Currently I've only implemented this parsing/aggregating for gene-level features (i.e. no junction or TaqMan data), but I can add those later if you think these would be helpful to include in the package.
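
To illustrate the kind of parsing involved, here is a minimal base-R sketch that turns library names into a sample annotation table like the one printed above. It is not the code from the gist: the function `parse_libs` and the naming scheme `sample_replicate_lane_flowcell` are assumptions made for illustration, and the actual column names in the package may be laid out differently.

```r
# Hypothetical library names; the real column names in 'seqc' may differ.
libs <- c("A_1_L01_FlowCellA", "A_1_L01_FlowCellB", "B_2_L03_FlowCellA")

# Split each name on "_" and assemble one annotation row per library.
# 'platform' and 'center' are supplied by the caller because, in this
# assumed scheme, they are not encoded in the library name itself.
parse_libs <- function(libs, platform, center) {
    parts <- strsplit(libs, "_", fixed = TRUE)
    stopifnot(all(lengths(parts) == 4L))  # expect exactly four fields
    fields <- do.call(rbind, parts)       # character matrix, one row per library
    data.frame(
        platform  = platform,
        sample    = fields[, 1],
        replicate = as.integer(fields[, 2]),
        lane      = fields[, 3],
        flowcell  = fields[, 4],
        center    = center,
        row.names = libs,
        stringsAsFactors = FALSE
    )
}

pd <- parse_libs(libs, platform = "ILM", center = "AGR")
print(pd)

# Same cross-tabulation as in the session above, on the toy annotation.
print(with(pd, table(platform, center)))
```

A table built this way could then serve as the phenotype data of an ExpressionSet, with the matching count columns as the expression matrix.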

4.9 years ago, Wei Shi (Australia) wrote:

Thanks for the code, Steve. I have just added the functions to the package and committed the change to the svn devel repository ...

That was quick! Thanks for incorporating that ... I of course now feel compelled to round off the functionality so that one could get ExpressionSets for all of the data. I'll let you know when the gist is updated with that ...

Happy to incorporate them when your code is updated! It would be helpful if you could provide .Rd files as well ...