We have created a new experimental data package called 'seqc'. It includes gene-level read count data generated by the SEQC (SEquencing Quality Control) project, which is the third stage of the well-known MAQC project (a US FDA initiative). The SEQC/MAQC-III Consortium produced benchmark RNA-seq data for the assessment of RNA sequencing technologies and data analysis methods (published recently in Nature Biotechnology, http://www.ncbi.nlm.nih.gov/pubmed/25150838).
Sequence reads were aligned to human reference genome hg19 using the Subread aligner and were then summarized to genes using the featureCounts program. This package includes the gene-level read count data for 2,758 libraries. It can be downloaded from the following link (188MB):
In addition to the read count data, this package also includes exon-exon junction data generated for human brain reference RNA and universal human reference RNA samples. Exon-exon junctions were detected by using the Subjunc aligner.
Moreover, TaqMan RT-PCR validation data for ~1000 genes and ERCC spike-in sequence data are included in this package as well.
We hope this package is a useful resource for the community.
Thanks a lot for processing and annotating the data in the way that you have. This will be a super useful resource ... especially since I already have a need for it ;-)
I've created some helper functions that allow you to create a (semi-decently) annotated ExpressionSet from the data given some user specified criteria and put it in the gist here. Perhaps something like this would be useful to include in the package?
You would use it like so:
## Fetch all of the RefSeq data from all centers and sequencing platforms:
R> e <- seqc.eSet('gene', 'refseq')
R> head(pData(e))
   platform sample replicate lane  flowcell center
1|      ILM      A         1  L01 FlowCellA    AGR
2|      ILM      A         1  L01 FlowCellB    AGR
3|      ILM      A         1  L02 FlowCellA    AGR
4|      ILM      A         1  L02 FlowCellB    AGR
5|      ILM      A         1  L03 FlowCellA    AGR
6|      ILM      A         1  L03 FlowCellB    AGR
R> with(pData(e), table(platform, center))
        center
platform AGR BGI CNL COH LIV MAY MGP NVS NWU NYU PSU SQW
     ILM 256 384 360 128   0 384   0 320   0   0   0   0
     LIF   0   0   0   0  50   0   0   0 285   0 288 288
     ROC   0   0   0   0   0   0   4   0   0   4   0   4
## Fetch just the Illumina RefSeq data from all centers:
R> ilm <- seqc.eSet('gene', 'refseq', 'ILM')
R> with(pData(ilm), table(platform, center))
        center
platform AGR BGI CNL COH MAY NVS
     ILM 256 384 360 128 384 320
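For anyone who wants to see what the platform filter amounts to without loading the package, here is a minimal base-R sketch. It assumes the sample annotation behaves like an ordinary data frame with character columns; the toy table `pd` and its values are made up for illustration and are not the real SEQC annotation.

```r
# Toy stand-in for pData(e); the rows are invented for illustration only.
pd <- data.frame(
  platform = c("ILM", "ILM", "ILM", "LIF", "LIF", "ROC"),
  center   = c("AGR", "AGR", "BGI", "LIV", "NWU", "MGP"),
  stringsAsFactors = FALSE
)

# Cross-tabulate libraries by platform and sequencing center.
all_tab <- with(pd, table(platform, center))

# Keep only the Illumina rows, then tabulate again.
# Because the columns are character vectors (not factors), centers with
# no Illumina libraries disappear from the second table.
ilm_pd  <- pd[pd$platform == "ILM", , drop = FALSE]
ilm_tab <- with(ilm_pd, table(platform, center))

print(all_tab)
print(ilm_tab)
```

If the columns were factors instead, the empty centers would still show up as zero columns unless the unused levels were dropped first (for example with `droplevels()`).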
Currently I've only implemented this parsing/aggregating for gene-level features (i.e. no junction or TaqMan data), but I can add those later if you think these would be helpful to include in the package.
That was quick! Thanks for incorporating that ... I of course now feel compelled to round off the functionality so that one could get ExpressionSets for all of the data. I'll let you know when the gist is updated with that ...
Happy to incorporate them when your code is updated! It would be helpful if you could provide .Rd files as well ...