I'm trying out the recount package. Looks very useful!
I'm following the example in the vignette:
library(recount) project_info <- abstract_search('GSE32465') download_study(project_info$project) load(file.path(project_info$project, 'rse_gene.Rdata')) rse_gene
I'm surprised by the number of reads (this experiment has single-end), in the hundreds of millions:
!> colSums(assay(rse_gene))/1e6 SRR387777 SRR387778 SRR387779 SRR387780 SRR389077 SRR389078 SRR389079 SRR389080 779.0177 861.1909 1182.2936 929.8780 522.0735 570.7484 1493.5273 1024.2130 SRR389081 SRR389082 SRR389083 SRR389084 652.9738 566.4900 686.3686 668.7259 !> rowData(rse_gene) DataFrame with 58037 rows and 3 columns gene_id bp_length symbol <character> <integer> <CharacterList> 1 ENSG00000000003.14 4535 TSPAN6 2 ENSG00000000005.5 1610 TNMD 3 ENSG00000000419.12 1207 DPM1 4 ENSG00000000457.13 6883 SCYL3 5 ENSG00000000460.16 5967 C1orf112 ... ... ... ... 58033 ENSG00000283695.1 61 NA 58034 ENSG00000283696.1 997 NA 58035 ENSG00000283697.1 1184 LOC101928917 58036 ENSG00000283698.1 940 NA 58037 ENSG00000283699.1 60 MIR4481
At the SRA the first sample has ~30 million spots:
https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR387777
Second run, 37 million spots:
https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR387778
Also in the counts file, hundreds of millions of reads:
!> colSums(read_tsv("http://duffel.rail.bio/recount/SRP009615/counts_gene.tsv.gz"))/1e6 Parsed with column specification: ... SRR387777 SRR387778 SRR387779 SRR387780 SRR389077 SRR389078 SRR389079 SRR389080 779.0177 861.1909 1182.2936 929.8780 522.0735 570.7484 1493.5273 1024.2130 SRR389081 SRR389082 SRR389083 SRR389084 652.9738 566.4900 686.3686 668.7259
Thanks Leo. Of course, it was right there in the vignette I just didn't get that far before posting here :)
"The recount project records the sum of the base level coverage for each gene (or exon). These raw counts have to be scaled and there are several ways in which you can choose to do so. The function scale_counts() helps you scale them in a way that is tailored to Rail-RNA output."
In theory I prefer just:
...to
scale_counts(rse_gene)
if the library sizes happen to be very different than 40 million. In particular if some subset of the libraries are much less than 40 million and others much more. The relative precision information could be useful for the NB GLM. In practice, how much does it matter for the majority of experiments? Probably not much.Hehe, even I forgot that it was on the vignette :P Thanks Mike!
When I wrote scale_counts() I imagined that a paper/blog post could be written about different ways to scale the base-pair coverage counts. Plus some way to normalize the data when working with samples from different SRA projects.
I edited my code chunk above to include the number of reads. Does that look right to you?
The `avg_read_length` is the average read length reported by SRA, which is the length of the 2 pairs for paired-end data. So instead of multiplying by `numreads`, I would divide by it to get the fragment counts (a paired-end read would give you 2 fragments) or leave as-is for paired-end read counts. The paper supplement describes the variables in colData() https://www.nature.com/nbt/journal/v35/n4/extref/nbt.3838-S1.pdf (pages 4 and 5).
From https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=DRR001622 you can see that this sample has 14.5 million paired-end reads (each read in pair is 76 bp long). In the colData(), there's almost 30 million single-end reads reported by SRA, downloaded and about 27.3 million mapped. The paired-end average read length is 152 bp.
Great, thanks again. I'm going to edit my function above to remove numreads then.
hi Leo,
Thanks for your help with the above. This is a really useful resource.
I had one small note I wanted to tack on: you might consider specifying the genome() of the rowRanges in a next iteration.
The supplementary materials from the Nature Biotechnology paper say GENCODE v24 which implies GRCh38?
It's basically hg38 (PRI) + chrEBV. You can find the chr information at https://github.com/nellore/runs/blob/master/gtex/hg38.sizes and chrEBV is there https://github.com/nellore/runs/blob/master/gtex/hg38.sizes#L40. I did use that file myself but I forgot to add the genome() https://github.com/leekgroup/recount-website/blob/master/rse/create_rse.R#L370-L376. Like you said, it'd be good to add this for a next iteration.
awesome, thanks for the confirmation. i'm using recount rse's for a comp bio course, which is why i'm diving into the details :)
Awesome Mike! If you post the slides online let me know ^^.