Question

recount counts in the example experiment

3

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

I'm trying out the recount package. Looks very useful!

I'm following the example in the vignette:

library(recount)
project_info <- abstract_search('GSE32465')
download_study(project_info$project)
load(file.path(project_info$project, 'rse_gene.Rdata'))
rse_gene

I'm surprised by the number of reads (this experiment has single-end), in the hundreds of millions:

!> colSums(assay(rse_gene))/1e6
 SRR387777 SRR387778 SRR387779 SRR387780 SRR389077 SRR389078 SRR389079 SRR389080
  779.0177  861.1909 1182.2936  929.8780  522.0735  570.7484 1493.5273 1024.2130
 SRR389081 SRR389082 SRR389083 SRR389084
  652.9738  566.4900  686.3686  668.7259

!> rowData(rse_gene)
 DataFrame with 58037 rows and 3 columns
                  gene_id bp_length          symbol
              <character> <integer> <CharacterList>
 1     ENSG00000000003.14      4535          TSPAN6
 2      ENSG00000000005.5      1610            TNMD
 3     ENSG00000000419.12      1207            DPM1
 4     ENSG00000000457.13      6883           SCYL3
 5     ENSG00000000460.16      5967        C1orf112
 ...                  ...       ...             ...
 58033  ENSG00000283695.1        61              NA
 58034  ENSG00000283696.1       997              NA
 58035  ENSG00000283697.1      1184    LOC101928917
 58036  ENSG00000283698.1       940              NA
 58037  ENSG00000283699.1        60         MIR4481

At the SRA the first sample has ~30 million spots:

https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR387777

Second run, 37 million spots:

https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR387778

Also in the counts file, hundreds of millions of reads:

!> colSums(read_tsv("http://duffel.rail.bio/recount/SRP009615/counts_gene.tsv.gz"))/1e6
 Parsed with column specification:
...
 SRR387777 SRR387778 SRR387779 SRR387780 SRR389077 SRR389078 SRR389079 SRR389080
  779.0177  861.1909 1182.2936  929.8780  522.0735  570.7484 1493.5273 1024.2130
 SRR389081 SRR389082 SRR389083 SRR389084
  652.9738  566.4900  686.3686  668.7259

recount • 2.3k views

ADD COMMENT • link updated 7.6 years ago by Leonardo Collado Torres ★ 1.1k • written 7.6 years ago by Michael Love 43k

score 0 · Answer 1 · 2017-05-15

0

Entering edit mode

Leonardo Collado Torres ★ 1.1k

@lcolladotor

Last seen 5 days ago

United States

I double checked the `recount` documentation and looks like I need to add a section describing in more detail the assays()$counts slot for the RangedSummarizedExperiment objects we are providing in the vignette. A short version is included in the help page for scale_counts() but it seems fair to put this in a more prominent place. Anyhow, I'm thinking of making some images and get this done before Bioc2017.

In the meantime, assays()$counts for the gene and exon objects are not read counts directly. For a gene, we sum the base level coverage at every base-pair of exonic sequence. If you take the assays()$counts and divide them by the read length, you should get the number of read counts, which is what you wanted. Now, Rail-RNA does soft-clipping of reads so we include in the metadata the average read length and the resulting number might not necessarily be an integer.

> colSums(assays(rse_gene)$counts) / colData(rse_gene)$avg_read_length / 1e6
SRR387777 SRR387778 SRR387779 SRR387780 SRR389077 SRR389078 SRR389079 SRR389080 SRR389081 SRR389082 SRR389083 SRR389084 
 21.63938  23.92197  32.84149  25.82994  14.50204  15.85412  41.48687  28.45036  18.13816  15.73583  19.06580  18.57572

You can compare this number against the number of reads as reported by SRA or the number of reads mapped by Rail-RNA, and it'll be smaller. We also have the number of reads downloaded by Rail-RNA, which is not always the same as the number of reads reported by SRA.

> summary(colData(rse_gene)$read_count_as_reported_by_sra / 1e6 - colSums(assays(rse_gene)$counts) / colData(rse_gene)$avg_read_length / 1e6)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.409   5.317   7.173   8.717  13.139  15.905 

> summary(colData(rse_gene)$mapped_read_count / 1e6 - colSums(assays(rse_gene)$counts) / colData(rse_gene)$avg_read_length / 1e6)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.464   3.683   4.311   5.878   7.681  12.336 

> summary(colData(rse_gene)$reads_downloaded / 1e6 - colSums(assays(rse_gene)$counts) / colData(rse_gene)$avg_read_length / 1e6)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.409   5.317   7.173   8.717  13.139  15.905

In the vignette and everywhere where we have R code for using recount, we encourage users to use `scale_counts()` which by default will normalize the gene/exon coverage data to read counts for a library size of 40 million 100 bp aligned reads. The colSums() might not be 40 million since not all aligned reads overlap exonic sequence.

Best,

Leo

ADD COMMENT • link 7.6 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Thanks Leo. Of course, it was right there in the vignette I just didn't get that far before posting here :)

"The recount project records the sum of the base level coverage for each gene (or exon). These raw counts have to be scaled and there are several ways in which you can choose to do so. The function scale_counts() helps you scale them in a way that is tailored to Rail-RNA output."

In theory I prefer just:

my_scale_counts <- function(rse_gene, round=TRUE) {
  assays(rse_gene)$counts <- t(t(assays(rse_gene)$counts) / 
                               (rse_gene$avg_read_length))
  if (round) {
    assays(rse_gene)$counts <- round(assays(rse_gene)$counts)
  }
  rse_gene
}

...to scale_counts(rse_gene) if the library sizes happen to be very different than 40 million. In particular if some subset of the libraries are much less than 40 million and others much more. The relative precision information could be useful for the NB GLM. In practice, how much does it matter for the majority of experiments? Probably not much.

ADD REPLY • link 7.6 years ago Michael Love 43k

0

Entering edit mode

Hehe, even I forgot that it was on the vignette :P Thanks Mike!

When I wrote scale_counts() I imagined that a paper/blog post could be written about different ways to scale the base-pair coverage counts. Plus some way to normalize the data when working with samples from different SRA projects.

ADD REPLY • link 7.6 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

I edited my code chunk above to include the number of reads. Does that look right to you?

ADD REPLY • link 7.6 years ago Michael Love 43k

0

Entering edit mode

The `avg_read_length` is the average read length reported by SRA, which is the length of the 2 pairs for paired-end data. So instead of multiplying by `numreads`, I would divide by it to get the fragment counts (a paired-end read would give you 2 fragments) or leave as-is for paired-end read counts. The paper supplement describes the variables in colData() https://www.nature.com/nbt/journal/v35/n4/extref/nbt.3838-S1.pdf (pages 4 and 5).

> m <- all_metadata()

> m[which(m$paired_end)[1], ]
DataFrame with 1 row and 21 columns
      project      sample  experiment         run read_count_as_reported_by_sra
  <character> <character> <character> <character>                     <integer>
1   DRP000499   DRS001140   DRX001120   DRR001622                      28923418
  reads_downloaded proportion_of_reads_reported_by_sra_downloaded paired_end
         <integer>                                      <numeric>  <logical>
1         28923418                                              1       TRUE
  sra_misreported_paired_end mapped_read_count        auc sharq_beta_tissue
                   <logical>         <integer>  <numeric>       <character>
1                      FALSE          27336347 2008826370                NA
  sharq_beta_cell_type biosample_submission_date biosample_publication_date
           <character>               <character>                <character>
1                   NA   2012-06-01T09:05:13.320    2012-05-30T00:00:00.000
    biosample_update_date avg_read_length geo_accession  bigwig_file       title
              <character>       <integer>   <character>  <character> <character>
1 2014-11-12T03:29:43.000             152            NA DRR001622.bw          NA
  characteristics
  <CharacterList>
1              NA

From https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=DRR001622 you can see that this sample has 14.5 million paired-end reads (each read in pair is 76 bp long). In the colData(), there's almost 30 million single-end reads reported by SRA, downloaded and about 27.3 million mapped. The paired-end average read length is 152 bp.

ADD REPLY • link 7.6 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Great, thanks again. I'm going to edit my function above to remove numreads then.

ADD REPLY • link 7.6 years ago Michael Love 43k

0

Entering edit mode

hi Leo,

Thanks for your help with the above. This is a really useful resource.

I had one small note I wanted to tack on: you might consider specifying the genome() of the rowRanges in a next iteration.

The supplementary materials from the Nature Biotechnology paper say GENCODE v24 which implies GRCh38?

library(recount)
proj <- "SRP001540"
download_study(proj)
load(file.path(proj,"rse_gene.Rdata"))
!> seqinfo(rowRanges(rse_gene))
 Seqinfo object with 25 sequences (1 circular) from an unspecified genome; no seqlengths:
   seqnames seqlengths isCircular genome
   chr1           <NA>       <NA>   <NA>
   chr2           <NA>       <NA>   <NA>
   chr3           <NA>       <NA>   <NA>
   chr4           <NA>       <NA>   <NA>
   chr5           <NA>       <NA>   <NA>
   ...             ...        ...    ...
   chr21          <NA>       <NA>   <NA>
   chr22          <NA>       <NA>   <NA>
   chrX           <NA>       <NA>   <NA>
   chrY           <NA>       <NA>   <NA>
   chrM           <NA>       TRUE   <NA>

ADD REPLY • link 7.6 years ago Michael Love 43k

0

Entering edit mode

It's basically hg38 (PRI) + chrEBV. You can find the chr information at https://github.com/nellore/runs/blob/master/gtex/hg38.sizes and chrEBV is there https://github.com/nellore/runs/blob/master/gtex/hg38.sizes#L40. I did use that file myself but I forgot to add the genome() https://github.com/leekgroup/recount-website/blob/master/rse/create_rse.R#L370-L376. Like you said, it'd be good to add this for a next iteration.

ADD REPLY • link 7.6 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

awesome, thanks for the confirmation. i'm using recount rse's for a comp bio course, which is why i'm diving into the details :)