I am currently trying to conduct a differential expression analysis using the recount data. I want to compare the GTEx and TCGA data against a specific set of separate runs I have picked out.
My issue is with the targetSize argument in the scale_counts function. By default it is set to 4e7 for all runs in the RSE object, but should it not be specific to each run, since it represents the number of single-end mapped reads? I would expect the per-sample scaling factor to then be:

(mapped_read_count / (2 if paired_end == TRUE else 1)) / auc

where mapped_read_count is the column in colData(rse).
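To make this concrete, here is a small sketch of the factor I have in mind, using the rse_gene_SRP009615 example object that ships with recount in place of my actual RSE (the auc, mapped_read_count and paired_end columns are the standard recount colData columns):

```r
library("recount")

## Example RSE shipped with recount; my real object would be the
## combined GTEx/TCGA + selected SRA runs.
rse <- rse_gene_SRP009615
cd <- colData(rse)

## Per-sample factor I would have expected:
## (mapped_read_count / (2 if paired-end else 1)) / auc
reads <- cd$mapped_read_count / ifelse(cd$paired_end, 2, 1)
per_sample_factor <- reads / cd$auc
head(per_sample_factor)
```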
My question is: why is the targetSize argument constant across samples? Should it not be specific to each sample?
P.S. I intend to use TMM normalization for differential expression analysis on the scaled counts afterwards, if that's relevant.
Hello,
Thank you for your response; however, I am still confused about some things.
1. When you say scaling to the same AUC, do you mean dividing each sample column by its AUC value, to account for coverage differences between samples?

2. Is accounting for both coverage and mapped reads then redundant (or erroneous?) for subsequent DE analysis, given that the base-pair coverage provided by recount2 can be scaled by the sample AUCs with scale_counts(by='auc') or converted to read counts with read_counts()?
3. When doing PCA of the samples using only expression from genes on chromosomes X and Y, I see very strong batch effects by SRA project ID (SRPXXXXX) and no separation between samples by their labeled sex. This is slightly ameliorated by using scale_counts(by='auc') as opposed to read_counts(); would you still recommend using read_counts()? Below are the PCA plots I am referring to:
PCA of raw base pair coverage of X,Y chromosome genes
PCA of read_counts() of X,Y chromosome genes
PCA of scale_counts(by='auc') of X,Y chromosome genes
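(For reference, a rough sketch of how these PCAs were computed; rse_all and the sex column name are placeholders for my own combined object and metadata:)

```r
library("recount")
library("SummarizedExperiment")

## 'rse_all' is a placeholder for my combined RSE (GTEx/TCGA + selected SRA runs),
## already processed with scale_counts(by = "auc") or read_counts().
sex_genes <- as.character(seqnames(rowRanges(rse_all))) %in% c("chrX", "chrY")
mat <- log2(assay(rse_all)[sex_genes, ] + 1)

## PCA on the samples; color by the (placeholder) sex label, or analogously
## by SRA project ID to look at batch effects.
pca <- prcomp(t(mat), scale. = FALSE)
plot(pca$x[, 1:2], col = factor(colData(rse_all)$sex),
     pch = 19, main = "PCA of chrX/chrY genes")
```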
Hi again,
(1)

No. Say that we have two samples, one with 45 million reads with an average mapped length of 75 base-pairs. The sum of the total coverage (AUC) for sample 1 is 45 million * 75 = 3,375 x 10^6. For sample 2, we have 40 million reads with an average mapped length of 80 base-pairs, so sample 2 has an AUC of 40 million * 80 = 3,200 x 10^6. If we want to make their base-pair coverage comparable, we scale them to a target size of, say, 40 million 100 bp reads (target AUC = 4 x 10^9). To do so, we multiply the current values (current_bp_counts) by target_AUC / current_AUC. At this point, you could plot the coverage base-pair level data for each sample and they would be on the same scale. Or you can compute their mean and use it for other purposes, like we do with derfinder. See https://doi.org/10.1093/nar/gkw852 for more.

Now, if we want to bin the data into reads (figures 7 and 8 of https://f1000research.com/articles/6-1558/v1), we have to divide the base-pair counts by the average target read length. So it becomes current_bp_counts * (target_AUC / current_AUC) / avg_target_read_length. Since target_AUC = avg_target_read_length * targetSize, the math simplifies to current_bp_counts * targetSize / current_AUC. That is what scale_counts() basically does by default. See the "Scaling coverage counts" section in https://f1000research.com/articles/6-1558/v1.
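To make the arithmetic concrete, here is the same example in plain R (the sample values are the hypothetical ones above):

```r
## Two hypothetical samples from the example above
current_AUC <- c(sample1 = 45e6 * 75, sample2 = 40e6 * 80)  # 3.375e9 and 3.2e9

## Target: 40 million reads of 100 bp
targetSize <- 40e6
avg_target_read_length <- 100
target_AUC <- targetSize * avg_target_read_length           # 4e9

## Scaling the base-pair coverage to the common target AUC
bp_scale_factor <- target_AUC / current_AUC

## Binning into reads: divide by the target read length, which
## simplifies to targetSize / current_AUC
read_scale_factor <- bp_scale_factor / avg_target_read_length
all.equal(read_scale_factor, targetSize / current_AUC)
```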
(2)

I don't know what you mean here. Maybe post some pseudo-code to clarify.

In any case, read_counts() takes the coverage base-pair counts and divides them by the average mapped read length. To obtain the average mapped read length, we take the current AUC and divide it by the number of reads mapped (divided by 2 in paired-end scenarios). That's because Rail-RNA soft clips some reads.
(3)

Regardless of whether you use scale_counts() or read_counts(), you'll have to normalize the data using functions from other packages.

Best,
Leonardo
Related more detailed question: https://support.bioconductor.org/p/119399/
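For the TMM step mentioned in the original post, a minimal sketch with edgeR might look like this (rse_scaled is a placeholder for the combined RSE after scale_counts()):

```r
library("recount")
library("edgeR")

## 'rse_scaled' is a placeholder for the combined RSE after scale_counts()
dge <- DGEList(counts = assay(rse_scaled))
dge <- calcNormFactors(dge, method = "TMM")

## From here, proceed with the usual edgeR or limma-voom workflow
```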
Thank you very much, this helps clarify a lot for me.
For question 2, a better way to phrase it would have been: "Why specifically would running scale_counts() and read_counts() together return nonsensical results?"
Definitions

We have

scale_counts(current_bp_counts)
= current_bp_counts * (target_AUC / current_AUC) / avg_target_read_length
= current_bp_counts * targetSize / current_AUC

and

read_counts(current_bp_counts)
= current_bp_counts / avg_observed_read_length
= current_bp_counts / [current_AUC / mapped_reads]
= current_bp_counts * mapped_reads / current_AUC

(where mapped_reads = mapped_read_count / 2 for paired-end)

Derivations
So

read_counts(scale_counts(current_bp_counts))
= read_counts(current_bp_counts * targetSize / current_AUC)
= current_bp_counts * targetSize / current_AUC * mapped_reads / current_AUC
= current_bp_counts * targetSize * mapped_reads / current_AUC^2

Given how we defined target_AUC earlier, this is also equal to:

read_counts(scale_counts(current_bp_counts))
= current_bp_counts * target_AUC * mapped_reads / (avg_target_read_length * current_AUC^2)

If we go back to the initial definition of read_counts(), we also get:

read_counts(scale_counts(current_bp_counts))
= current_bp_counts * targetSize / (current_AUC * avg_observed_read_length)

No matter how you look at it, the end result doesn't make sense (to me). Thus, this composition of functions yields nonsensical results.
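As a quick sanity check of the algebra, with made-up numbers all three forms give the same value:

```r
## Made-up values for one sample
current_bp_counts <- 1e6
current_AUC <- 3.2e9
mapped_reads <- 40e6
targetSize <- 4e7
avg_target_read_length <- 100
target_AUC <- targetSize * avg_target_read_length
avg_observed_read_length <- current_AUC / mapped_reads

form1 <- current_bp_counts * targetSize * mapped_reads / current_AUC^2
form2 <- current_bp_counts * target_AUC * mapped_reads /
    (avg_target_read_length * current_AUC^2)
form3 <- current_bp_counts * targetSize /
    (current_AUC * avg_observed_read_length)

c(form1 = form1, form2 = form2, form3 = form3)  # all three should match
```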
Code

Note that the order doesn't matter (as long as you set round = FALSE in scale_counts(), which is TRUE by default), as shown below:
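A minimal sketch of that check, using the rse_gene_SRP009615 example object from recount (not necessarily the exact code originally posted):

```r
library("recount")

## Example RSE shipped with recount
rse <- rse_gene_SRP009615

## Apply the two transformations in both orders, with rounding turned off
## in scale_counts() so the values stay directly comparable.
a <- read_counts(scale_counts(rse, round = FALSE))
b <- scale_counts(read_counts(rse), round = FALSE)

## Both orders apply the same per-sample multiplicative factors,
## so the resulting count matrices agree.
all.equal(assay(a), assay(b))
```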