I am currently trying to conduct a differential expression analysis using the recount data. I want to compare the GTEx and TCGA data against a specific set of separate runs I have picked out.
My issue is with the targetSize argument in the scale_counts function. By default it is set to 4e7 for all runs in the RSE object, but should it not be specific to each run, since it represents the number of single-end mapped reads? I would expect the per-sample scaling factor to then be:

(mapped_read_count / (2 if paired_end == TRUE else 1)) / auc

where mapped_read_count is the column in colData(rse).
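To make this concrete, here is a small sketch of the factor I have in mind, using the rse_gene_SRP009615 example object that ships with recount in place of my actual RSE (the auc, mapped_read_count and paired_end columns are the standard recount colData columns):

```r
library("recount")

## Example RSE shipped with recount; my real object would be the
## combined GTEx/TCGA + selected SRA runs.
rse <- rse_gene_SRP009615
cd <- colData(rse)

## Per-sample factor I would have expected:
## (mapped_read_count / (2 if paired-end else 1)) / auc
reads <- cd$mapped_read_count / ifelse(cd$paired_end, 2, 1)
per_sample_factor <- reads / cd$auc
head(per_sample_factor)
```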
My question is: why is the targetSize argument constant across samples? Should it not be specific to each sample?
P.S. I intend to use TMM normalization for differential expression analysis on the scaled counts afterwards, if that's relevant.
Hello,
Thank you for your response; however, I am still confused about some things.
1. When you say scaling to the same AUC, do you mean dividing each sample column by its AUC value, to account for coverage differences between samples?

2. Is accounting for both coverage and mapped reads then redundant (or erroneous?) for subsequent DE analysis, given that the base-pair coverage provided by recount2 can be scaled by the sample AUCs with scale_counts(by='auc') or converted to read counts with read_counts()?
3. When doing PCA of the samples using only expression from genes on chromosomes X and Y, I see very strong batch effects by SRA project ID (SRPXXXXX) and no separation between samples by their labeled sex. This is slightly ameliorated by using scale_counts(by='auc') as opposed to read_counts(); would you still recommend using read_counts()? Below are the PCA plots I am referring to:
PCA of raw base pair coverage of X,Y chromosome genes
PCA of read_counts() of X,Y chromosome genes
PCA of scale_counts(by='auc') of X,Y chromosome genes
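(For reference, a rough sketch of how these PCAs were computed; rse_all and the sex column name are placeholders for my own combined object and metadata:)

```r
library("recount")
library("SummarizedExperiment")

## 'rse_all' is a placeholder for my combined RSE (GTEx/TCGA + selected SRA runs),
## already processed with scale_counts(by = "auc") or read_counts().
sex_genes <- as.character(seqnames(rowRanges(rse_all))) %in% c("chrX", "chrY")
mat <- log2(assay(rse_all)[sex_genes, ] + 1)

## PCA on the samples; color by the (placeholder) sex label, or analogously
## by SRA project ID to look at batch effects.
pca <- prcomp(t(mat), scale. = FALSE)
plot(pca$x[, 1:2], col = factor(colData(rse_all)$sex),
     pch = 19, main = "PCA of chrX/chrY genes")
```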
Hi again,
(1)

No. Say that we have two samples, one with 45 million reads with an average mapped length of 75 base-pairs. The sum of the total coverage (AUC) for sample 1 is 45 million * 75 = 3,375 x 10^6. For sample 2, we have 40 million reads with an average mapped length of 80 base-pairs, so sample 2 has an AUC of 40 million * 80 = 3,200 x 10^6. If we want to make their base-pair coverage comparable, we scale them to a target size of, say, 40 million 100 bp reads (target AUC = 4 x 10^9). To do so, we multiply the current values (current_bp_counts) by target_AUC / current_AUC. At this point, you could plot the coverage base-pair level data for each sample and they would be on the same scale. Or you can compute their mean and use it for other purposes, like we do with derfinder. See https://doi.org/10.1093/nar/gkw852 for more.

Now, if we want to bin the data into reads (figures 7 and 8 of https://f1000research.com/articles/6-1558/v1), we have to divide the base-pair counts by the average target read length. So it becomes current_bp_counts * (target_AUC / current_AUC) / avg_target_read_length. Since target_AUC = avg_target_read_length * targetSize, the math simplifies to current_bp_counts * targetSize / current_AUC. That is what scale_counts() basically does by default. See the "Scaling coverage counts" section in https://f1000research.com/articles/6-1558/v1.
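To make the arithmetic concrete, here is the same example in plain R (the sample values are the hypothetical ones above):

```r
## Two hypothetical samples from the example above
current_AUC <- c(sample1 = 45e6 * 75, sample2 = 40e6 * 80)  # 3.375e9 and 3.2e9

## Target: 40 million reads of 100 bp
targetSize <- 40e6
avg_target_read_length <- 100
target_AUC <- targetSize * avg_target_read_length           # 4e9

## Scaling the base-pair coverage to the common target AUC
bp_scale_factor <- target_AUC / current_AUC

## Binning into reads: divide by the target read length, which
## simplifies to targetSize / current_AUC
read_scale_factor <- bp_scale_factor / avg_target_read_length
all.equal(read_scale_factor, targetSize / current_AUC)
```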
(2)

I don't know what you mean here. Maybe post some pseudo-code to clarify.

In any case, read_counts() takes the coverage base-pair counts and divides them by the average mapped read length. To obtain the average mapped read length, we take the current AUC and divide it by the number of reads mapped (divided by 2 in paired-end scenarios). That's because Rail-RNA soft clips some reads.
(3)

Regardless of whether you use scale_counts() or read_counts(), you'll have to normalize the data using functions from other packages.

Best,
Leonardo
Related more detailed question: https://support.bioconductor.org/p/119399/
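For the TMM step mentioned in the original post, a minimal sketch with edgeR might look like this (rse_scaled is a placeholder for the combined RSE after scale_counts()):

```r
library("recount")
library("edgeR")

## 'rse_scaled' is a placeholder for the combined RSE after scale_counts()
dge <- DGEList(counts = assay(rse_scaled))
dge <- calcNormFactors(dge, method = "TMM")

## From here, proceed with the usual edgeR or limma-voom workflow
```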
Thank you very much, this helps clarify a lot for me.
For question 2, a better way to phrase it would have been: "Why specifically would running scale_counts() and read_counts() together return nonsensical results?"
Definitions

We have

scale_counts(current_bp_counts)
= current_bp_counts * (target_AUC / current_AUC) / avg_target_read_length
= current_bp_counts * targetSize / current_AUC

and

read_counts(current_bp_counts)
= current_bp_counts / avg_observed_read_length
= current_bp_counts / [current_AUC / mapped_reads]
= current_bp_counts * mapped_reads / current_AUC

(where mapped_reads = mapped_read_count / 2 for paired-end)

Derivations
So

read_counts(scale_counts(current_bp_counts))
= read_counts(current_bp_counts * targetSize / current_AUC)
= current_bp_counts * targetSize / current_AUC * mapped_reads / current_AUC
= current_bp_counts * targetSize * mapped_reads / current_AUC^2

Given how we defined target_AUC earlier, this is also equal to:

read_counts(scale_counts(current_bp_counts))
= current_bp_counts * target_AUC * mapped_reads / (avg_target_read_length * current_AUC^2)

If we go back to the initial definition of read_counts(), we also get:

read_counts(scale_counts(current_bp_counts))
= current_bp_counts * targetSize / (current_AUC * avg_observed_read_length)

No matter how you look at it, the end result doesn't make sense (to me). Thus, this composition of functions yields nonsensical results.
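As a quick sanity check of the algebra, with made-up numbers all three forms give the same value:

```r
## Made-up values for one sample
current_bp_counts <- 1e6
current_AUC <- 3.2e9
mapped_reads <- 40e6
targetSize <- 4e7
avg_target_read_length <- 100
target_AUC <- targetSize * avg_target_read_length
avg_observed_read_length <- current_AUC / mapped_reads

form1 <- current_bp_counts * targetSize * mapped_reads / current_AUC^2
form2 <- current_bp_counts * target_AUC * mapped_reads /
    (avg_target_read_length * current_AUC^2)
form3 <- current_bp_counts * targetSize /
    (current_AUC * avg_observed_read_length)

c(form1 = form1, form2 = form2, form3 = form3)  # all three should match
```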
Code

Note that the order doesn't matter (as long as you set round = FALSE in scale_counts(), which is TRUE by default), as shown below:
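A minimal sketch of that check, using the rse_gene_SRP009615 example object from recount (not necessarily the exact code originally posted):

```r
library("recount")

## Example RSE shipped with recount
rse <- rse_gene_SRP009615

## Apply the two transformations in both orders, with rounding turned off
## in scale_counts() so the values stay directly comparable.
a <- read_counts(scale_counts(rse, round = FALSE))
b <- scale_counts(read_counts(rse), round = FALSE)

## Both orders apply the same per-sample multiplicative factors,
## so the resulting count matrices agree.
all.equal(assay(a), assay(b))
```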