Counts from recount3 for DESeq2 analysis
1
1
Entering edit mode
@7a8c2d00
Last seen 27 days ago
France

Hi,

As the topic title suggests, I want to use RSE objects from recount3 for differential expression analysis with DESeq2. This is an ongoing project that I took over. Looking at what has already been done, there are two points I wonder about :

1 - I found that the recount3::transform_counts() function (described here) was sometimes applied on the recount3 RSE objects before creating the DESeqDataSets. In the DESeq2 vignette, it is indicated :

The DESeq2 model internally corrects for library size, so transformed or normalized values ​​such as counts scaled by library size should not be used as input

However, transform_counts does take into account the size of the library. Counts scaled with this function should not be used with DESEq2, right?

2 - Sometimes transform_counts was not applied and raw recount3 RSE object was given as input to DESeq2::DESeqDataSet(). I believe this is also an error because the counts directly provided by recount3 are raw base-pair coverage counts (as stated here) and not raw read counts requested by DESeq2 :

As input, the DESeq2 package expects count data as obtained, e.g., from RNA-seq or another high-throughput sequencing experiment, in the form of a matrix of integer values. The value in the i-th row and the j-th column of the matrix tells how many reads can be assigned to gene i in sample j.

In view of points 1- and 2-, I think I have to retrieve the read counts of RSE recount3 objects with the recount3::compute_read_counts() function (described here) then use these objects to create the DEseqDataSets without any prior normalization. Am I right?

Thank you for your consideration,

William

recount3 DESeq2 • 245 views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 6 hours ago
United States

A few years back I took a look at the rse objects from recount and I wrote this function to go from basepair coverage to what I think would be good input to DESeq2:

https://github.com/biodatascience/compbio_src/blob/master/bioc/my_scale_counts.R

ADD COMMENT
2
Entering edit mode

Hi,

This question is related to recount counts in the example experiment from Mike a few years ago (which he alluded to). That led to the introduction in recount (from the _recount2 project) of a function to compute the read counts, for which we have an equivalent in recount3. Which @william.benit found at http://research.libd.org/recount3/reference/compute_read_counts.html. The exact code behind it is: https://github.com/LieberInstitute/recount3/blob/6eb14b844062ebdf45fe5a356577e3ea0483c97e/R/transform_counts.R#L351-L362.

It does the same as https://github.com/biodatascience/compbio_src/blob/master/bioc/my_scale_counts.R although it's tailored for recount3 objects that have an assayName() called raw_counts. We introduced that assayName in recount3 to highlight how the counts provided are raw base-pair coverage counts, not read counts.

Best, Leo

ADD REPLY
1
Entering edit mode

@william.benit

If recount3 has an assay raw_counts, which deals with single-end / paired-end experiments and converts from the coverage to the count values, this is the preferred solution. Just use this assay.

ADD REPLY
0
Entering edit mode

Understood ! Thank you both for your help, I can continue this project now ! :)

ADD REPLY
0
Entering edit mode

Thank you Michael !!! I changed your function a bit based on my data:

my_scale_counts <- function(rse_gene, round=TRUE) {
  cts <- SummarizedExperiment::assay(rse_gene)
  # mapped_read_count includes the count for both reads of a pair
  # here, mapped_read_counts = recount_qc.star.all_mapped_reads_both
  paired <- 2
  x <- (colData(rse_gene)$recount_qc.star.all_mapped_reads_both / paired) / colSums(cts)
  assay(rse_gene) <- t(t(assay(rse_gene)) * x)
  if (round) {
    assay(rse_gene) <- round(assay(rse_gene))
  }
  rse_gene
}

I'm not sure I understood some objects, please, correct me if I'm wrong.

  • cts must be the count matrix of the recount rse object.
  • paired is originally a vector indicating which samples have been paired-end sequenced. I don't have a paired-end column in these rse. They originate from the GTEx and TCGA projects: I'm sure that those of the GTEx are paired-end and I read in an article that this is also the case for those of the TCGA. Also, in each colData of my rse, the "recount_qc.star.all_mapped_reads" column is equal to the "recount_qc.star.all_mapped_reads_both" column. That's why I put paired at 2.

And so, what kind of values do we have in the rse_gene assay now?

Thank you again for you help,

William

ADD REPLY
1
Entering edit mode

These seem to be reasonable changes.

The values in the original assay are coverage values which is roughly read/fragment counts * basepairs. We divide out the sum and multiply so the columns now add to the number of mapped fragments. These values should now correspond to fragment counts. They are _not_ corrected for sequencing depth.

ADD REPLY
1
Entering edit mode

@william.benit: I would encourage you to use http://research.libd.org/recount3/reference/compute_read_counts.html instead of introducing a new function.

ADD REPLY
0
Entering edit mode

Understood ! Thank you both for your help, I can continue this project now ! :)

ADD REPLY

Login before adding your answer.

Traffic: 289 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6