Question

featureCounts to RangedSummarizedExperiment

0

Entering edit mode

karl.nordstrom • 0

@karlnordstrom-11313

Last seen 7.7 years ago

Dear Developers,

thanks for this software!

Is there any thought of introducing the windowed mode into the subread software featureCounts. Perhaps this is the wrong forum to ask :) Of course defining the windows isn't too problematic.

Anyhow, what I really would appreciate would be a constructor parsing an featureCounts file and returning a RangedSummarizedExperiment object that could be used with csaw.

Best,

Karl

windowcounts featurecounts • 1.4k views

ADD COMMENT • link updated 7.7 years ago by Aaron Lun ★ 28k • written 7.7 years ago by karl.nordstrom • 0

score 2 · Accepted Answer · 2016-08-18

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 2 hours ago

The city by the bay

The literal answer to your question is that it's very easy to construct a RangedSummarizedExperiment from the output of featureCounts (at least, if you're calling the latter from Rsubread):

reg.out <- fc.out$annotation
win.coords <- GRanges(reg.out$Chr, IRanges(reg.out$Start, reg.out$End))
rse <- SummarizedExperiment(fc.out$counts, win.coords)

I assume you did counting without meta-feature summarization (as that wouldn't make much sense for windows) so Start and End should just be numeric values rather than comma-separated lists.

Now, for the actual answer to your question, some extra fields have to be added to rse to make it behave like the output of windowCounts when you use the downstream functions in csaw. At the very least, you'll need to fill in rse$totals to specify the sequencing depth of each library, and also metadata(rse)$rlen to specify the final extension length for each read (if you performed directional extension, otherwise set it to the read length). I would say that, if you want to perform an analysis in csaw, it's probably easier to stick with the csaw functions unless you know what you're doing. windowCounts mightn't be the fastest, but it does the job.

ADD COMMENT • link 7.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

I think library(GenomicRanges); as(reg.out, "GRanges") will work (makeGRangesFromDataFrame() if one wants more control) and might be a shorter way to create win.coords.

ADD REPLY • link 7.7 years ago Martin Morgan 25k

0

Entering edit mode

Thanks for the answer,

I figured out how to generate the objects, I think. I didn't go through Rsubread in the end

library(csaw)
library(readr)
nbam=4
data.raw<-read_tsv("featurecounts.fc",col_types=paste(c("cciici",rep("i",nbam)),collapse=""),skip=1)

chrSizes<-tapply(data.raw[["End"]],data[["Chr"]],max)

all.regions<-GRanges(data[["Chr"]], IRanges(data[["Start"]], data[["End"]]))
seqlevels(all.regions) <- names(chrSizes)
seqlengths(all.regions) <- as.vector(chrSizes)

paramlist <- lapply(seq_len(nbam), FUN=function(x) { readParam(pe="none", forward=NA) }) dim(paramlist)<-c(nbam,1) 
colnames(paramlist) <- "param"

data<-SummarizedExperiment(
    assays = SimpleList(counts=as.matrix(data[,-(1:6)])),
    rowRanges = all.regions,
    colData = DataFrame(
        bam.files = colnames(data[-(1:6)]),
        totals = as.vector(colSums(data[,-(1:6)])),
        ext = 1,
        paramlist
      ),
    metadata = list(
        spacing = 50,
        width = 50,
        shift = 0,
        final.ext = 1
      )
)

I only wanted to count start sites, and hence I kept the ext-parameters to 1. The above works, and takes me through the pipeline with satisfying results at the end. That's about as much I can say about the correctness :) For instance, with another setup I assume the param object has to be modified accordingly.

ADD REPLY • link 7.6 years ago karl.nordstrom • 0

0

Entering edit mode

Well, if you keep ext to 1, that implies you're only considering the 5' end of each read during counting (as in the literal 5' base position of the read, and nothing else). By default, featureCounts should count any overlaps between the read and the window, so if you're using the default settings you'd be better off setting ext to the read length. Obviously, if you did any extension with featureCounts, then this should be reflected in the value of ext and final.ext. This value is important for scaling the threshold (computed based on binned counts) during filtering of the relatively smaller windows.

Also, setting totals to the column sums assumes you count each read exactly once into any window, otherwise it won't be quite correct. Storage of incorrect totals will be problematic if you're computing normalization factors based on bin counts to eliminate composition biases, as this relies on the totals being the same across different counting schemes.

I'm still a bit bemused about why you wouldn't want to use windowCounts - it would certainly save you more time trying to figure out whether or not the rest of the code is interacting properly with your custom RangedSummarizedExperiment object, and it's not like you can't save the returned object with saveRDS for use in future R sessions. If you insist on doing it your own way, I can only advise you to proceed with caution. Mind you, if you don't do any csaw-specific filtering or normalization, then you might as well skip directly to creating a DGEList for use with edgeR, and avoid the middleman altogether.

ADD REPLY • link 7.6 years ago Aaron Lun ★ 28k