Search
Question: Variance between runs of Rsubread
0
2.0 years ago by
University of Melbourne
Isaac Virshup0 wrote:

Hello,

I've noticed some differences between the count tables generated by repeated runs of RSubread. Most of the differences are a read or two, but for a few particular genes (so far, only ones in repetitive areas) the discrepancies can be hundreds of counts.

Is this expected behavior? If so, has there been any profiling which explores where/ how much variance can be expected?

This behavior was first noticed with RSubread 1.20.6 from bioconductor using R v3.2.2, but can be replicated with the current release.

The command to align follows the following convention:

align(index=indexname,
input_format="FASTQ",
output_format="BAM",
output_file=outfile,
unique=TRUE,
nBestLocations=1,
maxMismatches=5,
indels=5,
nthreads=6)

Here is the command used to generate count tables:

featureCounts(inputfiles,
annot.ext=gtf,
isGTFAnnotationFile=T,
useMetaFeatures=T,
isPairedEnd=ispaired,
allowMultiOverlap=T,
minMQS=30,
ignoreDup=T)

Though md5 sums of the bam files output by align change between runs. In all test cases R was built with gcc v4.4.7.

Thanks,

–Isaac

UPDATE

The discrepancies I'm asking about were from alignments made using a single install of RSubread. Essentially, I would like to know if the output of RSubread is deterministic, or if there is some stochasticity. My impression had been the program was deterministic, though results I'm seeing don't match that.

Here's a gist containing the contents of the \$stats field for a couple runs using a single install of release Rsubread. For each run output looks similar, though counts for each entry do vary by a few.

modified 2.0 years ago • written 2.0 years ago by Isaac Virshup0

There could be slight differences in counting results between different Rsubread versions. But you need to provide more details, including your commands, versions and counting summaries, so that the differences you observed can be better understood.

I've updated the question to include commands and versions used. I'm not sure what you mean by counting summaries, would you mind clarifying?

0
2.0 years ago by
Wei Shi2.9k
Australia
Wei Shi2.9k wrote:

The difference you observed is likely to be caused by the changes in the align() function. We made improvement on the reporting of multi-mapping reads (this is why you saw differences for genes in repetitive regions) and also on the reporting of mapping quality scores.

The 'stat' component included in the object returned by featureCounts includes counting summary, which gives the breakdown of counting results for different groups of reads (multimapping reads, reads passing MQS cutoff etc). This summary is helpful for you to understand what caused the counting differences between your different runs.

Sorry, but I think there has a been a misunderstanding about my question. I'm seeing differences between alignments generated from a single install of RSubread. I'll add an update to my question to clarify this.

Could you run sessionInfo() command and provide its output?

Session info associated with the linked gist above:

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

loaded via a namespace (and not attached):
[1] tools_3.3.1

Hi Isaac, could you provide the bam file so we can take a close look? Could you also let us know which genome you mapped your reads to?

The alignment was against GRCh37.69. How should I send you a link to the files?

You may send us the files via dropbox.

Not sure if there's quite enough space in my dropbox at the moment... I've uploaded them to CloudStor (should be similar) here: https://cloudstor.aarnet.edu.au/plus/index.php/s/gh2AdmTmhr6Vdox

Wei, have you had a chance to retrieve those files? I'd like to take those files down from my cloudstor drive.

Best,

–Isaac

Wei,

On the recommendation of a co-worker I tried running subread using a single thread to see if that produced deterministic output. I've run it a few times now and have been getting the same output – while multi-threaded output still varies by run. I ran this using conda installed R and Rsubread using the human genome from ensembl release 76. Hope this helps resolve this issue!

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Rsubread_1.23.0

Hi Isaac, this will be fixed in the next version which should be released soon.

That's great! Thanks for looking into this.

Will the new version be available with the next release of bioconductor or will it be out separately?