I've noticed some differences between the count tables generated by repeated runs of RSubread. Most of the differences are a read or two, but for a few particular genes (so far, only ones in repetitive areas) the discrepancies can be hundreds of counts.
Is this expected behavior? If so, has there been any profiling which explores where/ how much variance can be expected?
This behavior was first noticed with RSubread 1.20.6 from bioconductor using R v3.2.2, but can be replicated with the current release.
The command to align follows the following convention:
align(index=indexname, readfile1=readfile1, readfile2=readfile2, input_format="FASTQ", output_format="BAM", output_file=outfile, unique=TRUE, nBestLocations=1, maxMismatches=5, indels=5, nthreads=6)
Here is the command used to generate count tables:
featureCounts(inputfiles, annot.ext=gtf, isGTFAnnotationFile=T, useMetaFeatures=T, isPairedEnd=ispaired, allowMultiOverlap=T, reportReads=T, minMQS=30, nthreads=ncpu, ignoreDup=T)
Though md5 sums of the bam files output by
align change between runs. In all test cases R was built with gcc v4.4.7.
The discrepancies I'm asking about were from alignments made using a single install of RSubread. Essentially, I would like to know if the output of RSubread is deterministic, or if there is some stochasticity. My impression had been the program was deterministic, though results I'm seeing don't match that.
Here's a gist containing the contents of the
$stats field for a couple runs using a single install of release Rsubread. For each run output looks similar, though counts for each entry do vary by a few.