Search
Question: Variance between runs of Rsubread
0
gravatar for Isaac Virshup
13 months ago by
University of Melbourne
Isaac Virshup0 wrote:

Hello,

I've noticed some differences between the count tables generated by repeated runs of RSubread. Most of the differences are a read or two, but for a few particular genes (so far, only ones in repetitive areas) the discrepancies can be hundreds of counts.

Is this expected behavior? If so, has there been any profiling which explores where/ how much variance can be expected?

This behavior was first noticed with RSubread 1.20.6 from bioconductor using R v3.2.2, but can be replicated with the current release.

The command to align follows the following convention:

align(index=indexname,
    readfile1=readfile1,
    readfile2=readfile2,
    input_format="FASTQ",
    output_format="BAM",
    output_file=outfile,
    unique=TRUE,
    nBestLocations=1,
    maxMismatches=5,
    indels=5,
    nthreads=6)

Here is the command used to generate count tables:

featureCounts(inputfiles, 
    annot.ext=gtf, 
    isGTFAnnotationFile=T, 
    useMetaFeatures=T, 
    isPairedEnd=ispaired, 
    allowMultiOverlap=T, 
    reportReads=T, 
    minMQS=30, 
    nthreads=ncpu, 
    ignoreDup=T)

Though md5 sums of the bam files output by align change between runs. In all test cases R was built with gcc v4.4.7.

Thanks,

–Isaac

UPDATE

The discrepancies I'm asking about were from alignments made using a single install of RSubread. Essentially, I would like to know if the output of RSubread is deterministic, or if there is some stochasticity. My impression had been the program was deterministic, though results I'm seeing don't match that.

Here's a gist containing the contents of the $stats field for a couple runs using a single install of release Rsubread. For each run output looks similar, though counts for each entry do vary by a few.

ADD COMMENTlink modified 13 months ago • written 13 months ago by Isaac Virshup0

There could be slight differences in counting results between different Rsubread versions. But you need to provide more details, including your commands, versions and counting summaries, so that the differences you observed can be better understood.

ADD REPLYlink written 13 months ago by Wei Shi2.7k

I've updated the question to include commands and versions used. I'm not sure what you mean by counting summaries, would you mind clarifying?

ADD REPLYlink written 13 months ago by Isaac Virshup0
0
gravatar for Wei Shi
13 months ago by
Wei Shi2.7k
Australia
Wei Shi2.7k wrote:

The difference you observed is likely to be caused by the changes in the align() function. We made improvement on the reporting of multi-mapping reads (this is why you saw differences for genes in repetitive regions) and also on the reporting of mapping quality scores.

The 'stat' component included in the object returned by featureCounts includes counting summary, which gives the breakdown of counting results for different groups of reads (multimapping reads, reads passing MQS cutoff etc). This summary is helpful for you to understand what caused the counting differences between your different runs.

ADD COMMENTlink written 13 months ago by Wei Shi2.7k

Sorry, but I think there has a been a misunderstanding about my question. I'm seeing differences between alignments generated from a single install of RSubread. I'll add an update to my question to clarify this.

ADD REPLYlink written 13 months ago by Isaac Virshup0

Could you run sessionInfo() command and provide its output?

ADD REPLYlink written 13 months ago by Wei Shi2.7k

Session info associated with the linked gist above:

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.22.3

loaded via a namespace (and not attached):
[1] tools_3.3.1
ADD REPLYlink modified 13 months ago • written 13 months ago by Isaac Virshup0

Hi Isaac, could you provide the bam file so we can take a close look? Could you also let us know which genome you mapped your reads to?

ADD REPLYlink written 13 months ago by Wei Shi2.7k

The alignment was against GRCh37.69. How should I send you a link to the files?

ADD REPLYlink written 13 months ago by Isaac Virshup0

You may send us the files via dropbox.

ADD REPLYlink written 13 months ago by Wei Shi2.7k

Not sure if there's quite enough space in my dropbox at the moment... I've uploaded them to CloudStor (should be similar) here: https://cloudstor.aarnet.edu.au/plus/index.php/s/gh2AdmTmhr6Vdox

ADD REPLYlink written 13 months ago by Isaac Virshup0

Wei, have you had a chance to retrieve those files? I'd like to take those files down from my cloudstor drive.

Best,

–Isaac

ADD REPLYlink written 12 months ago by Isaac Virshup0

Yes we have downloaded the files. Thanks.

ADD REPLYlink written 12 months ago by Wei Shi2.7k

Wei,

On the recommendation of a co-worker I tried running subread using a single thread to see if that produced deterministic output. I've run it a few times now and have been getting the same output – while multi-threaded output still varies by run. I ran this using conda installed R and Rsubread using the human genome from ensembl release 76. Hope this helps resolve this issue!

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.23.0
ADD REPLYlink written 10 months ago by Isaac Virshup0

Hi Isaac, this will be fixed in the next version which should be released soon.

ADD REPLYlink written 10 months ago by Wei Shi2.7k

That's great! Thanks for looking into this.

Will the new version be available with the next release of bioconductor or will it be out separately?

ADD REPLYlink written 10 months ago by Isaac Virshup0

We aim to release it in the next few weeks. Cheers, Wei

ADD REPLYlink written 10 months ago by Wei Shi2.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 146 users visited in the last hour